  • cross-file similarities
    • Files often share many segments with other files or with previous versions of the same file
  • divide files into chunks and index the chunks

Design

  • only close-to-open consistency

Chunking

  • Fingerprint every overlapping 48-byte region
    • if the last 13 bits of the fingerprint equal a magic value, place a breakpoint
  • Enforce min/max chunk size
    • 2 KB minimum / 64 KB maximum
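The chunking scheme above can be sketched as follows. This is a minimal illustration: a plain polynomial rolling hash stands in for a true Rabin fingerprint, and the magic value is arbitrary (both assumptions); the 48-byte window, 13-bit test, and 2K/64K bounds follow the notes.

```python
import hashlib

# Content-defined chunking sketch. A polynomial rolling hash stands in
# for a Rabin fingerprint (assumption for brevity).
WINDOW = 48               # sliding-window width in bytes
MASK = (1 << 13) - 1      # test the low 13 bits of the fingerprint
MAGIC = 0x78              # arbitrary magic value for this sketch
MIN_CHUNK = 2 * 1024      # suppress breakpoints closer than this
MAX_CHUNK = 64 * 1024     # force a breakpoint at this size
BASE, MOD = 257, (1 << 61) - 1

def chunk_boundaries(data: bytes):
    """Return a list of (start, end) offsets covering `data`."""
    boundaries, start, h = [], 0, 0
    pow_out = pow(BASE, WINDOW - 1, MOD)   # weight of the byte leaving the window
    for i, b in enumerate(data):
        if i >= WINDOW:                    # slide: drop the oldest byte
            h = (h - data[i - WINDOW] * pow_out) % MOD
        h = (h * BASE + b) % MOD           # slide: add the newest byte
        size = i + 1 - start
        at_magic = i + 1 >= WINDOW and (h & MASK) == MAGIC
        if (at_magic and size >= MIN_CHUNK) or size >= MAX_CHUNK:
            boundaries.append((start, i + 1))
            start = i + 1
    if start < len(data):                  # trailing partial chunk
        boundaries.append((start, len(data)))
    return boundaries

def chunk_hashes(data: bytes):
    """SHA-1 digest of each chunk, the chunk's identity in LBFS."""
    return [hashlib.sha1(data[s:e]).hexdigest()
            for s, e in chunk_boundaries(data)]
```

Because a breakpoint depends only on the 48 bytes around it, inserting or deleting data perturbs boundaries only near the edit; distant chunks keep their hashes.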

Chunk Database

  • use the first 64 bits of the SHA-1 hash as the key
  • (file, offset, count) as the value
  • entries serve only as hints; contents are rechecked on use
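A minimal sketch of such a database, assuming an in-memory dict and a caller-supplied `read_file(path, offset, count)` helper (both hypothetical); the key derivation and the recheck-on-use behavior follow the notes.

```python
import hashlib

# Sketch of the chunk database: first 64 bits of SHA-1 -> (file, offset, count).
# The in-memory dict and the read_file callback are illustrative assumptions.

def chunk_key(chunk: bytes) -> int:
    """First 64 bits of the chunk's SHA-1 digest, used as the index key."""
    return int.from_bytes(hashlib.sha1(chunk).digest()[:8], "big")

class ChunkDB:
    def __init__(self):
        self._index = {}   # key -> (path, offset, count)

    def insert(self, chunk: bytes, path: str, offset: int):
        self._index[chunk_key(chunk)] = (path, offset, len(chunk))

    def lookup(self, key: int, read_file):
        """Return the chunk bytes if the hint is still valid, else None.

        read_file(path, offset, count) must return the current bytes at
        that location (or None). Because the database is only a hint, the
        data is re-hashed before it is trusted."""
        hint = self._index.get(key)
        if hint is None:
            return None
        path, offset, count = hint
        data = read_file(path, offset, count)
        if data is not None and len(data) == count and chunk_key(data) == key:
            return data
        del self._index[key]   # stale hint: the file changed underneath us
        return None
```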

READ

  • Client sends GETHASH -> Server
  • Server responds with a vector of chunk hashes -> Client
  • Client requests only the chunks it is missing
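A sketch of the client side of this exchange; `gethash` and `fetch_chunk` are hypothetical stand-ins for the RPCs, and the server is assumed to return (SHA-1 hex digest, size) pairs in file order.

```python
import hashlib

# Client side of an LBFS-style read. `gethash` and `fetch_chunk` are
# hypothetical stand-ins for the GETHASH and data-fetch RPCs.

def lbfs_read(gethash, fetch_chunk, local_chunks: dict) -> bytes:
    """Reassemble a file, fetching only chunks missing from local_chunks.

    local_chunks maps SHA-1 hex digest -> chunk bytes and doubles as the
    client's chunk cache."""
    parts = []
    for digest, size in gethash():            # server's (hash, size) vector
        chunk = local_chunks.get(digest)
        if chunk is None:                     # miss: must go to the server
            chunk = fetch_chunk(digest)
            assert hashlib.sha1(chunk).hexdigest() == digest
            local_chunks[digest] = chunk      # cache for later reads
        parts.append(chunk)
    return b"".join(parts)
```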

WRITE

Implementation

Notes

  • Motivation
    • File system for low-bandwidth networks
      • Existing solutions
        • local copy
          • work on the local copy
          • copy back to the server when done
          • manual, error-prone, causes conflicts
        • work remotely
          • ssh to the remote machine
  • Goal: Min bandwidth
    • Related technique: compression
  • Workload Assumptions
    • users make small changes to files, so successive versions are similar
    • e.g.
      • auto-save file
      • object file
      • revision control
  • Related Work
    • AFS: How does it reduce network traffic?
      • caching with callbacks
    • Leases: Modification to callbacks (callback state does not last indefinitely)
      • + reduces state on server
      • + fault tolerance
    • NFSv4: Reduce network traffic?
      • Batch rpc-compound
    • Coda: Works in extreme of no bandwidth
      • disconnected: logs changes on client
      • reconnected: sends updates
      • conflict => manually
      • bandwidth saving: don’t send files that were later deleted or rewritten
  • LBFS: Basic idea
    • Don’t send chunks over the network if the client or server already has them
    • Use SHA-1 to uniquely identify each chunk
    • Option 1: Aligned, non-overlapping chunks
      • Pro: Small # of hashes to track
      • Con: hard to find similar chunks
    • Option 2: Sliding-window of overlapping chunks
      • Pros: easy to find similarity
      • Cons:
        • Overhead for storing/tracking hashes
        • 1 change -> impacts many overlapping hashes
    • Rsync: combination
      • Client: fixed size chunk
      • Server: Sliding window
      • Limitation
        • assumes corresponding files share the same name on client and server
        • sliding window: expensive
    • LBFS Chunk: Not fixed size
      • 48 bytes sliding window
      • Compute the Rabin fingerprint over each 48-byte window
        • last 13 bits = magic number => breakpoint
      • Expected chunk size
        • breakpoint probability 2^-13 per offset => expected spacing 2^13 bytes = 8 KB
      • Represent chunk by sha-1 hash
    • Attractive properties of Rabin fingerprints
      • common cases:
        • add/delete data within a chunk: other breakpoints unaffected
          • recalculate only that chunk’s hash
        • a new breakpoint appears
          • split: recompute SHA-1 for the 2 resulting chunks
        • a breakpoint disappears
          • merge: recompute SHA-1 for the merged chunk
    • Practical issues
      • too small or too large chunks
      • define min + max chunk size
        • suppress breakpoints that would create a chunk < min; force a break at max
  • Chunk Database on client + server
    • map 64-bit key -> (file, offset, count)
    • Update on all file changes
      • hint
      • recheck file contents on access
  • What cache consistency does LBFS provide
    • Goal: Avoid unnecessary traffic with server
      • callbacks
    • Why whole file? (AFS)
      • easier chunk calculation
    • Open:
      • Check if data cached
      • Check if callback broke
      • Contact server if no lease
    • Close:
      • Write back to the server
      • Server notifies other clients of the change
  • Why are leases useful?
    • Server state:
      • drop knowledge past lease expiration
    • Fault tolerance
      • Client crashes, disconnected
      • Server knows the client holds no leases once they expire
    • Leases: Impact of synchronized clocks
      • Implications
        • network latency $\lt \epsilon$
        • clocks run at same rate
        • safe if the client clock runs too fast or the server clock runs too slow (the client then treats its lease as expired before the server does)
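These timing rules can be sketched as follows; the class names, the fixed term, and the explicit epsilon margin are illustrative assumptions rather than the paper's API.

```python
# Sketch of lease timing under loosely synchronized clocks. TERM and
# EPSILON are illustrative; epsilon bounds network latency plus clock skew.
TERM = 60.0      # lease duration (seconds)
EPSILON = 2.0    # latency + skew bound

class ClientLease:
    def __init__(self, granted_at: float):
        self.granted_at = granted_at

    def valid(self, now: float) -> bool:
        # The client gives up the lease EPSILON early, so it errs on the
        # safe side: it treats the lease as expired before the server does.
        return now < self.granted_at + TERM - EPSILON

class ServerLease:
    def __init__(self, granted_at: float):
        self.granted_at = granted_at

    def must_notify(self, now: float) -> bool:
        # The server must notify this client of changes only while the
        # lease is live; afterwards it can drop all state for the client.
        return now < self.granted_at + TERM
```

A fast client clock or a slow server clock only widens the gap between when the client stops trusting its lease and when the server stops honoring it, which is why those skew directions stay safe.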
  • Protocol
    • Read - what if SHA-1 collides?
      • client would use wrong data that happens to have the same SHA-1
    • Write - when is LBFS worse?
      • small files
      • everything has changed (no shared chunks)
  • Security
    • Chunk hashes can leak that someone already has specific data in some file
  • Conclusion
    • Paper
      • Practical implementation on NFS
      • Builds nicely from NFSv4, AFS, Leases
    • Idea
      • Using SHA-1 hashes to identify content + content-addressable storage + deduplication
      • chunk db treated as a hint, not absolute truth
      • Using Rabin fingerprints to define chunk boundaries - robust when adding/deleting data