- cross-file similarities
- Files often contain a number of segments in common with other files or previous versions of the same file
- Divide files into chunks and index the chunks
Design #
- only close-to-open consistency
Cache Chunk #
- Fingerprint every overlapping 48-byte region
- If the low 13 bits of a region's fingerprint equal a magic value, place a breakpoint there
- Enforce min/max chunk sizes
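A minimal sketch of this chunking scheme, assuming a simple polynomial rolling hash as a stand-in for Rabin's fingerprint (a real implementation updates the fingerprint incrementally as the window slides). The magic value and the 2 KB/64 KB size bounds here are illustrative:

```python
import hashlib

WINDOW = 48            # fingerprint every overlapping 48-byte region
LOW_BITS = 13          # breakpoint when the fingerprint's low 13 bits match MAGIC
MAGIC = 0x0F8A         # arbitrary 13-bit constant chosen for illustration
MIN_CHUNK = 2 * 1024   # suppress breakpoints that would create tiny chunks
MAX_CHUNK = 64 * 1024  # force a breakpoint before a chunk grows too large

def fingerprint(window):
    # Stand-in for Rabin's fingerprint: a simple polynomial rolling hash.
    h = 0
    for b in window:
        h = (h * 257 + b) & 0xFFFFFFFFFFFF
    return h

def chunk_boundaries(data):
    """Yield (start, end) offsets of content-defined chunks."""
    mask = (1 << LOW_BITS) - 1
    start = 0
    for i in range(WINDOW, len(data) + 1):
        size = i - start
        at_magic = (fingerprint(data[i - WINDOW:i]) & mask) == MAGIC
        # Breakpoints below MIN_CHUNK are suppressed; MAX_CHUNK forces one.
        if (at_magic and size >= MIN_CHUNK) or size >= MAX_CHUNK:
            yield (start, i)
            start = i
    if start < len(data):
        yield (start, len(data))  # final partial chunk

def chunk_hashes(data):
    # Each chunk is identified by its SHA-1 hash.
    return [hashlib.sha1(data[s:e]).digest() for s, e in chunk_boundaries(data)]
```

Because breakpoints depend only on the 48 bytes inside the window, an insertion or deletion perturbs at most the chunk it lands in and its immediate neighbors; all other boundaries stay put.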
Chunk Database #
- Use the first 64 bits of the SHA-1 hash as the key
- (file, offset, count) as the value
- Entries are only hints; file contents are rechecked on access
READ #
- Client sends GETHASH to the server
- Server responds with a vector of chunk hashes
- Client requests only the chunks it is missing
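The read exchange above can be sketched as follows; `gethash` and `read_chunk` are hypothetical RPC names standing in for the protocol's actual calls, and `cache` is any chunk store keyed by SHA-1:

```python
def lbfs_read(cache, server, path):
    # GETHASH exchange: the server returns the file's chunk hashes,
    # and the client fetches only the chunks it does not already have.
    chunks = []
    for sha1, size in server.gethash(path):
        data = cache.get(sha1)              # check the local chunk store first
        if data is None:
            data = server.read_chunk(path, sha1)  # only missing chunks cross the network
            cache.put(sha1, data)
        chunks.append(data)
    return b"".join(chunks)
```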
WRITE #
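This section is empty in the notes; per the paper, the write path mirrors the read path: the client proposes the chunk hashes for the new contents, the server answers with the hashes it lacks, and only those chunk bodies cross the network. A hedged sketch with hypothetical RPC names (`condwrite`, `write_chunk`, `commit`):

```python
def lbfs_write(server, path, data, chunker, sha1):
    # Client proposes chunk hashes; server replies with the ones it is missing.
    chunks = chunker(data)
    hashes = [sha1(c) for c in chunks]
    missing = set(server.condwrite(path, hashes))
    for h, c in zip(hashes, chunks):
        if h in missing:
            server.write_chunk(path, h, c)  # send only chunks the server lacks
    server.commit(path, hashes)             # atomically install the new contents
```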

Implementation #

Notes #
- Motivation
- File system for low-bandwidth networks
- Existing solutions
- local copy
- work local copy
- copy to server when done
- manual, mistakes, conflict
- work remotely
- Goal: Min bandwidth
- Related technique: compression
- Workload Assumptions
- make small changes to files, similar versions
- e.g.
- auto-save file
- object file
- revision control
- Related Work
- AFS: How does it reduce network traffic?
- Leases: Modification to callbacks (not having a file indefinitely)
- + reduces state on server
- + fault tolerance
- NFSv4: Reduce network traffic?
- Coda: Works in extreme of no bandwidth
- disconnected: logs changes on client
- reconnected: sends updates
- conflict => manually
- bandwidth saving: don’t send files that were deleted or rewritten before reconnecting
- LBFS: Basic idea
- Don’t send chunks over the network if the client or server already has them
- Use SHA-1 to uniquely identify chunk
- Option 1: Aligned, non-overlapping chunks
- Pro: small number of hashes to track
- Con: hard to find similar chunks
- Option 2: Sliding-window of overlapping chunks
- Pros: easy to find similarity
- Cons:
- Overhead for storing/tracking hashes
- 1 change -> impact many hashes
- Rsync: combination
- Client: fixed size chunk
- Server: Sliding window
- Limitation
- assumes files correspond by name
- sliding window: expensive
- LBFS Chunk: Not fixed size
- 48 bytes sliding window
- Compute Rabin’s fingerprint over 48-bytes
- last 13 bits = magic number => breakpoint
- Expected chunk size: $2^{13}$ bytes = 8 KB
- Represent chunk by sha-1 hash
- Attractive properties of Rabin’s fingerprint
- common cases:
- adding/deleting data within a chunk does not affect other breakpoints
- Add breakpoint
- Remove breakpoint
- Practical issues
- too small or too large chunks
- define min + max chunk size
- suppress breakpoints < min
- Chunk Database on client + server
- map 64-bit key -> (file, offset, count)
- Update on all file changes
- hint
- recheck file contents on access
- What cache consistency does LBFS provide
- Goal: Avoid unnecessary traffic with server
- Why whole file? (AFS)
- Open:
- Check if data cached
- Check if callback broke
- Contact server if no lease
- Close:
- Write back to server
- Server notifies other clients of changes
- Why are leases useful?
- Server state:
- drop knowledge past lease expiration
- Fault tolerance
- Client crashes, disconnected
- Server knows the client holds no leases once they expire
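A toy illustration of why leases bound server state, assuming an in-memory table and an injectable clock (the 60-second duration is arbitrary):

```python
import time

LEASE_SECONDS = 60.0  # arbitrary lease duration for illustration

class LeaseTable:
    """Server-side lease bookkeeping. Once a lease expires, the server may
    forget the client entirely: state stays bounded, and a crashed or
    disconnected client is harmless because its leases simply run out."""

    def __init__(self, now=time.monotonic):
        self.now = now
        self.leases = {}  # (client, path) -> expiry time

    def grant(self, client, path):
        self.leases[(client, path)] = self.now() + LEASE_SECONDS

    def must_notify(self, path):
        # Only clients holding an unexpired lease on this file need an
        # invalidation when it changes.
        t = self.now()
        return sorted(c for (c, p), exp in self.leases.items()
                      if p == path and exp > t)

    def expire(self):
        # Drop all knowledge past lease expiration.
        t = self.now()
        self.leases = {k: v for k, v in self.leases.items() if v > t}
```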
- Leases: Impact of synchronized clocks
- Implications
- network latency $\lt \epsilon$
- clocks run at same rate
- OK if the client clock runs too fast or the server clock runs too slow (the lease just expires early on the client)
- Protocol
- Read: SHA-1 collision
- a client could end up using wrong data that hashes to the same SHA-1
- Write - when worse?
- small files
- everything has changed
- Security
- Leaks whether someone already has specific data in some file
- Conclusion
- Paper
- Practical implementation on NFS
- Builds nicely from NFSv4, AFS, Leases
- Idea
- Using SHA-1 hashes to identify content + content-addressable storage + deduplication
- chunk DB used as a hint rather than absolute truth
- Using Rabin’s fingerprints for defining chunks - great properties when adding/deleting data