  • cross-file similarities
    • Files often share many segments with other files or with previous versions of the same file
  • divide files into chunks and index the chunks

Design

  • only close-to-open consistency

Chunking

  • Fingerprint every overlapping 48-byte region
    • if the last 13 bits of the fingerprint equal a magic value, place a breakpoint
  • Enforce min/max chunk size
    • 2 KB minimum / 64 KB maximum
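The chunking scheme above can be sketched as follows. This is a minimal illustration: a plain polynomial rolling hash stands in for a true Rabin fingerprint, and the magic value is arbitrary (both assumptions); the 48-byte window, 13-bit test, and 2K/64K bounds follow the notes.

```python
import hashlib

# Content-defined chunking sketch. A polynomial rolling hash stands in
# for a Rabin fingerprint (assumption for brevity).
WINDOW = 48               # sliding-window width in bytes
MASK = (1 << 13) - 1      # test the low 13 bits of the fingerprint
MAGIC = 0x78              # arbitrary magic value for this sketch
MIN_CHUNK = 2 * 1024      # suppress breakpoints closer than this
MAX_CHUNK = 64 * 1024     # force a breakpoint at this size
BASE, MOD = 257, (1 << 61) - 1

def chunk_boundaries(data: bytes):
    """Return a list of (start, end) offsets covering `data`."""
    boundaries, start, h = [], 0, 0
    pow_out = pow(BASE, WINDOW - 1, MOD)   # weight of the byte leaving the window
    for i, b in enumerate(data):
        if i >= WINDOW:                    # slide: drop the oldest byte
            h = (h - data[i - WINDOW] * pow_out) % MOD
        h = (h * BASE + b) % MOD           # slide: add the newest byte
        size = i + 1 - start
        at_magic = i + 1 >= WINDOW and (h & MASK) == MAGIC
        if (at_magic and size >= MIN_CHUNK) or size >= MAX_CHUNK:
            boundaries.append((start, i + 1))
            start = i + 1
    if start < len(data):                  # trailing partial chunk
        boundaries.append((start, len(data)))
    return boundaries

def chunk_hashes(data: bytes):
    """SHA-1 digest of each chunk, the chunk's identity in LBFS."""
    return [hashlib.sha1(data[s:e]).hexdigest()
            for s, e in chunk_boundaries(data)]
```

Because a breakpoint depends only on the 48 bytes around it, inserting or deleting data perturbs boundaries only near the edit; distant chunks keep their hashes.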

Chunk Database

  • use the first 64 bits of the SHA-1 hash as the key
  • (file, offset, count) as the value
  • entries serve only as hints; contents are rechecked on use
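A minimal sketch of such a database, assuming an in-memory dict and a caller-supplied `read_file(path, offset, count)` helper (both hypothetical); the key derivation and the recheck-on-use behavior follow the notes.

```python
import hashlib

# Sketch of the chunk database: first 64 bits of SHA-1 -> (file, offset, count).
# The in-memory dict and the read_file callback are illustrative assumptions.

def chunk_key(chunk: bytes) -> int:
    """First 64 bits of the chunk's SHA-1 digest, used as the index key."""
    return int.from_bytes(hashlib.sha1(chunk).digest()[:8], "big")

class ChunkDB:
    def __init__(self):
        self._index = {}   # key -> (path, offset, count)

    def insert(self, chunk: bytes, path: str, offset: int):
        self._index[chunk_key(chunk)] = (path, offset, len(chunk))

    def lookup(self, key: int, read_file):
        """Return the chunk bytes if the hint is still valid, else None.

        read_file(path, offset, count) must return the current bytes at
        that location (or None). Because the database is only a hint, the
        data is re-hashed before it is trusted."""
        hint = self._index.get(key)
        if hint is None:
            return None
        path, offset, count = hint
        data = read_file(path, offset, count)
        if data is not None and len(data) == count and chunk_key(data) == key:
            return data
        del self._index[key]   # stale hint: the file changed underneath us
        return None
```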

READ

  • Client sends GETHASH -> Server
  • Server responds with a vector of chunk hashes -> Client
  • Client requests only the chunks it is missing
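A sketch of the client side of this exchange; `gethash` and `fetch_chunk` are hypothetical stand-ins for the RPCs, and the server is assumed to return (SHA-1 hex digest, size) pairs in file order.

```python
import hashlib

# Client side of an LBFS-style read. `gethash` and `fetch_chunk` are
# hypothetical stand-ins for the GETHASH and data-fetch RPCs.

def lbfs_read(gethash, fetch_chunk, local_chunks: dict) -> bytes:
    """Reassemble a file, fetching only chunks missing from local_chunks.

    local_chunks maps SHA-1 hex digest -> chunk bytes and doubles as the
    client's chunk cache."""
    parts = []
    for digest, size in gethash():            # server's (hash, size) vector
        chunk = local_chunks.get(digest)
        if chunk is None:                     # miss: must go to the server
            chunk = fetch_chunk(digest)
            assert hashlib.sha1(chunk).hexdigest() == digest
            local_chunks[digest] = chunk      # cache for later reads
        parts.append(chunk)
    return b"".join(parts)
```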

WRITE

Implementation

Notes

  • Motivation
    • File system for low-bandwidth networks
      • Existing solutions
        • local copy
          • work on the local copy
          • copy back to the server when done
          • manual, error-prone, causes conflicts
        • work remotely
          • ssh to the remote machine
  • Goal: Min bandwidth
    • Related technique: compression
  • Workload Assumptions
    • users make small changes to files, so successive versions are similar
    • e.g.
      • auto-save file
      • object file
      • revision control
  • Related Work
    • AFS: How does it reduce network traffic?
      • caching with callbacks
    • Leases: Modification to callbacks (callback state does not last indefinitely)
      • + reduces state on server
      • + fault tolerance
    • NFSv4: Reduce network traffic?
      • Batch rpc-compound
    • Coda: Works in extreme of no bandwidth
      • disconnected: logs changes on client
      • reconnected: sends updates
      • conflict => manually
      • bandwidth saving: don’t send files that were later deleted or rewritten
  • LBFS: Basic idea
    • Don’t send chunks over the network if the client or server already has them
    • Use SHA-1 to uniquely identify each chunk
    • Option 1: Aligned, non-overlapping chunks
      • Pro: Small # of hashes to track
      • Con: hard to find similar chunks
    • Option 2: Sliding-window of overlapping chunks
      • Pros: easy to find similarity
      • Cons:
        • Overhead for storing/tracking hashes
        • 1 change -> impacts many overlapping hashes
    • Rsync: combination
      • Client: fixed size chunk
      • Server: Sliding window
      • Limitation
        • assumes corresponding files share the same name on client and server
        • sliding window: expensive
    • LBFS Chunk: Not fixed size
      • 48 bytes sliding window
      • Compute the Rabin fingerprint over each 48-byte window
        • last 13 bits = magic number => breakpoint
      • Expected chunk size
        • breakpoint probability 2^-13 per offset => expected spacing 2^13 bytes = 8 KB
      • Represent chunk by sha-1 hash
    • Attractive properties of Rabin fingerprints
      • common cases:
        • add/delete data within a chunk: other breakpoints unaffected
          • recalculate only that chunk’s hash
        • a new breakpoint appears
          • split: recompute SHA-1 for the 2 resulting chunks
        • a breakpoint disappears
          • merge: recompute SHA-1 for the merged chunk
    • Practical issues
      • too small or too large chunks
      • define min + max chunk size
        • suppress breakpoints that would create a chunk < min; force a break at max
  • Chunk Database on client + server
    • map 64-bit key -> (file, offset, count)
    • Update on all file changes
      • hint
      • recheck file contents on access
  • What cache consistency does LBFS provide
    • Goal: Avoid unnecessary traffic with server
      • callbacks
    • Why whole file? (AFS)
      • easier chunk calculation
    • Open:
      • Check if data cached
      • Check if callback broke
      • Contact server if no lease
    • Close:
      • Write back to the server
      • Server notifies other clients of the change
  • Why are leases useful?
    • Server state:
      • drop knowledge past lease expiration
    • Fault tolerance
      • Client crashes, disconnected
      • Server knows the client holds no leases once they expire
    • Leases: Impact of synchronized clocks
      • Implications
        • network latency $\lt \epsilon$
        • clocks run at same rate
        • safe if the client clock runs too fast or the server clock runs too slow (the client then treats its lease as expired before the server does)
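These timing rules can be sketched as follows; the class names, the fixed term, and the explicit epsilon margin are illustrative assumptions rather than the paper's API.

```python
# Sketch of lease timing under loosely synchronized clocks. TERM and
# EPSILON are illustrative; epsilon bounds network latency plus clock skew.
TERM = 60.0      # lease duration (seconds)
EPSILON = 2.0    # latency + skew bound

class ClientLease:
    def __init__(self, granted_at: float):
        self.granted_at = granted_at

    def valid(self, now: float) -> bool:
        # The client gives up the lease EPSILON early, so it errs on the
        # safe side: it treats the lease as expired before the server does.
        return now < self.granted_at + TERM - EPSILON

class ServerLease:
    def __init__(self, granted_at: float):
        self.granted_at = granted_at

    def must_notify(self, now: float) -> bool:
        # The server must notify this client of changes only while the
        # lease is live; afterwards it can drop all state for the client.
        return now < self.granted_at + TERM
```

A fast client clock or a slow server clock only widens the gap between when the client stops trusting its lease and when the server stops honoring it, which is why those skew directions stay safe.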
  • Protocol
    • Read - what if SHA-1 collides?
      • client would use wrong data that happens to have the same SHA-1
    • Write - when is LBFS worse?
      • small files
      • everything has changed (no shared chunks)
  • Security
    • Chunk hashes can leak that someone already has specific data in some file
  • Conclusion
    • Paper
      • Practical implementation on NFS
      • Builds nicely from NFSv4, AFS, Leases
    • Idea
      • Using SHA-1 hashes to identify content + content-addressable storage + deduplication
      • chunk db treated as a hint, not absolute truth
      • Using Rabin fingerprints to define chunk boundaries - robust when adding/deleting data