Assumptions

  • Component failures are the norm => need constant monitoring and fast recovery
  • Optimize for large files
  • Large streaming reads / small random reads
  • Large sequential writes
  • Atomic append (many clients appending to the same file concurrently)
  • High bandwidth
    • Latency => not as important

Architecture

  • Single master server, multiple chunkservers
  • chunk
    • chunk handle: globally unique, assigned by master at creation
    • chunks stored on chunkservers' local disks as Linux files
    • each chunk replicated on multiple chunkservers (3 by default)
  • master
    • maintains all metadata
      • (ACL, file <-> chunk mapping, location …)
      • garbage collection, lease management, …
  • Operation (read path; see the sketch after this list)
    • Client sends (filename, chunk index) to master
    • Master replies with (chunk handle, chunkserver locations)
    • Client reads the data directly from a chunkserver and caches the locations
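
A minimal Go sketch of that read path, under the architecture above. All names (Master, ChunkLocation, Lookup, Read) are illustrative, not GFS's actual API.

```go
package main

import "fmt"

const chunkSize = 64 << 20 // 64 MB fixed-size chunks, as in the paper

// ChunkLocation is what the (hypothetical) master returns for one chunk.
type ChunkLocation struct {
	Handle   uint64   // globally unique chunk handle, assigned by the master
	Replicas []string // chunkservers currently holding a copy
}

// Master models only the lookup step: (filename, chunk index) -> location.
type Master struct {
	chunks map[string][]ChunkLocation // file -> ordered list of its chunks
}

func (m *Master) Lookup(file string, idx int) (ChunkLocation, error) {
	cs, ok := m.chunks[file]
	if !ok || idx >= len(cs) {
		return ChunkLocation{}, fmt.Errorf("no chunk %d for %q", idx, file)
	}
	return cs[idx], nil
}

// Read shows the client-side flow: translate offset -> chunk index,
// ask the master once, then fetch the data directly from a chunkserver.
func Read(m *Master, file string, offset int64) error {
	idx := int(offset / chunkSize)
	loc, err := m.Lookup(file, idx)
	if err != nil {
		return err
	}
	// A real client would cache loc and read over the network;
	// here we just print the routing decision.
	fmt.Printf("read %s@%d -> chunk %d (handle %d) from %s\n",
		file, offset, idx, loc.Handle, loc.Replicas[0])
	return nil
}

func main() {
	m := &Master{chunks: map[string][]ChunkLocation{
		"/logs/web.log": {{Handle: 42, Replicas: []string{"cs1:7000", "cs2:7000", "cs3:7000"}}},
	}}
	_ = Read(m, "/logs/web.log", 1<<20)
}
```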

Metadata

  1. file and chunk namespaces
  2. file <=> chunks mapping
  3. locations of each chunk’s replicas
  • all kept in the master's memory
  • 1 and 2 also persisted on disk via the operation log
  • 3 rebuilt by polling chunkservers at master startup (see the sketch below)
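
A sketch of how those three pieces might be laid out in the master's memory. The struct and field names are assumptions for illustration; only the "poll chunkservers at startup" behavior for (3) comes from the notes above.

```go
package main

import "fmt"

// Hypothetical layout of the master's metadata. The namespace and the
// file -> chunk mapping (1, 2) would also be written to the operation
// log; replica locations (3) live only in memory.
type ChunkMeta struct {
	Version  uint64   // bumped on each new lease (stale-replica detection)
	Replicas []string // chunkserver addresses; rebuilt at startup
}

type MasterState struct {
	Files  map[string][]uint64   // (1)+(2): path -> ordered chunk handles
	Chunks map[uint64]*ChunkMeta // handle -> per-chunk state, incl. (3)
}

// ReportChunks is how (3) gets reconstructed: each chunkserver tells
// the master which handles it stores when the master starts.
func (m *MasterState) ReportChunks(server string, handles []uint64) {
	for _, h := range handles {
		c, ok := m.Chunks[h]
		if !ok {
			c = &ChunkMeta{}
			m.Chunks[h] = c
		}
		c.Replicas = append(c.Replicas, server)
	}
}

func main() {
	m := &MasterState{
		Files:  map[string][]uint64{"/logs/web.log": {42}},
		Chunks: map[uint64]*ChunkMeta{},
	}
	m.ReportChunks("cs1:7000", []uint64{42})
	m.ReportChunks("cs2:7000", []uint64{42})
	fmt.Println(m.Chunks[42].Replicas) // [cs1:7000 cs2:7000]
}
```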

Consistency Model

  • Record append
    • appended atomically at least once (retries can leave duplicates and padding; see the sketch after this list)
  • Clients cache chunk locations
    • may read from a stale replica
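
One way to read "at least once" from the client side, as a sketch. The failure simulation and the appendOnce/RecordAppend helpers are made up for illustration; the point is that the client retries, duplicate records can appear, and readers deduplicate by record ID.

```go
package main

import (
	"errors"
	"fmt"
	"math/rand"
)

// appendOnce stands in for a record-append RPC that may fail partway;
// the system only guarantees the record lands atomically at least
// once, so the client simply retries on error.
func appendOnce(log *[]string, rec string) error {
	if rand.Intn(3) == 0 { // simulate a failed append attempt
		*log = append(*log, rec) // the record may still have been written
		return errors.New("append failed, retrying")
	}
	*log = append(*log, rec)
	return nil
}

// RecordAppend retries until success; duplicates are possible, so each
// record carries an ID and readers deduplicate (a client-side pattern,
// not something the file system does for you).
func RecordAppend(log *[]string, id int, payload string) {
	rec := fmt.Sprintf("%d:%s", id, payload)
	for appendOnce(log, rec) != nil {
	}
}

func main() {
	var log []string
	RecordAppend(&log, 1, "hello")
	RecordAppend(&log, 2, "world")

	seen := map[string]bool{}
	for _, rec := range log {
		if !seen[rec] { // reader-side dedup by record ID
			seen[rec] = true
			fmt.Println(rec)
		}
	}
}
```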

Mutation Order

  • Mutations to the same chunk must be applied in the same order on all replicas
  • Master grants a chunk lease to one replica, the primary
  • Primary picks a serial order for all mutations; every replica applies them in that order (sketched below)
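
A sketch of that serial ordering, with made-up Primary/Replica types: the primary stamps each mutation with a serial number, and replicas apply mutations strictly in that order even if they arrive out of order.

```go
package main

import "fmt"

// Mutation is a write tagged by the primary with a serial number; every
// replica applies mutations in serial-number order, so all copies of
// the chunk end up identical.
type Mutation struct {
	Serial int
	Data   string
}

type Primary struct{ next int }

// Order is the only thing the primary adds here: a position in the
// per-chunk serial order for each incoming mutation.
func (p *Primary) Order(data string) Mutation {
	p.next++
	return Mutation{Serial: p.next, Data: data}
}

// Replica applies mutations strictly in serial order, buffering
// anything that arrives early.
type Replica struct {
	applied int
	pending map[int]string
	state   []string
}

func (r *Replica) Apply(m Mutation) {
	if r.pending == nil {
		r.pending = map[int]string{}
	}
	r.pending[m.Serial] = m.Data
	for d, ok := r.pending[r.applied+1]; ok; d, ok = r.pending[r.applied+1] {
		r.applied++
		r.state = append(r.state, d)
		delete(r.pending, r.applied)
	}
}

func main() {
	p := &Primary{}
	a, b := p.Order("w1"), p.Order("w2")

	r := &Replica{}
	r.Apply(b) // arrives out of order over the network
	r.Apply(a)
	fmt.Println(r.state) // [w1 w2] on every replica
}
```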

Snapshot

  • Copy-on-write
  • Client sends snapshot request to master
  • Master revokes outstanding leases on the affected chunks
  • Master logs the operation and duplicates the file's metadata (the new file shares the same chunks)
  • Later, a client wanting to write chunk C asks the master for replica locations
  • Master sees C is shared (reference count > 1) and picks a new chunk handle C'
  • Master asks every chunkserver holding C to create C' as a local copy (sketched below)
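
A copy-on-write sketch of that flow on the master side. The Master struct, refCount map, and WriteIntent are hypothetical names; the real system would also revoke leases first and have chunkservers copy the chunk data locally.

```go
package main

import "fmt"

// Minimal master state for copy-on-write snapshots.
type Master struct {
	files    map[string][]uint64 // path -> chunk handles
	refCount map[uint64]int      // how many files point at a chunk
	next     uint64              // source of fresh chunk handles
}

// Snapshot only duplicates metadata: the new file shares every chunk
// with the source, and the shared chunks' reference counts go up.
func (m *Master) Snapshot(src, dst string) {
	handles := append([]uint64(nil), m.files[src]...)
	m.files[dst] = handles
	for _, h := range handles {
		m.refCount[h]++
	}
}

// WriteIntent runs when a client asks to mutate chunk index i of a
// file: if the chunk is shared, the master picks a new handle C' and
// (in the real system) asks each replica to copy the chunk locally.
func (m *Master) WriteIntent(path string, i int) uint64 {
	h := m.files[path][i]
	if m.refCount[h] > 1 {
		m.refCount[h]--
		m.next++
		newH := m.next
		m.refCount[newH] = 1
		m.files[path][i] = newH // only this file sees the copy
		return newH
	}
	return h
}

func main() {
	m := &Master{
		files:    map[string][]uint64{"/db": {1, 2}},
		refCount: map[uint64]int{1: 1, 2: 1},
		next:     2,
	}
	m.Snapshot("/db", "/db.snap")
	fmt.Println(m.WriteIntent("/db", 0)) // 3: chunk was shared, so it gets copied
	fmt.Println(m.files["/db.snap"][0])  // 1: the snapshot still sees the old chunk
}
```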

Garbage Collection

  • Soft delete: the file is renamed to a hidden name carrying a deletion timestamp
  • Master reclaims hidden files and orphaned chunks lazily during background scans (sketched below)
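
A sketch of delete-as-rename plus a background sweep. The hidden-name scheme and three-day grace period here are assumptions for illustration, not the system's actual values.

```go
package main

import (
	"fmt"
	"time"
)

// Hypothetical master state: live (and hidden) files plus every chunk
// handle the master knows about.
type Master struct {
	files  map[string][]uint64 // path -> chunk handles
	chunks map[uint64]bool     // all known chunk handles
}

// Delete is a rename: the file becomes a hidden name carrying a
// deletion timestamp, and nothing is reclaimed immediately.
func (m *Master) Delete(path string, now time.Time) {
	hidden := fmt.Sprintf(".deleted.%d.%s", now.Unix(), path)
	m.files[hidden] = m.files[path]
	delete(m.files, path)
}

// Sweep runs in the background: hidden files older than the grace
// period are dropped, then any chunk no remaining file references is
// reported as reclaimable (chunkservers would erase it later).
func (m *Master) Sweep(now time.Time, grace time.Duration) []uint64 {
	for name := range m.files {
		var ts int64
		if n, _ := fmt.Sscanf(name, ".deleted.%d.", &ts); n == 1 &&
			now.Sub(time.Unix(ts, 0)) > grace {
			delete(m.files, name)
		}
	}
	live := map[uint64]bool{}
	for _, handles := range m.files {
		for _, h := range handles {
			live[h] = true
		}
	}
	var orphaned []uint64
	for h := range m.chunks {
		if !live[h] {
			orphaned = append(orphaned, h)
			delete(m.chunks, h)
		}
	}
	return orphaned
}

func main() {
	m := &Master{
		files:  map[string][]uint64{"/tmp/out": {7}},
		chunks: map[uint64]bool{7: true},
	}
	now := time.Now()
	m.Delete("/tmp/out", now.Add(-4*24*time.Hour)) // deleted four days ago
	fmt.Println(m.Sweep(now, 3*24*time.Hour))      // [7]: orphaned, reclaimable
}
```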

Stale Replica Detection

  • Master increments a chunk's version number whenever it grants a new lease on it
  • Replicas reporting an older version (e.g. they were down during a mutation) are stale and get garbage collected (sketched below)
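
A sketch of the version check, with illustrative names: GrantLease bumps the version, and chunkserver reports are compared against it.

```go
package main

import "fmt"

// Master tracks the current version of each chunk.
type Master struct {
	version map[uint64]uint64 // chunk handle -> current version
}

// GrantLease bumps the version; up-to-date replicas are told the new
// number. A replica that misses the bump stays at the old version.
func (m *Master) GrantLease(handle uint64) uint64 {
	m.version[handle]++
	return m.version[handle]
}

// IsStale is checked when a chunkserver reports its chunks (at startup
// or in heartbeats); stale replicas are garbage collected.
func (m *Master) IsStale(handle, reportedVersion uint64) bool {
	return reportedVersion < m.version[handle]
}

func main() {
	m := &Master{version: map[uint64]uint64{42: 1}}

	v := m.GrantLease(42)         // replicas A and B learn version 2; C was down
	fmt.Println(m.IsStale(42, v)) // false: up to date
	fmt.Println(m.IsStale(42, 1)) // true: C still at version 1 -> stale
}
```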