GFS

Assumptions
- Component failures are the norm => need fast recovery
- Optimized for large files
- Large streaming reads / small random reads
- Large sequential writes; atomic append
- High bandwidth matters more than low latency

Architecture
- Single master server, multiple chunkservers
- Chunks
  - Chunk handle: assigned by the master
  - Chunks stored on local disks as Linux files
  - Each chunk replicated on 3+ chunkservers
- Master maintains all metadata (ACLs, file <-> chunk mapping, replica locations, ...)
- Master also handles garbage collection, lease management, ...

Operation
- Client sends (filename, chunk index) to the master
- Master replies with (chunk handle, chunkserver locations)
- Client goes to a chunkserver for the data

Metadata
1. File and chunk namespaces
2. File <-> chunks mapping
3. Locations of each chunk's replicas
- All kept in memory; 1 and 2 are persisted on disk; 3 is asked of chunkservers on startup

Consistency Model
- Record append: appended atomically at least once
- Clients cache chunkserver locations => may read stale data

Order
- Mutation order on a chunk must be the same across its replicas
- Master grants a chunk lease to a primary replica
- The primary picks a serial order for all mutations

Snapshot
- Copy-on-write
- Client sends a snapshot request to the master
- Master revokes outstanding leases
- Master duplicates the metadata
- When a client later wants to write chunk C, it asks the master for replica locations
  - Master picks a new chunk handle C'
  - Master asks all replicas holding C to create C'

Garbage Collection
- Soft delete (rename the file)
- Collect garbage in the background

Stale Replica Detection
- Assign a chunk version number whenever the master grants a new lease
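The read path above can be sketched with in-memory stand-ins (all class and variable names here are hypothetical, not the real GFS API): the client asks the master for (chunk handle, replica locations) once, then fetches the data from a chunkserver directly, keeping the master off the data path.

```python
CHUNK_SIZE = 64 * 1024 * 1024  # GFS uses fixed-size 64 MB chunks

class Master:
    def __init__(self):
        # file -> list of chunk handles (metadata kept in memory)
        self.file_chunks = {"/logs/web.log": ["handle-17", "handle-42"]}
        # chunk handle -> replica locations (learned from chunkservers at startup)
        self.locations = {"handle-17": ["cs1", "cs2", "cs3"],
                          "handle-42": ["cs2", "cs4", "cs5"]}

    def lookup(self, filename, chunk_index):
        handle = self.file_chunks[filename][chunk_index]
        return handle, self.locations[handle]

class Chunkserver:
    def __init__(self, chunks):
        self.chunks = chunks  # chunk handle -> bytes (Linux files in real GFS)

    def read(self, handle, offset, length):
        return self.chunks[handle][offset:offset + length]

def client_read(master, chunkservers, filename, offset, length):
    # Translate the byte offset into a chunk index, ask the master once,
    # then talk to a chunkserver directly.
    chunk_index = offset // CHUNK_SIZE
    handle, locs = master.lookup(filename, chunk_index)
    server = chunkservers[locs[0]]        # pick any replica holding the chunk
    return server.read(handle, offset % CHUNK_SIZE, length)

master = Master()
servers = {"cs1": Chunkserver({"handle-17": b"GET /index.html 200\n"})}
print(client_read(master, servers, "/logs/web.log", 0, 19))
```

In the real system the client also caches the (handle, locations) reply for a while, which is exactly why stale reads are possible, as noted under the consistency model.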

April 14, 2022

Nil-Ext

Nil-externality
- An operation is nil-externalizing if it does not return an execution result or execution error (a validation error is OK)

Nilext-aware Replication
- Normal (Paxos): write takes 2 RTTs; read from the master takes 1 RTT
- Nilext-aware: client sends the request to all replicas and waits for replies from a supermajority that includes the master

Other approaches
- Network ordering: needs a special underlying network
- Speculative execution: execute, then roll back on mis-speculation
- Exploiting commutativity

SKYROS
- Update
  - Client sends to all replicas
  - Client waits for a supermajority $f + \lceil f/2 \rceil + 1$ that includes the leader
  - Replicas write the update to their durability log and return
- Background ordering and execution
  - Leader batches the durability log and adds entries to the consensus log
  - Leader sends prepare to $f$ followers; followers add to their consensus logs
  - Leader applies the update and removes it from its durability log
  - Leader sends commit to the followers
  - Followers apply the update and remove it from their durability logs
- Read
  - If the key is in the durability log: leader adds all requests from the d-log to the consensus log, waits for $f$ followers to respond, then serves the read
  - Else: directly return the data
- Non-nilext updates
  - Client sends to the leader only
  - Leader adds all pending d-log entries plus this update to the consensus log
  - Wait for $f$ followers => return

Recovery
- Replica failure
  - Mark status as recovering
  - Send a recovery request; wait for $f+1$ replies, including the leader, from the latest view
  - Copy the d-log and c-log
- Master failure
  - C-log: find the latest log with the largest view #
  - D-log
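The fast-path write condition above can be sketched directly (helper names are mine, not from SKYROS): with $2f+1$ replicas, the client considers a nilext update durable once $f + \lceil f/2 \rceil + 1$ replicas, the leader among them, have acknowledged writing it to their durability logs.

```python
import math

def supermajority(f):
    # SKYROS fast-path quorum size out of 2f+1 replicas
    return f + math.ceil(f / 2) + 1

def fast_path_complete(acks, leader, f):
    # acks: set of replica ids that wrote the update to their durability log;
    # the quorum must include the leader, since only the leader orders later.
    return leader in acks and len(acks) >= supermajority(f)

# With f = 2 (5 replicas), the supermajority is 2 + 1 + 1 = 4.
print(supermajority(2))                                       # -> 4
print(fast_path_complete({"r0", "r1", "r2", "r3"}, "r0", 2))  # leader acked -> True
print(fast_path_complete({"r1", "r2", "r3", "r4"}, "r0", 2))  # no leader ack -> False
```

Requiring the leader in the quorum is what lets ordering and execution happen lazily in the background: the leader is guaranteed to know about every durable-but-unexecuted update.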

April 12, 2022

Linearizable

Each operation appears to take effect instantaneously, exactly once, at some point between its invocation and its response
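A toy brute-force check makes the definition concrete for a single register (illustrative only, not from the notes): a history is linearizable iff the operations can be ordered so that (1) the order respects real time, i.e. an operation that finished before another began comes first, and (2) every read returns the latest written value in that order.

```python
from itertools import permutations

# Each op: (kind, value, invoke_time, response_time)
def respects_real_time(order):
    # If op j responded before op i was invoked, j may not follow i.
    return all(not (order[j][3] < order[i][2])
               for i in range(len(order)) for j in range(i + 1, len(order)))

def sequentially_valid(order):
    reg = None
    for kind, value, _, _ in order:
        if kind == "write":
            reg = value
        elif value != reg:          # a read must see the most recent write
            return False
    return True

def linearizable(history):
    return any(respects_real_time(p) and sequentially_valid(p)
               for p in permutations(history))

# write(1) overlaps a read returning 1: fine, the read can be ordered after it.
print(linearizable([("write", 1, 0, 4), ("read", 1, 2, 3)]))    # True
# write(2) finished before write(1) began, then a read returns 2: no valid order.
print(linearizable([("write", 2, 0, 1), ("write", 1, 2, 3),
                    ("read", 2, 4, 5)]))                         # False
```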

April 8, 2022

Linearizability, Occult

RIFL: Reusable Infrastructure for Linearizability
- Turns at-least-once semantics into exactly-once
- Assumes a system built on RPC
  - Underlying RPC provides at-least-once semantics
  - Operations are not aborted after a server crash

Architecture
- RPC ID: assigned by clients; | client ID | seq number |
- Completion record durability
  - RPC ID + result, created atomically with the update
- Retry rendezvous
  - A retry must be replied to with the previous result
  - A retry might go to another server => the completion record must be there
  - Store/move the completion record with the associated object
- Garbage collection: the server can remove a completion record when
  - the client acks the server's response, or
  - the client crashes
- Lease
  - Each client has a private lease; client ID = lease ID
  - Client crash => lease expires => reboot => new client ID

Design Details
- Server, upon receiving an RPC, checks for duplicates:
  - NEW (normal case): execute the RPC
  - COMPLETED: return the old result
  - IN_PROGRESS: discard, or notify the client
  - STALE (retry message delayed until after the client's ack): return an error
- Completion record
  - (RPC ID, object ID, result), created atomically with the update
  - Marked COMPLETED and persisted on disk before returning
- Client receives reply
  - Client marks this seq number as completed
  - On the next RPC call: piggyback the minimum incomplete seq number
  - Server garbage-collects records with smaller seq numbers

Lease Server (ZooKeeper)
- Scalability: don't store lease expiration times on disk (only the existence of the lease)
- Validation
  - The lease server implements a cluster clock
  - Client gets the clock value when renewing its lease and includes it in RPC calls
  - Server checks with the lease server if the lease is close to expiring

Transactions
- Use RIFL for prepare & requestAbort
- Normal case
  - Client sends prepare to each object's server (participant)
  - Servers acquire object locks; return PREPARED or ABORT
  - Client sends the decision
- Client crash
  - The first participant becomes the recovery coordinator
  - It sends requestAbort (with the same RPC ID as prepare)
  - It receives PREPARED (already prepared) or ABORT
  - Coordinator sends the decision
- Garbage collection: the client only marks prepare as completed after receiving the response to the decision

Occult
- Prevents slowdown cascades
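The four duplicate-check cases above can be sketched as a small server-side tracker (class and method names are hypothetical; real RIFL persists the completion record atomically with the update, which a dict obviously does not):

```python
class ResultTracker:
    def __init__(self):
        self.records = {}      # (client_id, seq) -> result (completion records)
        self.in_progress = set()
        self.gc_floor = {}     # client_id -> smallest seq not yet acked

    def handle(self, client_id, seq, execute):
        rpc_id = (client_id, seq)
        if seq < self.gc_floor.get(client_id, 0):
            return "STALE"                 # record was already garbage-collected
        if rpc_id in self.records:
            return self.records[rpc_id]    # COMPLETED: return the old result
        if rpc_id in self.in_progress:
            return "IN_PROGRESS"           # discard or notify the client
        self.in_progress.add(rpc_id)       # NEW: execute the RPC
        result = execute()
        # Real RIFL: completion record created atomically with the update and
        # persisted before replying. Here it is just a dict entry.
        self.records[rpc_id] = result
        self.in_progress.discard(rpc_id)
        return result

    def ack(self, client_id, min_incomplete_seq):
        # Piggybacked on the client's next RPC: collect smaller seq numbers.
        self.gc_floor[client_id] = min_incomplete_seq
        for cid, seq in list(self.records):
            if cid == client_id and seq < min_incomplete_seq:
                del self.records[(cid, seq)]

t = ResultTracker()
print(t.handle("c1", 0, lambda: "ok"))   # NEW: executes, returns "ok"
print(t.handle("c1", 0, lambda: "ok"))   # retry: returns the recorded result
t.ack("c1", 1)
print(t.handle("c1", 0, lambda: "ok"))   # delayed retry after the ack: STALE
```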
Observable Causal Consistency
- Old approach: evolve the data store only through monotonically non-decreasing updates
- Occult: let clients decide when it is safe to read

Framework
- Key-value store, divided into shards
- One master server per shard, multiple slaves
- Clients write to the master, read from any replica
- One shardstamp per shard; causal timestamp = vector of shardstamps
- Write
  - Client attaches its causal timestamp and sends it to the master
  - Master increments the shardstamp, stores it with the new value, and returns it
  - Client updates its causal timestamp
- Read
  - Server returns the value and its causal timestamp
  - Client checks whether it is at least its local timestamp
  - If not => retry, or go to the master
  - Client updates its local timestamp

Transactions
- Guarantees
  - Observe a consistent snapshot
  - No write-write conflicts
  - Transactions in the same client session must be totally ordered
- Protocol to execute a transaction $T$...
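The client-side read check and timestamp bookkeeping above can be sketched as follows (function names are mine; causal timestamps here are dicts mapping shard -> shardstamp, requires Python 3.9+ for the dict `|` merge):

```python
def merge(ts_a, ts_b):
    # Element-wise max of two causal timestamps (how a client advances
    # its local timestamp after a write or a successful read).
    return {s: max(ts_a.get(s, 0), ts_b.get(s, 0)) for s in ts_a | ts_b}

def safe_to_read(client_ts, shard, server_shardstamp):
    # The slave is fresh enough iff its shardstamp has caught up with
    # everything this client has already observed for that shard.
    return server_shardstamp >= client_ts.get(shard, 0)

client_ts = {"A": 3, "B": 7}
print(safe_to_read(client_ts, "B", 7))   # True: slave caught up, accept value
print(safe_to_read(client_ts, "B", 5))   # False: stale -> retry or go to master
client_ts = merge(client_ts, {"B": 9, "C": 2})
print(client_ts)                          # {'A': 3, 'B': 9, 'C': 2}
```

Because the staleness check happens at the client, a lagging slave only delays the clients that actually depend on its missing writes, which is how Occult avoids slowdown cascades.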

April 8, 2022

Bayou

- Weakly consistent: may return conflicting data (but will let clients know)
- Supports application-specific conflict detection and resolution

Example App
- Meeting room scheduler
- Users select several acceptable meeting times
- A reservation may be "tentative"

System Model
- Clients write to any server
- The server assigns a WriteID
- Servers propagate writes during pair-wise contact, aka anti-entropy sessions

Dependency Check
- Bayou_Write (update, dependency_check, mergeproc)
- A write carries an application-supplied query and its expected result
- Unexpected result => conflict => apply the merge procedure

Replica Consistency
- Writes are performed in the same order at all servers
- Conflict detection and merge procedures are deterministic
- Servers need to know how to roll back tentative executions

Write Stability
- Primary commit
- Which server to choose as primary?...
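The dependency-check mechanism can be sketched with the meeting-room example (a minimal sketch with hypothetical helper names, not Bayou's actual interface): the check is a query plus its expected result, and when the query evaluates differently at some server, the application-supplied merge procedure runs instead, here falling back to an alternate time slot.

```python
def bayou_write(db, update, dependency_check, mergeproc):
    query, expected = dependency_check
    if query(db) == expected:
        update(db)               # no conflict: apply the tentative update
    else:
        mergeproc(db)            # conflict: application-specific resolution

# db maps (room, hour) -> the user holding the slot
db = {}

def reserve(room, user, slots):
    # Dependency check asks whether the preferred slot is still free;
    # the merge procedure grabs the first free alternate slot instead.
    first, *alternates = slots
    bayou_write(
        db,
        update=lambda d: d.__setitem__((room, first), user),
        dependency_check=(lambda d: (room, first) in d, False),
        mergeproc=lambda d: next(
            d.__setitem__((room, h), user)
            for h in alternates if (room, h) not in d),
    )

reserve("r1", "alice", [10, 11])   # slot 10 free -> alice gets 10
reserve("r1", "bob",   [10, 11])   # conflict at 10 -> merge picks 11
print(db)   # {('r1', 10): 'alice', ('r1', 11): 'bob'}
```

Note why both the check and the merge are deterministic functions of the database state: every server replays the same write log in the same order, so each must reach the same conclusion about whether the write conflicted and how it was resolved.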

April 5, 2022