Bigtable

Goals: wide applicability, scalability, high performance, high availability.

Data Model
- Rows
  - Every read or write of data under a single row key is atomic
  - Rows are sorted in lexicographic order
  - Tablet: a row range
- Columns
  - Grouped into column families; column key = family:qualifier
  - Access control is at the family level
- Timestamps
  - A cell can store multiple versions of the data, indexed by timestamp

Implementation
- Master
  - Assigns tablets to tablet servers
  - Detects the addition and expiration of tablet servers
  - Load balancing, garbage collection
- Tablet location (see the lookup sketch below)
  - Read a Chubby file -> location of the root tablet
  - Root tablet: contains the locations of all METADATA tablets
  - METADATA table: row key = (tablet's table id, end row); value = location of the data tablet plus other metadata
- Tablet assignment
  - Tablet servers create a file and acquire an exclusive lock in Chubby
  - Master detects whether a tablet server is alive by periodically asking it about its lock
  - If the server times out or reports a lost lock, the master tries to acquire the lock on the server's file
    - If the master gets the lock => Chubby is up and the server is down => the master deletes the server's file
    - A tablet server that loses its Chubby lock kills itself
  - If the master can't reach Chubby => the master kills itself
  - On master boot: scan Chubby and ask every tablet server for its tablets to rebuild the tablet <=> server mapping
  - Tablet servers split tablets => commit the new tablet to the METADATA table => notify the master
- Tablet serving
  - memtable => recent updates
  - SSTables => older updates
- Optimizations
  - Locality groups: one SSTable per locality group (a group of column families)
  - Compression
  - Caching
  - Bloom filters: reduce SSTable accesses by asking whether a key might be in that SSTable
  - Commit log: each tablet server appends mutations for all of its tablets to a single physical log file
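
A minimal sketch of the three-level tablet-location lookup described above. The Chubby path, `MetadataClient`, and the injected helpers are illustrative assumptions, not names from the paper.

```python
# Sketch: resolve (table_id, row_key) -> tablet server via the three-level hierarchy.
from typing import NamedTuple

class TabletLocation(NamedTuple):
    table_id: str
    end_row: str          # METADATA row key is (table_id, end_row)
    server: str           # address of the tablet server holding this tablet

class MetadataClient:
    def __init__(self, chubby, read_tablet_rows):
        self.chubby = chubby                      # hypothetical: returns the root tablet location
        self.read_tablet_rows = read_tablet_rows  # hypothetical: reads rows from a given tablet

    def locate(self, table_id: str, row_key: str) -> TabletLocation:
        # Level 1: a Chubby file stores the root tablet's location.
        root = self.chubby.read("/bigtable/root-tablet-location")
        # Level 2: the root tablet maps METADATA row keys to METADATA tablets
        # (simplified here to a scan of METADATA entries).
        meta_tablet = self._scan_for(root, "METADATA", row_key)
        # Level 3: the METADATA tablet maps (table_id, end_row) to the user tablet.
        return self._scan_for(meta_tablet, table_id, row_key)

    def _scan_for(self, tablet, table_id, row_key):
        # Rows are sorted lexicographically, so the first tablet whose
        # end_row >= row_key covers the key.
        for loc in self.read_tablet_rows(tablet, table_id):
            if row_key <= loc.end_row:
                return loc
        raise KeyError(f"no tablet covers {row_key!r} in {table_id!r}")
```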

April 23, 2022

Dynamo

Assumptions
- Key-value store; small objects (< 1 MB)
- Weak consistency; must be always writeable
- Conflict resolution happens during reads
  - The application can implement its own merge policy
  - Or use a simple policy the data store provides (last write wins)

Architecture
- Interface
  - get(key) -> (Array[object], context): one object, or several conflicting versions
  - put(key, context, object); the context carries the vector clock and other metadata
- Partitioning (consistent hashing)
  - Each node hashes to a position on a ring
  - Data hashes to another position on the ring; walk clockwise to find the nearest node
  - Improvement: each node maps to many virtual nodes on the ring (see the sketch below)
- Replication
  - Replicated at the N-1 clockwise successor nodes
  - Preference list: the list of nodes responsible for storing a particular key
- Versioning
  - Vector clocks; clock truncation: remove the oldest entry when the clock exceeds a size threshold
- get()/put()
  - Configure the quorum parameters $R$ and $W$
  - Write: send to a coordinator (the nearest node, or via a load balancer); the coordinator sends the write to the top $N$ nodes in the preference list and waits for $W-1$ replies ($W$ writes in total) before returning
  - Read: read from the $N$ nodes, wait for $R$ responses

Failures
- Hinted handoff: when a server is down, send its data to the next healthy node along with a hint; when the failed node recovers, the node holding the hint sends the data back to it
- Membership changes require manual action
- Each node contacts a peer at random every second to gossip membership/partition changes
- Failure detection is local ...
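
A small sketch of consistent hashing with virtual nodes and a preference list, in the spirit of Dynamo's partitioning. The hash function, `VNODES_PER_NODE`, and the node names are illustrative assumptions, not values from the paper.

```python
import bisect
import hashlib

VNODES_PER_NODE = 8  # assumed; Dynamo tunes this per node capacity

def _hash(value: str) -> int:
    return int.from_bytes(hashlib.md5(value.encode()).digest(), "big")

class Ring:
    def __init__(self, nodes):
        # Each physical node owns several positions (virtual nodes) on the ring.
        self._points = sorted(
            (_hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(VNODES_PER_NODE)
        )
        self._keys = [p for p, _ in self._points]

    def preference_list(self, key: str, n: int):
        """Return the first n distinct physical nodes clockwise from hash(key)."""
        start = bisect.bisect_right(self._keys, _hash(key)) % len(self._points)
        seen, result = set(), []
        for i in range(len(self._points)):
            node = self._points[(start + i) % len(self._points)][1]
            if node not in seen:
                seen.add(node)
                result.append(node)
                if len(result) == n:
                    break
        return result

ring = Ring(["node-a", "node-b", "node-c", "node-d"])
print(ring.preference_list("user:42", n=3))  # coordinator + 2 replicas
```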

April 15, 2022

GFS

Assumptions
- Component failures are common => fast recovery
- Optimize for large files
- Large streaming reads / small random reads
- Large sequential writes; atomic record append
- High bandwidth matters more than low latency

Architecture
- Single master server, multiple chunkservers
- Chunks
  - chunk handle: assigned by the master
  - chunks are stored on local disks as Linux files
  - each chunk is replicated on 3+ chunkservers
- Master maintains all metadata (ACLs, file <-> chunk mapping, chunk locations, ...), garbage collection, lease management, ...
- Read operation (see the sketch below)
  1. Client sends (filename, chunk index) to the master
  2. Master replies with (chunk handle, chunkserver locations)
  3. Client reads directly from a chunkserver

Metadata
- File and chunk namespaces
- File <=> chunk mapping
- Locations of each chunk's replicas
- All kept in memory; the first two are persisted on disk; the third is rebuilt by asking chunkservers on startup

Consistency Model
- Record append: appended atomically at least once
- Clients cache chunkserver locations, so they may read stale data

Mutation Order
- Mutation order within the same chunk is consistent across replicas
- The master grants a chunk lease to a primary replica
- The primary picks a serial order for all mutations

Snapshot (copy-on-write)
- Client sends a snapshot request to the master
- Master revokes outstanding leases, then duplicates the metadata
- When a client later wants to write chunk C, it asks the master for the replica locations
- Master picks a new chunk handle C' and asks each replica of C to create C' locally

Garbage Collection
- Soft delete (rename the file); collect garbage in the background

Stale Replica Detection
- The master assigns a chunk version number whenever it grants a new lease
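
A minimal sketch of the read path just described, assuming hypothetical `master.find_chunk` and `chunkservers[addr].read_chunk` RPC stubs. The 64 MB chunk size is the paper's default; everything else is illustrative.

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB chunks

def read(master, chunkservers, filename: str, offset: int, length: int) -> bytes:
    """Read `length` bytes at `offset` from `filename` (single-chunk case)."""
    chunk_index = offset // CHUNK_SIZE

    # 1. Client -> master: (filename, chunk index).
    #    Master -> client: (chunk handle, replica locations). Clients cache this.
    handle, replicas = master.find_chunk(filename, chunk_index)

    # 2. Client -> chunkserver: (chunk handle, byte range within the chunk).
    #    File data never flows through the master.
    chunk_offset = offset % CHUNK_SIZE
    for addr in replicas:                      # try replicas until one answers
        try:
            return chunkservers[addr].read_chunk(handle, chunk_offset, length)
        except ConnectionError:
            continue
    raise IOError(f"no live replica for chunk {handle}")
```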

April 14, 2022

Nil-Ext

Nil-externality
- A nilext operation does not externalize state: it returns no execution result or execution error (returning a validation error is fine)

Nilext-aware Replication
- Normal (Paxos): writes take 2 RTTs; reads go to the leader (1 RTT)
- Nilext-aware: the client sends the request to all replicas and waits for replies from a supermajority that includes the leader
- Other approaches
  - Network ordering: relies on a special underlying network
  - Speculative execution: execute, then roll back on mis-speculation
  - Exploiting commutativity

SKYROS
- Update (nilext)
  - Client sends the update to all replicas
  - Waits for a supermajority of $f + \lceil f/2 \rceil + 1$ replies, with the leader included (see the quorum sketch below)
  - Each replica writes the update to its durability log and returns
- Background ordering and execution
  - Leader batches the durability log and adds entries to the consensus log
  - Leader sends prepare to $f$ followers; followers add the entries to their consensus logs
  - Leader applies the update and removes it from its durability log
  - Leader sends commit to the followers; followers apply and remove the update from their durability logs
- Read
  - If the key has pending updates in the durability log: add all requests from the d-log to the consensus log, wait for $f$ followers to respond, then serve the read
  - Else: return the data directly
- Non-nilext updates
  - Client sends the update to the leader only
  - Leader adds all pending d-log entries plus this update to the consensus log
  - Wait for $f$ followers => return
- Recovery
  - Replica failure: mark status as recovering; send a recovery request and wait for $f+1$ replies (including the leader) from the latest view; copy the d-log and c-log
  - Leader failure
    - C-log: use the latest log with the largest view number
    - D-log
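
A short sketch of the quorum arithmetic above: with $2f+1$ replicas, the nilext fast path waits for a supermajority of $f + \lceil f/2 \rceil + 1$ replies that must include the leader, while background ordering needs only the usual majority of $f+1$. The helper names are illustrative.

```python
import math

def quorum_sizes(f: int) -> dict:
    n = 2 * f + 1
    return {
        "replicas": n,
        "majority": f + 1,                          # ordinary Paxos/VR quorum
        "supermajority": f + math.ceil(f / 2) + 1,  # fast-path durability quorum
    }

def fast_path_ok(replies: set, leader: str, f: int) -> bool:
    """True if the replies form a valid supermajority that includes the leader."""
    return leader in replies and len(replies) >= quorum_sizes(f)["supermajority"]

for f in (1, 2, 3):
    print(f, quorum_sizes(f))
# f=1 -> 3 replicas, supermajority 3; f=2 -> 5 replicas, supermajority 4;
# f=3 -> 7 replicas, supermajority 6
```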

April 12, 2022

Linearizability, Occult

Linearizability (RIFL)
- RIFL: Reusable Infrastructure for Linearizability
  - Turns at-least-once RPC semantics into exactly-once
  - Assumes a system built on RPCs, where the RPC layer provides at-least-once semantics and does not abort RPCs after a server crash
- Architecture
  - RPC ID: assigned by clients, | client ID | sequence number |
  - Completion record durability: RPC ID + result, created atomically with the update
  - Retry rendezvous: a retry must be answered with the previous result; a retry may go to another server, so the completion record must be there too => store/move the completion record with the associated object
  - Garbage collection: the server can remove a completion record when the client acknowledges the server's response, or when the client crashes
  - Lease: each client holds a private lease; client ID = lease ID; if the client crashes, the lease expires, and on reboot it gets a new client ID
- Design detail
  - Server, upon receiving an RPC, checks for duplicates (see the RIFL sketch below):
    - NEW (normal case): execute the RPC
    - COMPLETED: return the old result
    - IN_PROGRESS: discard, or notify the client
    - STALE (a retry delayed until after the client's ACK): return an error
  - Completion record: (RPC ID, object ID, result); created atomically with the update; marked COMPLETED and persisted on disk before returning
  - Client, on receiving a reply, marks that sequence number as completed; on the next RPC call it piggybacks its minimum incomplete sequence number, and the server garbage-collects records with smaller sequence numbers
  - Lease server (ZooKeeper)
    - Scalability: don't store the lease expiration time on disk, only the existence of the lease
    - Validation: the lease server implements a cluster clock; the client gets the clock value when it renews its lease and includes it in RPC calls; the server checks with the lease server only if the lease is close to expiring
- Transactions
  - Use RIFL for prepare and requestAbort
  - Normal case: the client sends prepare to each object's server (participant); the participants acquire object locks and return PREPARED or ABORT; the client then sends the decision
  - Client crash: the first participant becomes the recovery coordinator and sends requestAbort (with the same RPC ID as prepare); it receives PREPARED (already prepared) or ABORT, then sends the decision
  - Garbage collection: the client only marks prepare as completed after receiving the decision response

Occult
- Goal: prevent slowdown cascades
- Observable Causal Consistency
  - Old approach: evolve the data store only through monotonically non-decreasing updates
  - Occult: let clients decide when it is safe to read
- Framework
  - Key-value store divided into shards; one master server per shard, multiple slaves
  - Clients write to the master and read from any replica
  - A shardstamp per shard; causal timestamp = vector of shardstamps
- Write
  - Client attaches its causal timestamp and sends the write to the master
  - Master increments its shardstamp, stores it with the new value, and returns it
  - Client updates its causal timestamp
- Read (see the Occult sketch below)
  - Server returns the value and its causal timestamp
  - Client checks whether the returned timestamp is at least its local one; if not, retry or go to the master
  - Client updates its local causal timestamp
- Transactions
  - Guarantees: observe a consistent snapshot; no write-write conflicts; transactions in the same client session are totally ordered
  - Protocol to execute a transaction $T$...
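
A minimal sketch of the server-side duplicate check described in the RIFL notes (NEW / COMPLETED / IN_PROGRESS / STALE), with a plain dict standing in for durable completion records. Class and method names are assumptions, not RIFL's actual API.

```python
from enum import Enum, auto

class RpcState(Enum):
    NEW = auto()          # first time this RPC ID is seen
    COMPLETED = auto()    # finished earlier; completion record holds the result
    IN_PROGRESS = auto()  # currently executing on this server
    STALE = auto()        # client already acked; the record was garbage-collected

class ResultTracker:
    def __init__(self):
        self.records = {}          # (client_id, seq) -> result
        self.in_progress = set()
        self.gc_floor = {}         # client_id -> smallest non-acked seq number

    def check_duplicate(self, client_id, seq):
        if seq < self.gc_floor.get(client_id, 0):
            return RpcState.STALE, None
        if (client_id, seq) in self.records:
            return RpcState.COMPLETED, self.records[(client_id, seq)]
        if (client_id, seq) in self.in_progress:
            return RpcState.IN_PROGRESS, None
        self.in_progress.add((client_id, seq))
        return RpcState.NEW, None

    def record_completion(self, client_id, seq, result):
        # In RIFL this record is created atomically with the object update and
        # persisted before the reply; here it is just an in-memory dict.
        self.in_progress.discard((client_id, seq))
        self.records[(client_id, seq)] = result

    def garbage_collect(self, client_id, first_incomplete_seq):
        # Piggybacked by the client: every seq below this value has been acked.
        self.gc_floor[client_id] = first_incomplete_seq
        self.records = {k: v for k, v in self.records.items()
                        if not (k[0] == client_id and k[1] < first_incomplete_seq)}
```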
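
A small sketch, following the Occult notes above, of the client-side read check: compare the returned causal timestamp against the client's local one and fall back to the shard master when the replica is behind. All names here are illustrative.

```python
def is_at_least(server_ts: dict, client_ts: dict) -> bool:
    """True if the server's causal timestamp dominates the client's (entry-wise)."""
    return all(server_ts.get(shard, 0) >= stamp
               for shard, stamp in client_ts.items())

def causal_read(key, replica, master, client_ts: dict):
    value, server_ts = replica.get(key)
    if not is_at_least(server_ts, client_ts):
        # The replica is stale relative to what this client has already observed:
        # retry against the shard master, which is always up to date.
        value, server_ts = master.get(key)
    # Merge the read's causal timestamp into the client's (entry-wise max).
    for shard, stamp in server_ts.items():
        client_ts[shard] = max(client_ts.get(shard, 0), stamp)
    return value
```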

April 8, 2022