Chubby

- Rationale
    - Easier to migrate (from non-highly-available to HA)
    - Store/fetch small files
    - More familiar to programmers
    - Reduces the number of servers
- Lock
    - Coarse-grained: held for hours or days
    - Lock-acquisition rate only weakly related to client transaction rate
    - Rare acquisition -> temporary lock-server unavailability -> no big deal
    - Should not lose a lock when the lock server reboots
- System structure
    - Client + library (linked in) <-> server (5 nodes)
    - Chubby servers, called replicas, elect a master
        - Clients ask any replica where the master is
    - All replicas maintain an identical database
        - Only the master reads/writes; a write is acked after a majority has written it; reads are served by the master only
    - A new replica is added if one does not recover after a few hours
- Files
    - Name, e.g. /ls/foo/wombat/pouch
        - ls: common prefix (lock service)
        - foo: name of a Chubby cell
        - Special name local: the local Chubby cell
    - Operations
        - Only whole-file reads
        - No moving files between directories (they may be served by different Chubby masters)
        - No soft/hard links
    - Ephemeral nodes (temp files)
    - Metadata ^a9d075
        - Instance number: greater than that of any previous node with the same name
        - Content generation number (files only): changes when the file is written
        - Lock generation number: increases on each free -> held transition
        - ACL generation number: increases when the ACL names change
        - 3 ACL names
        - Checksum
    - File handle ^e25fcd
        - Check digits (prevent forging handles)
        - Sequence number (was the handle generated by a previous master?...
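The lock generation number above is what lets a downstream service detect a stale holder after the lock changes hands (e.g. across a lock-server reboot). A minimal sketch, with class and method names of my own invention, not Chubby's actual API:

```python
# Sketch (assumed names, not Chubby's API): a lock generation number that
# increases on every free -> held transition lets a protected service
# reject writes from a client that has since lost the lock.

class LockServer:
    def __init__(self):
        self.holder = None       # current holder, or None if free
        self.generation = 0      # lock generation number

    def acquire(self, client):
        if self.holder is None:
            self.generation += 1   # free -> held transition
            self.holder = client
            return self.generation  # client attaches this to later requests
        return None                 # lock busy

    def release(self, client):
        if self.holder == client:
            self.holder = None


class ProtectedService:
    """A service guarded by the lock; it validates the generation number."""
    def __init__(self, lock_server):
        self.lock_server = lock_server

    def write(self, client, generation, data):
        # A stale generation means the lock was released and re-acquired
        # since this client last held it.
        if (self.lock_server.holder != client
                or generation != self.lock_server.generation):
            raise PermissionError("stale lock generation")
        return f"wrote {data}"
```

Usage: if client A acquires at generation 1, loses the lock, and client B acquires at generation 2, then A's writes carrying generation 1 are rejected while B's succeed.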

March 18, 2022

Raft

- Paxos issues
    - Hard to understand
    - Difficult to build
        - Why collect log entries independently and combine them?
        - Why peer-to-peer?
- Implementation
    - An entry is committed once a majority stores an entry of the current term
- State
    - Leader: steps down to follower on seeing a higher term
    - Follower: issues no RPCs
    - Candidate
- Joint consensus
    - Log entries replicated to all servers (across both configurations)
    - Any server may be leader
    - Agreement requires majorities from both the old and the new configuration
    - First commit $C_{old,new}$, then commit $C_{new}$
    - Additional phase for new servers to catch up
    - Leader steps down if it is not in $C_{new}$
    - Ignore a removed server's RequestVote if the real leader's heartbeat was received recently
- Snapshot
    - Each server snapshots independently
    - Snapshot, then delete the old log
    - Leader can send snapshots to lagging followers
- Client
    - Prevent invoking the same command twice: add a unique id for each command in the log
    - Read-only op?...
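The joint-consensus commit rule above ("majorities from both old and new config") can be sketched in a few lines; the representation of configurations as id sets is my own, not Raft's wire format:

```python
# Sketch: during joint consensus (C_old,new), an entry is committed only
# with majorities from BOTH the old and the new configuration.

def has_majority(acked, config):
    """acked: set of server ids that have stored the entry."""
    return len(acked & config) > len(config) // 2

def joint_committed(acked, c_old, c_new):
    return has_majority(acked, c_old) and has_majority(acked, c_new)
```

For example, with `c_old = {1, 2, 3}` and `c_new = {3, 4, 5}`, acks from `{1, 2, 3, 4}` commit the entry, but acks from `{3, 4, 5}` do not: that set is a majority of `c_new` only, which is exactly the split-brain case joint consensus prevents.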

February 25, 2022

AFS + Coda

AFS
- Vice (server), Venus (client)
- Prototype
    - Stub directories represent portions of the Vice name space located on other servers
        - If a file is not on this server, search the stubs to find which server has it
    - Files named by full pathname, no inodes
    - Replication: read-only replication for the topmost levels
    - Cache: Venus asks the server for a timestamp on open
    - Performance
        - Many stat calls -> bad performance; limited to ~20 users
        - One process per client
    - Hard to move files between servers
    - Benchmark
        - Many TestAuth (cache validation) and GetFileStat calls
        - CPU bottleneck: context switches, path traversal
        - Unbalanced server load
- Revised design
    - Cache management
        - Cache directory contents & symbolic links
        - Status cache in memory (for stat), data cache on disk
        - Directories modified directly on the server
    - Consistency
        - Old: client asks the server whether a file changed
        - New: client caches; server promises to notify on change
    - Name resolution
        - Reintroduce two-level names (fid, pathname); the client converts a pathname to a fid
        - fid: (volume number, vnode number, uniquifier)
    - Low-level storage representation
        - Server: table[vnode number] = inode number (vnode number is the index)
    - Overview
        - When a client opens a file, it walks each path component, caching it and setting up a callback (if not already present)
        - The client selects a server by checking the volume number in a mapping cache; if not cached, it contacts any server
    - Semantics
        - Writes to a file are immediately visible to processes on the same machine, but not over the network
        - File changes are flushed on close
        - Other file operations are visible immediately everywhere
        - Multiple workstations can operate concurrently, but programs need to cooperate (if they care)
    - Disadvantages
        - No concurrent read/write across clients
        - No diskless operation
        - Building a distributed database is hard
        - Latency

Coda
- Availability
    - Volume storage group (VSG)
    - Disconnected operation
- Scalability
    - Callback-based cache coherence
    - Whole-file caching
    - Place functionality on clients
    - Avoid system-wide rapid change
- First & second class replication
    - First class: servers (persistent, secure, complete…)
    - Second class: clients
- Optimistic vs. pessimistic
    - Pessimistic
        - Client acquires exclusive control: blocks reads/writes by others
        - Client acquires shared control: allows reading at other replicas
    - Optimistic (Coda uses this)
        - Read/write everywhere; deal with conflicts later
- Implementation
    - States
        - Hoarding (normal)
            - Hoard database + file usage history
            - Hierarchical cache management: a parent cannot be removed before its children
        - Emulation (disconnected)
        - Reintegration (connection resumed)
    - Hoarding
        - Hoard walking
            - Runs every 10 min
            - Updates name bindings (checks for new files under + entries, which indicate future children needing high priority)
            - Restores equilibrium by fetching and evicting cache entries
        - On callback break
            - Files and symbolic links: purge the object; update on demand or during the next hoard walk
            - Directories: mark the cache entry as suspicious
    - Emulation
        - A modified object has infinite priority
        - Log all changes to a log file; optimization: coalesce multiple writes into one store
        - Store metadata in recoverable virtual memory (RVM)
    - Replay algorithm
        1. Parse the log; lock all related files
        2. Validate and execute (for store, only execute the metadata update)
        3. Data transfer for store
        4. Commit and release locks
    - Conflicts
        - During phase 2 of replay, check whether the storeid is the same; if the server has a newer storeid, abort
- Questions
    - AFS initial prototype....
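The revised AFS name-resolution path (pathname -> fid, then route by volume number through a mapping cache) can be sketched as follows; `Venus` is the client component named in the paper, but the fields and method are assumptions of mine:

```python
# Sketch (assumed structures): AFS-style name resolution. The client turns
# a pathname into a fid = (volume, vnode, uniquifier), then picks the
# server by looking up the volume number in a mapping cache.

from collections import namedtuple

Fid = namedtuple("Fid", ["volume", "vnode", "uniquifier"])

class Venus:
    """Client-side cache manager (name from the paper; internals assumed)."""
    def __init__(self, volume_to_server, name_to_fid):
        self.volume_to_server = volume_to_server  # volume mapping cache
        self.name_to_fid = name_to_fid            # cached directory lookups

    def resolve(self, pathname):
        fid = self.name_to_fid[pathname]
        server = self.volume_to_server.get(fid.volume)
        if server is None:
            # Not in the mapping cache: per the notes, any server can be
            # asked for the volume's location (not modeled here).
            raise LookupError("volume location unknown; contact any server")
        return fid, server
```

Because the server indexes `table[vnode number] = inode number`, the fid alone is enough to open the file on the chosen server, with no server-side path traversal, which was the prototype's CPU bottleneck.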

February 24, 2022

Chain Replication

- Guarantees: strong consistency
- Interface
    - query(objId, opts)
    - update(objId, newVal, opts)
- Protocol
    - Reply: sent by the tail
    - Query: processed at the tail
    - Update: directed to the head, then chained to the tail
        - Computed at the head, with the resulting state sent down to the tail; can be non-deterministic
    - Hist_objID = Hist_objID of the tail
    - Pending_objID = requests received but not yet processed by the tail
- Failure
    - Master service never fails ([[Paxos]]); it reconfigures the chain
- Vs. primary/backup: downtime
    - Chain
        - Head failure: queries uninterrupted; updates unavailable for 2 message delays
        - Middle failure: queries and updates uninterrupted; ongoing updates delayed by 4 message delays, not visible to clients
        - Tail failure: queries and updates unavailable for 2 message delays
    - Primary/backup
        - Primary failure: queries and updates unavailable for 5 message delays
            1. Master broadcasts to all backups
            2. Backups respond with the number of updates processed
            3. Master broadcasts the new primary
            4. New primary forwards all missing updates
            5. Master notifies clients
        - Backup failure: queries uninterrupted; updates delayed by 1 message delay (master notifies the primary; no ACK needed)
- In-class
    - Primary-backup replication
        - Solves: ordering, non-determinism
        - Leaves: failure handling
        - Write: send to all backups
        - Read: only on the primary; a read waits for concurrent writes to complete
    - Chain replication
        - Goals: high availability, high throughput, strong consistency, handling fail-stop failures with low mean time to failure, easy to add nodes
        - Non-goals: handling slow nodes, misbehaving nodes, wide-area networks (WAN)
        - Basic operations
            - val = query(obj-id): idempotent
            - update(obj-id, newVal): not idempotent; can handle more complex functions
        - What to do on loss?...
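The data path above (update at the head, state chained to the tail, query and reply at the tail) can be sketched as a linked list of nodes; the class and function names are my own, not the paper's notation:

```python
# Sketch (assumed names): chain replication data path. Updates enter at the
# head and the resulting state flows down the chain; queries and replies
# come from the tail, so a client only ever observes state that every node
# in the chain has already stored.

class Node:
    def __init__(self, next_node=None):
        self.store = {}
        self.next = next_node

    def apply(self, obj_id, new_val):
        # The head computes the new state once; downstream nodes just copy
        # it, so even a non-deterministic update stays consistent.
        self.store[obj_id] = new_val
        if self.next is not None:
            return self.next.apply(obj_id, new_val)
        return self.store[obj_id]  # reached the tail: send the reply

def update(head, obj_id, new_val):
    return head.apply(obj_id, new_val)

def query(tail, obj_id):
    return tail.store.get(obj_id)
```

In this picture, Pending_objID is exactly the set of updates sitting at the head or middle nodes that the tail has not yet applied; a query never sees them, which is what makes the guarantee strong consistency rather than eventual.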

February 24, 2022

Failures

Why Do Computers Stop
- Availability: doing the right thing within the specified response time
- Reliability: not doing the wrong thing
- Infant mortality: bugs in new hardware/software
- Bohrbug vs. Heisenbug
    - Heisenbug: does not happen 100% of the time
- Fault-tolerant execution
    - Process isolation: communicate via kernel messages, share no state
    - Process-pairs (spawn a backup process)
        - Lockstep: 2 processes doing exactly the same thing
            - Cons: can't tolerate Heisenbugs
        - State checkpointing: send full state to the backup after every move
            - Cons: hard to program
        - Automatic checkpointing: save all messages and replay them to the backup when the primary fails
            - Cons: sends a lot of data
        - Delta checkpointing: send state deltas to the backup
            - Cons: hard to program
        - Persistent: lose all state when failing over to the backup
- Questions
    - What is the motivation for this work?...
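The state-checkpointing process-pair variant above can be sketched as follows; the class names and `failover` helper are assumptions of mine, not the paper's mechanism names:

```python
# Sketch (assumed names): a process-pair with state checkpointing. The
# primary ships its full state to the backup after every operation, so the
# backup can take over without losing acknowledged work. Because the backup
# does not re-execute the primary's instructions, a Heisenbug that crashed
# the primary is unlikely to recur on takeover.

class Backup:
    def __init__(self):
        self.state = {}

    def checkpoint(self, state):
        self.state = dict(state)  # full-state copy "after every move"


class Primary:
    def __init__(self, backup):
        self.state = {}
        self.backup = backup

    def apply(self, key, val):
        self.state[key] = val
        self.backup.checkpoint(self.state)  # checkpoint before acking
        return val


def failover(backup):
    """On primary failure, the backup resumes from the last checkpoint."""
    return backup.state
```

The "hard to program" con is visible even here: the programmer must decide what counts as "state" and checkpoint at every externally visible step, whereas lockstep and automatic checkpointing move that burden into the system at the cost of replaying Heisenbugs or shipping every message.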

February 24, 2022