Raft

- Paxos issues
  - Hard to understand
  - Difficult to build
  - Why collect logs independently and then combine them? Why peer-to-peer?
- Implementation
  - A log entry is committed once a majority has stored an entry of the leader's current term
- Server states
  - Leader: steps down to follower if it sees a higher term
  - Follower: passive; issues no RPCs
  - Candidate
- Joint consensus (membership changes)
  - Log entries are replicated to all servers across both configurations
  - Any server from either configuration may be leader
  - Agreement requires majorities from both the old and the new configuration
  - First commit $C_{old,new}$, then commit $C_{new}$
  - An additional phase lets new servers catch up before the change
  - The leader steps down if it is not in $C_{new}$
  - Servers ignore a removed server's RequestVote if they have recently received the real leader's heartbeat
- Snapshots
  - Each server snapshots independently, then deletes the log prefix the snapshot covers
  - The leader can send snapshots to lagging followers
- Clients
  - Prevent invoking the same command twice: attach a unique id to each command in the log
  - Read-only ops?...
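The commit rule above ("committed once a majority has an entry of the current term") can be sketched as follows. This is a minimal illustration with assumed names (`advance_commit_index`, `match_index`), not code from any Raft implementation:

```python
def advance_commit_index(log, match_index, commit_index, current_term, n_servers):
    """Return the highest index the leader may safely commit.

    log: list of (term, command), 1-indexed conceptually.
    match_index: for each peer, the highest log index known replicated there
    (the leader itself always matches its whole log).
    """
    for n in range(len(log), commit_index, -1):
        replicas = 1 + sum(1 for m in match_index if m >= n)  # leader + peers
        # Both conditions must hold: majority replication AND the entry is
        # from the leader's current term (entries from older terms are only
        # committed indirectly, once a current-term entry commits above them).
        if replicas * 2 > n_servers and log[n - 1][0] == current_term:
            return n
    return commit_index
```

For a 3-server cluster, a term-2 leader with log `[(1,'a'),(2,'b')]` and one peer at index 2 can commit through index 2; a term-3 leader over the same log commits nothing until it replicates a term-3 entry.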

February 25, 2022

AFS + Coda

AFS
- Vice (server), Venus (client)
- Prototype
  - Stub directories represent portions of the Vice name space located on other servers
    - If a file is not on a server, search the stubs to find which server holds it
  - Files named by full pathname; no inodes
  - Replication: read-only replication for the topmost levels
  - Cache: Venus asks the server for a timestamp on open
  - Performance
    - Many stat calls -> bad performance; limited to ~20 users
    - One process per client
    - Hard to move files between servers
  - Benchmark findings
    - Many TestAuth (cache validation) and GetFileStat calls
    - CPU bottleneck: context switches and path traversal
    - Unbalanced server load
- Revised design
  - Cache management
    - Cache directory contents & symbolic links
    - Status cache in memory (for stat); data cache on disk
    - Directories modified directly on the server
  - Consistency
    - Old: client asks the server whether a file changed
    - New: client caches; server promises to notify on change (callback)
  - Name resolution
    - Reintroduce two-level names (fid, pathname); the client converts pathnames to fids
    - fid = (volume number, vnode number, uniquifier)
  - Low-level storage representation
    - Server uses table[vnode number] = inode number (the vnode number is the index)
  - Overview
    - When a client opens a file, it walks each path component, caches it, and sets up a callback (if not already cached)
    - The client selects the server by checking the volume number in a mapping cache; if not cached, it contacts any server
  - Semantics
    - Writes to a file are immediately visible to processes on the same machine, but not across the network
    - File changes are flushed on close; other file operations are visible immediately everywhere
    - Multiple workstations can operate concurrently, but programs must cooperate (if they care)
  - Disadvantages
    - No concurrent read/write across clients
    - No diskless operation
    - Building a distributed database on top is hard
    - Latency

Coda
- Availability
  - Volume storage group (VSG)
  - Disconnected operation
- Scalability
  - Callback-based cache coherence
  - Whole-file caching
  - Place functionality on clients
  - Avoid system-wide rapid change
- First- & second-class replication
  - First class: servers (persistent, secure, complete…)
  - Second class: clients
- Optimistic vs. pessimistic replica control
  - Pessimistic
    - Client acquires exclusive control: blocks reads/writes at other replicas
    - Client acquires shared control: allows reading at other replicas
  - Optimistic (Coda uses this)
    - Read/write everywhere; deal with conflicts later
- Implementation
  - States
    - Hoarding (normal)
    - Emulation (disconnected)
    - Reintegration (connection resumed)
  - Hoarding
    - Hoard database + file usage history
    - Hierarchical cache management: a parent cannot be evicted before its children
    - Hoard walking
      - Runs every 10 min
      - Updates name bindings (checks new files for "+" entries, which indicate future children needing high priority)
      - Restores equilibrium by fetching and evicting cache entries
    - On callback break
      - Files and symbolic links: purge the object; refetch on demand or during the next hoard walk
      - Directories: mark the cache entry as suspicious
  - Emulation
    - Modified objects get infinite priority
    - Log all changes to a log file; optimization: collapse multiple writes into one store
    - Metadata kept in recoverable virtual memory (RVM)
  - Reintegration: replay algorithm
    1. Parse the log; lock all related objects
    2. Validate and execute (for a store, only execute the metadata update)
    3. Data transfer for stores
    4. Commit and release locks
  - Conflicts: during phase 2 of replay, check whether the storeid is the same; if the server has a newer storeid, abort
- Questions
  - AFS Initial Prototype....
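The AFS name-resolution scheme above can be sketched in a few lines. The class and field names here (`VolumeServer`, `vnode_to_inode`, `server_for`) are invented for illustration; the point is that the client resolves a pathname to a fid once, and the server then maps the fid's vnode number straight to an inode through a table, with no server-side path walking:

```python
# fid = (volume number, vnode number, uniquifier), per the revised AFS design.

class VolumeServer:
    """Holds the low-level storage table for the volumes it serves."""

    def __init__(self):
        self.vnode_to_inode = {}  # table[vnode number] = inode number

    def lookup(self, fid):
        volume, vnode, uniquifier = fid
        return self.vnode_to_inode[vnode]  # vnode number is the index


class Client:
    """Picks a server by the fid's volume number via a cached mapping."""

    def __init__(self, volume_location):
        self.volume_location = volume_location  # volume number -> server

    def server_for(self, fid):
        # If the volume were not in the cache, the client would contact
        # any server to learn the mapping (not modeled here).
        return self.volume_location[fid[0]]
```

A fid such as `(1, 7, 0)` thus reaches the right server and the right inode with two table lookups, which is what removes the path-traversal CPU bottleneck of the prototype.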

February 24, 2022

Chain Replication

- Guarantees: strong consistency
- Interface
  - query(objId, opts)
  - update(objId, newVal, opts)
- Protocol
  - Replies: sent by the tail
  - Query: processed at the tail
  - Update: directed to the head, then chained down to the tail
    - Computed at the head; the resulting state (not the operation) is sent down the chain, so updates can be non-deterministic
  - Hist_objID = Hist_objID at the tail
  - Pending_objID = requests received but not yet processed by the tail
- Failure handling
  - The master service is assumed never to fail (replicated via [[Paxos]]); it reconfigures the chain
- Vs. primary/backup: downtime
  - Chain
    - Head failure: query uninterrupted; update unavailable for 2 message delays
    - Middle failure: query and update uninterrupted; ongoing updates delayed by 4 message delays; not visible to clients
    - Tail failure: query and update unavailable for 2 message delays
  - Primary/backup
    - Primary failure: query and update unavailable for 5 message delays
      1. Master broadcasts to all backups
      2. Backups respond with the number of updates processed
      3. Master broadcasts the new primary
      4. New primary forwards all missing updates
      5. Master notifies clients
    - Backup failure: query uninterrupted; update delayed 1 message delay (master notifies the primary that no ACK is needed)
In-class
- Primary-backup replication
  - Solves: ordering, non-determinism
  - Leaves open: failure handling
  - Write: sent to all backups; read: only at the primary
  - A read waits for concurrent writes to complete
- Chain replication
  - Goals: high availability, high throughput, strong consistency, handling fail-stop failures, low mean time to repair, easy to add nodes
  - Non-goals: handling slow or misbehaving nodes; wide-area networks (WAN)
  - Basic operations
    - val = query(obj-id): idempotent
    - update(obj-id, newVal): not idempotent; can handle more complex functions
    - What to do on loss?...
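The update and query paths above can be simulated in a toy model. The class names (`ChainNode`, `Chain`) and the synchronous, in-process "forwarding" are assumptions for illustration; real chain replication forwards over the network and acknowledges asynchronously:

```python
class ChainNode:
    """One replica in the chain; the node with no successor is the tail."""

    def __init__(self, successor=None):
        self.store = {}           # object state at this replica
        self.successor = successor

    def handle_update(self, obj_id, new_val):
        # The head computes the new state; every node applies that state and
        # forwards it, so even non-deterministic updates stay identical.
        self.store[obj_id] = new_val
        if self.successor:
            return self.successor.handle_update(obj_id, new_val)
        return ("ok", obj_id)     # only the tail sends the reply


class Chain:
    def __init__(self, length=3):
        self.tail = ChainNode()
        node = self.tail
        for _ in range(length - 1):
            node = ChainNode(successor=node)
        self.head = node

    def update(self, obj_id, new_val):
        return self.head.handle_update(obj_id, new_val)  # directed to head

    def query(self, obj_id):
        return self.tail.store.get(obj_id)               # processed at tail
```

Because queries read the tail and replies come only from the tail, a client never observes an update that has not reached every replica, which is where the strong-consistency guarantee comes from.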

February 24, 2022

Failures

Why Do Computers Stop
- Availability: doing the right thing within the specified response time
- Reliability: not doing the wrong thing
- Infant mortality: bugs in new hardware/software
- Bohrbug vs. Heisenbug
  - Heisenbug: does not happen 100% of the time
- Fault-tolerant execution
  - Process isolation: communicate via kernel messages; share no state
  - Process-pairs (spawn a backup process)
    - Lockstep: two processes do exactly the same thing
      - Cons: cannot tolerate Heisenbugs
    - State checkpointing: send the full state to the backup after every step
      - Cons: hard to program
    - Automatic checkpointing: save all messages and replay them to the backup when the primary fails
      - Cons: sends a lot of data
    - Delta checkpointing: send state deltas to the backup
      - Cons: hard to program
    - Persistent: lose all volatile state when failing over to the backup
- Questions
  - What is the motivation for this work?...
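The process-pair variants above differ mainly in what crosses the wire. A minimal sketch of delta checkpointing, with invented names (`Primary`, `Backup`, `apply_delta`), looks like this:

```python
class Backup:
    """Standby process; applies deltas so it can take over with current state."""

    def __init__(self):
        self.state = {}

    def apply_delta(self, delta):
        self.state.update(delta)  # merge only the keys that changed


class Primary:
    def __init__(self, backup):
        self.state = {}
        self.backup = backup

    def execute(self, key, value):
        delta = {key: value}             # just this step's change...
        self.state.update(delta)
        self.backup.apply_delta(delta)   # ...not the full state (cheap),
                                         # but the programmer must identify
                                         # every delta (hard to program)
```

Contrast with state checkpointing, which would ship `self.state` wholesale after each step: cheaper to program, far more data. Lockstep ships nothing but re-executes everything, which is exactly why a Heisenbug kills both halves of the pair.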

February 24, 2022

HA-NFS

- Reliability mechanisms
  - Server reliability: dual-ported disks
  - Disk reliability: mirroring files on different disks (same server)
  - Network reliability: replicated network components (workload distributed over the networks)
- Goals
  - Transparent failure & recovery
  - No penalty during normal operation
  - No client modification
- Architecture: each node consists of
  - Two servers, each with two network interfaces & IPs
    - The secondary interface is used when impersonating a failed server or re-integrating
  - A number of SCSI buses; each disk has one primary server
- Normal operation
  - The two servers exchange NFS NULL RPCs as liveness probes
  - If a probe fails, ping via both the network and the SCSI bus; take over only if both fail
- Take-over
  - The surviving server restores the failed server's file system
  - It changes its secondary MAC/IP to the failed server's
- Re-integration
  - The failed server turns off its primary interface and sends a request to the backup server
  - The backup unmounts and resets its secondary interface
  - The failed server resumes service
- Network failure
  - When normal: the two servers in a node use different networks as primary (load balancing); servers broadcast heartbeats on the network
  - On network failure: the client daemon times out and reroutes to an alternative path
In-class
- Goal: replicated, distributed service
- Why hard?...
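The take-over decision above is a two-stage check, which can be sketched as a single predicate (the function and parameter names are invented for illustration):

```python
def should_take_over(null_rpc_ok, network_ping_ok, scsi_probe_ok):
    """Decide whether to impersonate the peer server.

    Take over only when the periodic NULL-RPC probe has failed AND the peer
    is unreachable over BOTH the network and the shared SCSI bus; requiring
    both avoids a spurious take-over on a mere network glitch.
    """
    if null_rpc_ok:
        return False  # peer answered the liveness probe; nothing to do
    return (not network_ping_ok) and (not scsi_probe_ok)
```

Note the asymmetry this buys: a network partition alone leaves the SCSI-bus probe succeeding, so the healthy-but-unreachable peer keeps its disks and the client daemon simply reroutes, per the network-failure handling above.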

February 24, 2022