Bayou

- Weakly consistent: may return conflicting data (and will let clients know)
- Supports application-specific conflict detection and resolution
- Example app: meeting room scheduler
  - Users select several acceptable meeting times
  - A reservation may be "tentative"
- System model
  - Clients write to any server
  - The accepting server assigns a WriteID
  - Servers propagate writes during pair-wise contact, aka anti-entropy sessions
- Dependency check (see the sketch after this list)
  - Bayou_Write(update, dependency_check, mergeproc)
  - A write carries an application-supplied query and its expected result
  - Unexpected result => conflict => apply the merge procedure
- Replica consistency
  - Writes are performed in the same order at all servers
  - Conflict detection and merge procedures are deterministic
  - Servers need to know how to roll back tentative executions
- Write stability
  - Primary commit
  - Which server to choose as primary? ...
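A minimal sketch of how a server might apply a Bayou-style write: run the application-supplied dependency check, and fall back to the merge procedure if the result differs from the expected one. The class and function names (`Write`, `apply_write`) and the scheduler example are illustrative, not taken from the paper.

```python
# Sketch: applying a Bayou-style write at a server (names are illustrative).
# A write bundles the update, an application-supplied dependency check
# (query + expected result), and a merge procedure to run on conflict.

class Write:
    def __init__(self, write_id, update, dep_query, expected, mergeproc):
        self.write_id = write_id        # assigned by the accepting server
        self.update = update            # function(db) -> None
        self.dep_query = dep_query      # function(db) -> result
        self.expected = expected        # expected query result
        self.mergeproc = mergeproc      # function(db) -> alternative update or None

def apply_write(db, w):
    """Execute the dependency check; on conflict, run the merge procedure."""
    if w.dep_query(db) == w.expected:
        w.update(db)                    # no conflict: apply the update directly
        return "applied"
    alt = w.mergeproc(db)               # conflict: ask the merge procedure for an alternative
    if alt is not None:
        alt(db)
        return "merged"
    return "conflict-unresolved"        # e.g. surface an error to the user

# Example: meeting-room scheduler — reserve room A at 10:00, else try 11:00.
db = {"roomA": {}}  # time -> owner
w = Write(
    write_id=1,
    update=lambda db: db["roomA"].update({"10:00": "alice"}),
    dep_query=lambda db: "10:00" in db["roomA"],
    expected=False,
    mergeproc=lambda db: (lambda db: db["roomA"].update({"11:00": "alice"}))
                         if "11:00" not in db["roomA"] else None,
)
print(apply_write(db, w), db)
```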

April 5, 2022

Consistency

- 6 consistency guarantees
  - Strong consistency: see all previous writes
  - Eventual consistency: see (any) subset of previous writes
    - Weakest consistency
  - Consistent prefix: see an ordered sequence of writes starting with the first write
    - The read result existed at some point at the master
  - Bounded staleness: see all "old" writes
    - e.g. see all writes from more than 5 minutes ago
  - Monotonic reads: see an increasing subset of writes
    - Similar to eventual consistency, but a later read returns an equally or more recent value
  - Read my writes: see all writes performed by the reader
- Strength of a consistency guarantee
  - Defined by the size of the set of allowable results (see the sketch after this list)
  - Strong consistency: set size = 1 (only the latest value)
  - Eventual consistency: set size = large (any value)
- In-class
  - Linearizability
    - Strong consistency, behaves like a single copy
    - All ops seen in the same global order
    - Global order determined by real time: if a completes before b begins, a is ordered before b; otherwise a and b are concurrent
    - Simpler to reason about, poor performance
  - Sequential consistency
    - All ops totally ordered, but no real-time ordering (linearizability without real time)
    - Ops from one client are ordered in program order
  - Causal consistency
    - No total order seen by all clients
    - Causally-related ops: same order observed by all clients
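To make the "set of allowable results" idea concrete, here is a small illustrative sketch (not from the slides): for a single register with a known write history, it returns which values a read may legally return under a few of the guarantees. The function name and the `last_seen_index` parameter are assumptions for the example.

```python
# Sketch: the "set of allowable read results" view of consistency guarantees,
# for a single register written with values v1, v2, ..., vn (in commit order).

def allowable_reads(writes, guarantee, last_seen_index=0):
    """Return the set of values a read may legally return under a guarantee.
    last_seen_index models the newest write this client has already observed
    (relevant for monotonic reads)."""
    latest = writes[-1]
    if guarantee == "strong":
        return {latest}                       # set size = 1: only the latest value
    if guarantee == "eventual":
        return set(writes)                    # any previously written value
    if guarantee == "consistent_prefix":
        return set(writes)                    # the head of some prefix; for a single
                                              # register this is any written value
    if guarantee == "monotonic_reads":
        return set(writes[last_seen_index:])  # same as or newer than what we saw before
    raise ValueError(guarantee)

writes = ["v1", "v2", "v3", "v4"]
print(allowable_reads(writes, "strong"))                              # {'v4'}
print(allowable_reads(writes, "eventual"))                            # {'v1','v2','v3','v4'}
print(allowable_reads(writes, "monotonic_reads", last_seen_index=2))  # {'v3','v4'}
```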

April 1, 2022

Byzantine

- Sustains $\lfloor \frac{n-1}{3} \rfloor$ failures
  - $f$ replicas may not respond
  - Another $f$ may respond, but we cannot tell whether a responding node is faulty
  - Need $f + 1$ correct replies to outnumber the faulty ones
- Provides safety and liveness
  - Safety: all non-faulty replicas agree on a total order for the execution of requests, despite failures
- System model
  - Network may drop, delay, reorder, and duplicate messages
  - Faulty nodes behave arbitrarily
  - Independent node failures
  - $D(m)$ = digest of $m$
  - $\langle m \rangle_{\sigma_i}$ = message $m$ signed by node $i$
  - Adversary can coordinate faulty nodes and delay correct nodes
    - but cannot delay correct nodes indefinitely
    - and is computationally bound: cannot break the cryptographic primitives (e.g. no quantum computer)
- Algorithm
  - Terms
    - Replicas move through a succession of views
    - Primary of a view is $p = v \bmod |R|$
      - $p$ primary replica, $v$ view number, $|R|$ total replicas, $f$ max # of faulty replicas
    - Replicas = all servers (primary + backups)
  - Process
    - Client sends a request to the primary
    - Primary multicasts it to all backups
    - Replicas execute the request and reply to the client
    - Client waits for $f + 1$ identical replies (see the sketch after this list)
- Client
  - Client $c$ sends $\langle REQUEST, o, t, c \rangle_{\sigma_c}$ to the primary
    - $t$ timestamp, $o$ operation to be executed
  - Primary broadcasts to the backups
  - Replicas reply $\langle REPLY, v, t, c, i, r \rangle$
    - $r$ result, $i$ replica number, $v$ view number
  - Client waits for $f + 1$ replies with the same $t$ and $r$
  - If timeout: client broadcasts the request to all replicas
    - If already processed: re-send the reply (replicas remember)
    - If not processed and not the primary: redirect to the primary
- Normal case
  - Replica state: message log, current view
  - Three-phase atomic broadcast
    - Pre-prepare: primary multicasts to all backups and writes to its log
      - $\langle \langle PRE\text{-}PREPARE, v, n, d \rangle_{\sigma_p}, m \rangle$
      - $m$ is the client's request (message)? ...
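A small sketch of the bookkeeping described above: the fault threshold, the primary-selection rule, and the client-side rule "wait for $f+1$ replies with the same $t$ and $r$". Field names mirror the REPLY message in the notes; the functions themselves are illustrative.

```python
# Sketch: PBFT arithmetic and client-side reply matching (illustrative).

from collections import Counter

def max_faults(n):
    """With n replicas, PBFT tolerates f = floor((n-1)/3) Byzantine faults."""
    return (n - 1) // 3

def primary(view, num_replicas):
    """Primary of view v is p = v mod |R|."""
    return view % num_replicas

def client_accepts(replies, n):
    """Accept a result once f+1 replies agree on the same (t, r) pair.
    replies: list of dicts with keys v, t, c, i, r (as in the REPLY message)."""
    f = max_faults(n)
    counts = Counter((rep["t"], rep["r"]) for rep in replies)
    for (t, r), count in counts.items():
        if count >= f + 1:          # enough matching replies to outvote faulty nodes
            return r
    return None                      # keep waiting (or retransmit on timeout)

n = 4                                # smallest configuration: tolerates f = 1 fault
replies = [
    {"v": 0, "t": 17, "c": "c1", "i": 0, "r": "ok"},
    {"v": 0, "t": 17, "c": "c1", "i": 2, "r": "ok"},
    {"v": 0, "t": 17, "c": "c1", "i": 3, "r": "bogus"},   # faulty replica
]
print(max_faults(n), primary(view=5, num_replicas=n), client_accepts(replies, n))
# -> 1 1 ok
```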

March 29, 2022

EPaxos

- Paxos issues (also apply to Chubby, ZooKeeper)
  - All requests go to the master: bad scalability
  - High latency when geo-replicated (remote master)
  - Sensitive to load spikes & network delay
  - Master down -> system down (for a while)
- EPaxos goals
  - Optimal commit latency in the wide area
  - Optimal load balancing across all replicas
  - Graceful performance degradation
  - So...
    - Every replica needs to act as a proposer, or latency will be high => no leader
    - Minimize a proposer's communication with remote replicas
    - Quorum composition must be flexible, to avoid slow nodes
- Key: ordering
  - Old: the leader chooses the order, or commands go into pre-ordered slots
  - EPaxos: dynamic & decentralized
    - Not necessary to enforce a consistent ordering for non-interfering commands
  - Non-interfering: 1 RTT
    - Needs a fast-path quorum of $F + \lfloor\frac{F+1}{2}\rfloor$ nodes
    - $F$ = number of tolerated node failures
    - Example: R1: PreAccept C1; R5: PreAccept C2; R2, R3, R4: OK => Commit
  - Interfering: 2 RTT, quorum size $F+1$
    - R5: PreAccept C4; R3: OK C4
    - R1: PreAccept C3; R3: OK, but C3 should go after C4
    - R1 receives an inconsistent response => second phase
    - R1: Accept C3 -> C4; R2, R3: OK => Commit
- Commit protocol (unoptimized version, fast-path quorum = $2F$; see the sketch after this list)
  - Replica $L$ receives a request
  - Replica chooses the next available instance
  - Attaches attributes
    - deps: list of all instances that interfere
    - seq: number for breaking dependency cycles, larger than the max seq of deps
  - Replica forwards the command and attributes to at least a fast-path quorum of nodes (PreAccept)
  - If the quorum responds with the same attributes => commit
  - Else update the attributes (union of deps, max of seq) => tell a simple majority to accept => commit
- Execution algorithm
  - Find strongly connected components, topologically sort, ...
- Paper 4 ...
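A sketch of the quorum sizes and the PreAccept attribute merge described above. Here $F$ is the number of tolerated failures (so $N = 2F + 1$ replicas); the function names and reply format are illustrative and follow the notes rather than the paper's pseudocode.

```python
# Sketch: EPaxos quorum sizes and PreAccept attribute merging (illustrative).

def fast_path_quorum(F):
    """Optimized fast-path quorum: F + floor((F+1)/2) replicas."""
    return F + (F + 1) // 2

def slow_path_quorum(F):
    """Slow path (Paxos-Accept) quorum: a simple majority, F + 1 replicas."""
    return F + 1

def merge_attrs(local, remote_list):
    """Command leader merges PreAccept replies: union of deps, max of seq.
    If every reply already matched the local attrs, the fast path applies."""
    unchanged = all(set(r["deps"]) == set(local["deps"]) and r["seq"] == local["seq"]
                    for r in remote_list)
    deps = set(local["deps"])
    seq = local["seq"]
    for r in remote_list:
        deps |= set(r["deps"])
        seq = max(seq, r["seq"])
    return {"deps": deps, "seq": seq}, unchanged

# For F = 3 (7 replicas): fast path needs 5 nodes, slow path needs 4.
print(fast_path_quorum(3), slow_path_quorum(3))
attrs, fast = merge_attrs({"deps": {"R5.1"}, "seq": 2},
                          [{"deps": {"R5.1"}, "seq": 2},
                           {"deps": {"R5.1", "R3.7"}, "seq": 3}])
print(attrs, fast)   # merged deps include R3.7, seq bumped to 3, fast path not taken
```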

March 22, 2022

ZooKeeper

- Introduction
  - No blocking primitives (e.g. locks)
  - FIFO client ordering of all operations
  - Linearizable writes
  - Leader-based atomic broadcast: Zab
- ZooKeeper service
  - Hierarchy of regular/ephemeral znodes
  - Watch: triggers if the data changes
  - Data model: simplified filesystem API with full reads/writes of node data
  - Offers async and sync access via full paths
  - setData and delete take a version parameter; the op fails if the version does not match
- Example: leader configuration change
  - New leader deletes the ready node
  - New leader asynchronously updates the configuration nodes
  - Leader creates the ready node
  - Clients check whether ready exists before reading the configuration
- Configuration management: processes share the same configuration node
- Rendezvous
  - Master wants to share info with workers, but does not know the info until it has started
  - Master passes a node path to the workers and later writes the info to that node
- Group membership
  - Create a parent node
  - Children of the node = members
- Simple lock (see the sketch after this list)
  - Clients try to create an ephemeral lock node
  - Only one will succeed; the others watch it
- Implementation
  - Idempotent transactions
  - Read: served by any server (weakly consistent; use sync if needed)
  - Write: forwarded to a single server (the leader), which calculates the new state
  - Client and server exchange zxid (last transaction id)
  - Client connects only to servers with the same or greater zxid
- VS Chubby
  - No locks, no blocking primitives; Chubby blocks updates while invalidating others' caches
  - Access via full path (vs file handles)
  - Chubby redirects all operations to the master
  - Chubby has stricter consistency; Chubby is slower
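A sketch of the simple-lock recipe above, written against the kazoo Python client; the ensemble address, the `/app/lock` path, and the helper names are assumptions for the example. This is the basic recipe from the notes (all waiters watch the same node), not the herd-effect-free sequential-node variant.

```python
# Sketch of the ZooKeeper "simple lock" recipe (illustrative, kazoo client assumed).

import threading
from kazoo.client import KazooClient
from kazoo.exceptions import NodeExistsError

def acquire_simple_lock(zk, path="/app/lock"):
    """Create an ephemeral lock node; if it already exists, watch it and retry."""
    while True:
        released = threading.Event()
        try:
            zk.create(path, b"", ephemeral=True)    # ephemeral: auto-deleted if we die
            return                                   # we hold the lock
        except NodeExistsError:
            # Someone else holds it: set a one-shot watch and wait for a change.
            if zk.exists(path, watch=lambda event: released.set()) is None:
                continue                             # holder vanished between create and exists
            released.wait()                          # woken by the watch, then retry create

def release_simple_lock(zk, path="/app/lock"):
    zk.delete(path)

zk = KazooClient(hosts="127.0.0.1:2181")             # assumed local ensemble
zk.start()
zk.ensure_path("/app")
acquire_simple_lock(zk)
# ... critical section ...
release_simple_lock(zk)
zk.stop()
```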

March 20, 2022