- Also applies to Chubby and ZooKeeper
- All requests go to master
- Bad scalability
- High latency when geo-replicated (remote master)
- Sensitive to load spikes & network delay
- Master down -> system down (for a while)
EPaxos
Goal #
- Optimal commit latency in the wide area
- Optimal load balancing across all replicas
- Graceful performance degradation
So… #
- Every replica needs to act as a proposer
- or else the latency will be high
- implies no leader
- Minimize the proposer’s communication with remote replicas
- Quorum composition must be flexible
Key #
- Ordering
- Old: the leader chooses the order, or commands go into pre-ordered slots
- EPaxos: dynamic & decentralized
- Not necessary to enforce a consistent ordering for non-interfering commands
- Non-interfering
- 1 RTT
- Need a fast-path quorum of nodes
- quorum size depends on F, the max number of tolerable node failures (sizes sketched at the end of this section)

- R1: PreAccept C1
- R5: PreAccept C2
- R2, R3, R4: OK => Commit
- Interfering
- 2 RTT
- the second round needs only a simple-majority quorum

- R5: PreAccept C4
- R3: OK C4
- R1: PreAccept C3
- R3: OK, but C3 should go after C4
- R1: receives inconsistent responses, starts second phase
- R1: Accept C3 -> C4 (C3 ordered after C4)
- R2, R3: OK => Commit
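A minimal sketch of the quorum arithmetic referenced in this section, assuming N = 2F + 1 replicas and the paper's quorum sizes: simple majority for the slow path, 2F for the unoptimized fast path, F + ⌊(F+1)/2⌋ for the optimized one.

```go
package main

import "fmt"

// Quorum sizes for EPaxos with N = 2F + 1 replicas, F tolerated failures:
// the slow path (Paxos-Accept) needs a simple majority, the unoptimized fast
// path needs 2F replies, the optimized fast path needs F + floor((F+1)/2).
func quorums(n int) (slow, fastBasic, fastOpt int) {
	f := (n - 1) / 2
	return f + 1, 2 * f, f + (f+1)/2
}

func main() {
	for _, n := range []int{3, 5, 7} {
		slow, fastBasic, fastOpt := quorums(n)
		fmt.Printf("N=%d: slow=%d fast(unoptimized)=%d fast(optimized)=%d\n",
			n, slow, fastBasic, fastOpt)
	}
}
```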
Protocol
Commit Protocol #
- (Unoptimized version, fast-path quorum = 2F, i.e., N − 1)
- A replica receives a request (it becomes the command leader)
- The replica chooses the next available instance in its own instance space
- Attaches attrs:
- deps: list of all instances that contain interfering commands
- seq: number for breaking dependency cycles, larger than max(seq of deps)
- The replica forwards the command and attrs to at least a fast-path quorum of nodes (PreAccept)
- If the quorum responds and the attrs are the same => commit (fast path)
- else update the attrs (union the deps, set max seq)
- => tell a simple majority to Accept
- => commit (slow path)
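A rough sketch of the command-leader side of the fast/slow path decision above. The Attrs, merge, and decide names are invented for illustration (not the paper's or any implementation's API), and message transport is omitted.

```go
package epaxos

// Attrs are the ordering attributes attached to a command at PreAccept time.
type Attrs struct {
	Deps map[string]bool // instance ids of interfering commands seen so far
	Seq  int             // breaks dependency cycles; > max seq of everything in Deps
}

// merge unions another replica's deps into a and keeps the larger seq,
// returning true if anything changed (i.e., the fast path is lost).
func (a *Attrs) merge(other Attrs) bool {
	changed := false
	for dep := range other.Deps {
		if !a.Deps[dep] {
			a.Deps[dep] = true
			changed = true
		}
	}
	if other.Seq > a.Seq {
		a.Seq = other.Seq
		changed = true
	}
	return changed
}

// decide collects PreAccept replies and picks the path: fast (commit now) or
// slow (run Paxos-Accept with the merged attributes, then commit).
func decide(local Attrs, replies []Attrs, fastQuorum int) (Attrs, bool) {
	// Work on a copy so the caller's attribute set is not mutated.
	final := Attrs{Deps: map[string]bool{}, Seq: local.Seq}
	for d := range local.Deps {
		final.Deps[d] = true
	}
	identical := true
	for _, r := range replies {
		if final.merge(r) {
			identical = false // some replica saw extra interfering commands
		}
	}
	// Fast path: a fast-path quorum answered (the leader counts itself) and
	// every reply carried exactly the attributes the leader proposed.
	fastPath := identical && len(replies)+1 >= fastQuorum
	return final, fastPath
}
```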
Execution Algorithm #
- build the dependency graph of committed commands, find strongly connected components
- topo sort the components
- execute each component’s commands in increasing seq order
- Paper §4.3.2
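A sketch of that execution step, assuming all dependencies in the graph are already committed locally; cmd and executionOrder are illustrative names, not the paper's. Tarjan's algorithm is used because it emits strongly connected components in inverse topological order, which is exactly the order the notes above call for.

```go
package epaxos

import "sort"

// cmd is a committed command in the dependency graph.
type cmd struct {
	seq  int
	deps []string // instance ids this command depends on
}

// executionOrder returns instance ids in a valid execution order:
// strongly connected components in inverse topological order (dependencies
// first), and commands within a component ordered by their seq attribute.
func executionOrder(cmds map[string]*cmd) []string {
	index := map[string]int{}
	low := map[string]int{}
	onStack := map[string]bool{}
	var stack, order []string
	next := 0

	var visit func(v string)
	visit = func(v string) {
		index[v], low[v] = next, next
		next++
		stack = append(stack, v)
		onStack[v] = true

		for _, w := range cmds[v].deps {
			if _, ok := cmds[w]; !ok {
				continue // sketch assumes deps are committed; skip unknown ids
			}
			if _, seen := index[w]; !seen {
				visit(w)
				if low[w] < low[v] {
					low[v] = low[w]
				}
			} else if onStack[w] && index[w] < low[v] {
				low[v] = index[w]
			}
		}

		if low[v] == index[v] { // v is the root of a strongly connected component
			var scc []string
			for {
				w := stack[len(stack)-1]
				stack = stack[:len(stack)-1]
				onStack[w] = false
				scc = append(scc, w)
				if w == v {
					break
				}
			}
			// Inside a dependency cycle, seq breaks the tie.
			sort.Slice(scc, func(i, j int) bool { return cmds[scc[i]].seq < cmds[scc[j]].seq })
			order = append(order, scc...)
		}
	}

	for id := range cmds {
		if _, seen := index[id]; !seen {
			visit(id)
		}
	}
	return order
}
```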
Misc #
- Can limit dependency list size
- dependencies can be covered transitively (keep only the most recent interfering instance per replica)
- if not fully captured, check during execution
- Failure handling (see the sketch below)
- if there is a timeout while waiting for an instance:
- send Explicit Prepare
- learn the result, or take over the instance and commit a no-op
- A replica is given a new id after a crash (§4.7)
- Prioritize completing old commands over proposing new commands
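A minimal sketch of the timeout/recovery path in the failure bullets above. The recoverer interface and its methods are hypothetical stand-ins for the rest of the replica (the real Explicit Prepare phase also has to finish partially accepted commands, which this collapses into a single boolean).

```go
package epaxos

import "time"

// recoverer is a hypothetical hook into the rest of the replica; the method
// names and signatures are illustrative, not the paper's API.
type recoverer interface {
	WaitCommitted(instance string, d time.Duration) bool // false on timeout
	ExplicitPrepare(instance string) (alreadyChosen bool, err error)
	TakeOverWithNoop(instance string) error // run the slow path with a no-op
}

// recoverInstance mirrors the failure-handling notes: wait with a timeout,
// then run Explicit Prepare to learn the outcome; if nothing was chosen,
// take over the instance and commit a no-op so execution can make progress.
func recoverInstance(r recoverer, instance string, timeout time.Duration) error {
	if r.WaitCommitted(instance, timeout) {
		return nil // the original command leader committed it in time
	}
	alreadyChosen, err := r.ExplicitPrepare(instance)
	if err != nil || alreadyChosen {
		return err // either learned the committed value or must retry later
	}
	return r.TakeOverWithNoop(instance)
}
```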
In class
Motivation #
- Most consensus algorithms have clients communicate with a single stable leader (as long as there are no failures)
- Performance bottleneck
- Handling all client requests
- Communicating with all followers
- Extra latency if leader is remote
- Load spikes
- Not available when leader has failed (until fail-over)
Goal #
- Optimal commit latency in wide-area when tolerating 1 or 2 failures, under realistic conditions.
- workload
- how many conflicting ops?
- burstiness, timing, exact distances
- environment
- Uniform load balancing across all replicas
- Graceful performance degradation when replicas are too slow
Key Ideas #
- Non-conflicting ops can be committed independently (order doesn’t matter)
- E.g.
- C1: b = 5
- C2: a = 6
- => commit + execution order does not matter (see the sketch after this list)
- Order commands dynamically and in decentralized fashion
- While choosing (voting on) a cmd for an instance, each participant attaches ordering constraints
- Guarantee all commit …
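A tiny worked example of that commutativity claim (plain Go, names invented for illustration): applying C1 and C2 in either order leaves the same state.

```go
package main

import "fmt"

func main() {
	// C1: b = 5 and C2: a = 6 touch different keys, so they commute.
	c1 := func(s map[string]int) { s["b"] = 5 }
	c2 := func(s map[string]int) { s["a"] = 6 }

	s1 := map[string]int{}
	c1(s1) // order: C1 then C2
	c2(s1)

	s2 := map[string]int{}
	c2(s2) // order: C2 then C1
	c1(s2)

	fmt.Println(s1, s2) // both print map[a:6 b:5]: the order did not matter
}
```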
When do commands interfere #
- Two cmds a and b interfere if there exists some sequence of commands such that executing them as …, a, b, … vs …, b, a, … results in
- a different state machine state
- different read results
- How to determine?
- Data
- Key-value: same key? (see the sketch after this list)
- Database: (Same row?)
- Explicit
- Everything conflicts (conservative fallback)
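A sketch of the key-value interference check mentioned above; the kvCommand type is invented for illustration, and it assumes two commands interfere only when they touch the same key and at least one of them writes.

```go
package epaxos

// kvCommand is an illustrative key-value command.
type kvCommand struct {
	Write bool   // true for PUT, false for GET
	Key   string
}

// interferes implements the "same key?" rule from the notes: commands on
// different keys commute, and two reads of the same key also commute.
func interferes(a, b kvCommand) bool {
	return a.Key == b.Key && (a.Write || b.Write)
}
```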
When can fast path be taken #
- If no dependencies exist
- If all replicas report the same dep list
- If no replicas have failed
What if dependency list differs? #
- Run Paxos-Accept
- with the union of all dependencies (and the max seq)
How to know when a cmd can be executed? How can the client know when a cmd is executed? #
- On client ACK
- To read state, send a cmd that interferes with the prior state
- it will be forced to be ordered afterwards
How many replicas should leader send to? #
- Assume we need to hear back from 3 replicas
ALL #
- + If some are slow, still hear from 3
- - Lots of extra comm. msgs
Min for fast path #
- + Send min. comm.
- - Slow or dead -> need re-send
Thrifty #
- Send to the minimum, but aggressively send to more replicas if enough replies have not arrived within some timeout (see the sketch below)
- Does not affect safety
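A sketch of that send strategy, assuming an asynchronous Send hook and a channel of reply notifications (all names invented for illustration); real code would keep collecting replies after escalating.

```go
package epaxos

import "time"

// sender is a hypothetical transport hook; Send is assumed asynchronous.
type sender interface {
	Send(replica int, msg interface{})
}

// thriftySend first PreAccepts only enough replicas for the fast-path quorum,
// then escalates to everyone else if replies are slow. This only affects
// latency/liveness; safety never depends on which subset was contacted.
func thriftySend(s sender, msg interface{}, peers []int, fastQuorum int,
	replies <-chan int, timeout time.Duration) {

	need := fastQuorum - 1 // the command leader counts itself toward the quorum
	if need > len(peers) {
		need = len(peers)
	}
	for _, p := range peers[:need] {
		s.Send(p, msg)
	}

	timer := time.NewTimer(timeout)
	defer timer.Stop()
	for got := 0; got < need; {
		select {
		case <-replies:
			got++
		case <-timer.C:
			// The minimal set was too slow: broadcast to the remaining peers.
			for _, p := range peers[need:] {
				s.Send(p, msg)
			}
			return
		}
	}
}
```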
What is bad about reads in EPaxos? #
- Must order + execute all prev. conflicting commands first
- Other systems:
- Leader/Chubby/Chain:
- - send all to leader, not scalable
- + Linearizability
- ZooKeeper:
- Better scalability, less consistency
- Read lease
- Temporary stable leader for that object
- writes must go thru this lease holder
What workloads do they use? #
- Key-value store
- Update/puts only
- No need for read lease
- Vary the amount of conflicts
- No eval of cost of determining conflicts
What workloads should they use? #
- YCSB
- common
- synthetic
- well-defined
- allows comparisons
- EPaxos: 100% conflicts is OK with 3 replicas but not with 5
- Generalized Paxos
- Even the fast path needs 3 replies; more RTTs on conflicts
- Mencius:
- Works well when load is balanced
- Not well when load is unbalanced
Throughput #
- No slow nodes => compare throughput
- Does adding more replicas improve throughput?
- One slow node
- multi-paxos: only the leader’s speed matters
- mencius: limited by the slowest replica
- epaxos: degrades gracefully (the slow replica can be left out of quorums)