Sustains up to $\lfloor \frac{n-1}{3} \rfloor$ faulty replicas out of $n$

  • $f$ replicas may fail to respond
  • $f$ replicas may respond with wrong answers
    • and a client cannot tell which responders are faulty
  • so $f + 1$ matching replies are needed to outnumber the faulty replicas
  • Provides safety and liveness
  • Safety:
    • all non-faulty replicas agree on a total order for the execution of requests, despite failures.
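The fault-tolerance arithmetic above can be sketched as a few toy helpers (function names are mine, not the paper's):

```python
def max_faults(n: int) -> int:
    # A system of n replicas sustains at most f = floor((n - 1) / 3)
    # Byzantine faults, i.e. n >= 3f + 1.
    return (n - 1) // 3

def reply_quorum(n: int) -> int:
    # f + 1 matching replies outnumber the f possibly-faulty replicas,
    # so at least one reply must come from a correct node.
    return max_faults(n) + 1

def agreement_quorum(n: int) -> int:
    # 2f + 1 replicas: any two such quorums intersect in >= f + 1
    # replicas, hence in at least one non-faulty replica.
    return 2 * max_faults(n) + 1
```

For example, with $n = 4$ replicas, $f = 1$, a client needs 2 matching replies, and replicas need quorums of 3.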

System Model

  • Network
    • may drop, delay, reorder, or duplicate messages
  • Faulty nodes may behave arbitrarily (Byzantine)
  • Independent node failures
  • $D(m)$ = digest of m
  • $\langle m \rangle _{\sigma_i}$ = message signed by node $i$
  • Adversary
    • can
      • coordinate faulty nodes
      • delay correct nodes
    • cannot
      • delay correct nodes indefinitely
      • subvert the cryptographic techniques (the adversary is computationally bound; e.g., it cannot forge signatures)

Algorithm

Terms

  • replicas move through a succession of views
  • primary of a view is $p = v \mod |R|$
    • $p$ primary replica
    • $v$ view number
    • $|R|$ total replicas
  • $f$: maximum number of replicas that may be faulty
  • Replicas: the set of all servers
  • In each view, one replica is the primary and the rest are backups
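The round-robin primary rule $p = v \mod |R|$ as a one-line sketch (the helper name is illustrative):

```python
def primary_of_view(v: int, num_replicas: int) -> int:
    # The primary rotates round-robin with the view number: p = v mod |R|.
    return v % num_replicas
```

So with $|R| = 4$, views 0, 1, 2, 3, 4 are led by replicas 0, 1, 2, 3, 0.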

Process

  1. Client sends a request to the primary
  2. Primary multicasts the request to all backups
  3. Replicas execute the request and reply to the client
  4. Client waits for $f + 1$ matching replies

Client

  • Client $c$ sends $\langle REQUEST, o, t, c \rangle_{\sigma_c}$ to the primary
    • $t$ timestamp
    • $o$ operation to be executed
  • Primary broadcasts the request to the backups
  • Replicas reply with $\langle REPLY,v,t,c,i,r \rangle_{\sigma_i}$
    • $r$ result
    • $i$ replica number
    • $v$ view number
  • Client waits for $f + 1$ replies with the same $t$ and $r$
  • If the client times out
    • it broadcasts the request to all replicas
    • a replica that has already processed the request
      • re-sends its reply (replicas remember the last reply sent to each client)
    • a replica that has not processed it and is not the primary
      • relays the request to the primary
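The client-side wait for $f + 1$ matching replies could look like this minimal sketch (names and data layout are mine, not the paper's):

```python
from collections import Counter

def accept_result(replies: dict, f: int):
    """replies maps replica id i -> (t, r). The client accepts result r
    once f + 1 distinct replicas report the same (t, r): at most f of
    them can be faulty, so at least one correct replica vouches for r."""
    counts = Counter(replies.values())
    for (t, r), votes in counts.items():
        if votes >= f + 1:
            return r
    return None  # keep waiting (or time out and broadcast to all replicas)
```

With $f = 1$, two matching replies suffice even if a third replica answers something different.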

Normal Case

Replica state

  • message log
  • current view

Three-phase atomic broadcast

Pre-prepare
  • Primary multicasts to all backups and writes to its log: $\langle \langle PRE-PREPARE, v, n, d \rangle_{\sigma_p}, m \rangle$
    • $m$: the client’s request, piggybacked outside the signed tuple (so the signed part stays small)
    • $d$: $m$’s digest
    • $n$: sequence number assigned by the primary
  • Backup accepts if
    • signature matches
    • backup is in view $v$
    • it has not accepted a pre-prepare for the same $v$, $n$ with a different digest
    • $h <$ seq num $< H$
      • prevents a faulty primary from exhausting the sequence-number space by choosing a huge $n$
Prepare
  • If the backup accepts, it multicasts a PREPARE and writes it to its log
  • $\langle PREPARE, v, n, d, i \rangle_{\sigma_i}$
  • Replicas accept a PREPARE if
    • Signature match
    • In view
    • $h <$ seq num $< H$
  • prepared(m, v, n, i) = true if the following are all written to $i$’s log
    • request $m$
    • a pre-prepare in $v$ and seq num $n$
    • $2f$ PREPAREs from different backups that match the pre-prepare ($v$, $n$, digest)
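The prepared predicate as a toy check over a replica's log (log layout is my own; requests are tracked by digest for brevity):

```python
def prepared(log: dict, d: str, v: int, n: int, f: int) -> bool:
    """prepared(m, v, n, i) holds at replica i once i's log contains the
    request (here, its digest d), a pre-prepare for (v, n) with digest d,
    and PREPAREs from 2f distinct backups matching (v, n, d)."""
    if d not in log["requests"]:
        return False
    if log["preprepares"].get((v, n)) != d:
        return False
    backups = {i for (pv, pn, pd, i) in log["prepares"]
               if (pv, pn, pd) == (v, n, d)}
    return len(backups) >= 2 * f
```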
Commit
  • when prepared(m, v, n, i) becomes true
  • the replica multicasts $\langle COMMIT, v, n, D(m), i\rangle_{\sigma_i}$
  • replicas accept COMMITs and insert them into their logs (same conditions as for PREPARE)
  • execute the request after accepting $2f + 1$ matching COMMITs
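The local-commit condition, continuing the same toy log layout as above (not the paper's code):

```python
def committed_local(log: dict, d: str, v: int, n: int, f: int,
                    is_prepared: bool) -> bool:
    """A replica executes the request once prepared(m, v, n, i) holds and
    it has logged 2f + 1 matching COMMITs from distinct replicas
    (possibly including its own)."""
    replicas = {i for (cv, cn, cd, i) in log["commits"]
                if (cv, cn, cd) == (v, n, d)}
    return is_prepared and len(replicas) >= 2 * f + 1
```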
Invariant
  • if prepared(m, v, n, i) is true for a non-faulty replica $i$, then prepared(m', v, n, j) is false for every non-faulty $j$ and every $m' \neq m$
    • (no two different requests can be prepared with the same view and sequence number)
  • if a request is executed locally, at least $f + 1$ non-faulty replicas have prepared it

Checkpoint

  • multicast $\langle CHECKPOINT, n, d, i \rangle_{\sigma_i}$
    • sent after every $k$-th executed request
    • $d$ = digest of the replica state at the checkpoint
    • the checkpoint becomes stable once $2f + 1$ replicas send matching CHECKPOINT messages
    • stable checkpoints advance the low and high water marks $h$, $H$
      • $h$ = seq num of the last stable checkpoint
      • $H$ = $h$ + a constant big enough that replicas don’t stall (e.g. a multiple of $k$)
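A sketch of detecting a stable checkpoint and advancing the water marks (message layout and `span` constant are illustrative assumptions):

```python
from collections import Counter

def advance_water_marks(checkpoints: set, f: int, span: int):
    """checkpoints is a set of (n, d, i) CHECKPOINT messages. A checkpoint
    (n, d) becomes stable once 2f + 1 distinct replicas sent it; the
    low-water mark h jumps to the highest stable n, and H = h + span."""
    counts = Counter((n, d) for (n, d, _i) in checkpoints)
    stable = [n for (n, _d), c in counts.items() if c >= 2 * f + 1]
    if not stable:
        return None          # no stable checkpoint yet; keep old h, H
    h = max(stable)
    return h, h + span
```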

View-Change

  • A backup starts a timer when it receives a request it has not executed; on timeout it initiates a view change
  • Multicast $\langle VIEW-CHANGE, v+1, n, C, P, i \rangle _{\sigma_i}$
    • $n$: seq num of the last stable checkpoint
    • $C$: $2f + 1$ valid CHECKPOINT messages proving $n$
    • $P$: for each prepared-but-unexecuted request, its pre-prepare plus $2f$ valid PREPARE messages
  • When the primary of view $v + 1$ receives $2f$ valid VIEW-CHANGE messages from other replicas ($2f + 1$ in total, counting its own), it multicasts $\langle NEW-VIEW, v + 1, V, O\rangle_{\sigma_p}$
    • $V$: all valid VIEW-CHANGE messages
    • $O$: pre-prepares re-issued by the new primary for the requests in $P$
      • may be empty
  • Backups enter the new view on receiving a valid NEW-VIEW message
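The new primary's trigger condition, as a small sketch (validity checking of the messages themselves is elided; the data layout is mine):

```python
def ready_for_new_view(view_changes: set, v_new: int, f: int) -> bool:
    """view_changes: set of (v, i) pairs for valid VIEW-CHANGE messages
    received from other replicas. The primary of view v_new sends
    NEW-VIEW once it holds 2f of them (2f + 1 counting its own)."""
    senders = {i for (v, i) in view_changes if v == v_new}
    return len(senders) >= 2 * f
```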

Non-Determinism

  • The primary selects the non-deterministic value (e.g. a timestamp) and includes it in the request
  • Or the backups propose values and the primary chooses among them

In Class

Can’t trust single node

  • Ask all nodes?
    • hurts liveness
    • hurts latency
    • the primary could spoof other servers’ responses
      • Solution: digital signatures

Need a primary to order requests

  • What if primary fault?

    • Primary lies to backups
      • Sol: signatures
    • Primary ignores requests
      • Sol: timeout => client broadcasts to all replicas
  • Can faulty replicas prevent progress?

    • No: a replica only waits for $2f$ matching messages from others (a $2f + 1$ quorum counting its own), so $f$ faulty nodes cannot block progress
  • handle primary sending different ops to different backups: the prepare phase cross-checks digests, so conflicting pre-prepares for the same $(v, n)$ cannot both prepare

  • handle primary lying to clients (client waits for $f + 1$ matching responses)
