Sustain ⌊(n − 1)/3⌋ failures (n = 3f + 1 replicas tolerate f faults)

  • f replicas may not respond, so must make progress after n − f replies
  • the f non-responders might be correct but slow
    • so up to f of the n − f collected replies could be from faulty replicas
  • need the non-faulty replies (n − 2f) to outnumber the faulty ones (f), so n ≥ 3f + 1
    • e.g., n = 4, f = 1: proceed after 3 replies; at worst 1 of them is faulty, and the 2 correct ones outnumber it
  • Provides safety and liveness
  • Safety:
    • all non-faulty replicas agree on a total order for the execution of requests, despite failures

System Model

  • Network
    • drop, delay, out of order, duplicate
  • Faulty nodes may behave arbitrarily (Byzantine)
  • Independent node failures
  • D(m) = digest of m
  • ⟨m⟩σ_i = message m signed by node i (see the sketch after this list)
  • Adversary
    • can
      • coordinate faulty nodes
      • delay correct nodes
    • cannot
      • delay correct nodes indefinitely
      • break cryptography (forge signatures or find digest collisions), e.g., no quantum computers
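
A minimal sketch of these primitives in Python (illustrative only: D(m) as a SHA-256 digest, and MACs standing in for signatures, as later PBFT variants do; the key table and all names are hypothetical):

```python
import hashlib
import hmac

def digest(m: bytes) -> str:
    """D(m): collision-resistant digest of message m."""
    return hashlib.sha256(m).hexdigest()

# Hypothetical pairwise secret keys; a real deployment would use
# public-key signatures or per-session keys.
KEYS = {("c", "p"): b"shared-secret-c-p"}

def sign(m: bytes, sender: str, receiver: str) -> str:
    """⟨m⟩σ_i as a MAC: authenticate m from sender to receiver."""
    return hmac.new(KEYS[(sender, receiver)], m, hashlib.sha256).hexdigest()

def verify(m: bytes, tag: str, sender: str, receiver: str) -> bool:
    """Receiver-side check that the tag matches."""
    return hmac.compare_digest(sign(m, sender, receiver), tag)
```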

Algorithm

Terms

  • replicas move through a succession of views
  • the primary of a view is p = v mod |R|
    • p: primary replica
    • v: view number
    • |R|: total number of replicas
  • f: max # of replicas that may be faulty
  • Replica: any of the servers
  • Primary (one per view), Backups (all the other replicas)

Process

  1. Client sends request to the primary
  2. Primary multicasts it to all backups
  3. Replicas execute the request and reply to the client
  4. Client waits for f+1 identical replies

Client

  • Client c sends ⟨REQUEST, o, t, c⟩σ_c to the primary
    • t: timestamp
    • o: operation to be executed
  • Primary multicasts it to the backups
  • Replicas reply ⟨REPLY, v, t, c, i, r⟩σ_i
    • r: result
    • i: replica number
    • v: view number
  • Client waits for f+1 replies with the same t and r (see the sketch after this list)
  • If timeout
    • Client broadcasts the request to all replicas
    • If a replica has already processed it
      • re-send the reply (replicas remember the last reply sent to each client)
    • If not processed & not the primary
      • relay the request to the primary; if the primary stays silent, backups eventually suspect it (view change)
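
A minimal sketch of the client's reply collection, assuming a `Reply` record with the fields above (all names hypothetical):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Reply:
    v: int   # view number
    t: int   # client timestamp echoed back
    c: str   # client id
    i: int   # replica number
    r: str   # result

def accept_result(replies: list[Reply], t: int, f: int) -> str | None:
    """Return a result once f+1 replicas with distinct ids report the
    same (t, r); otherwise None (keep waiting, retransmit on timeout)."""
    voters: dict[str, set[int]] = {}
    for rep in replies:
        if rep.t == t:
            voters.setdefault(rep.r, set()).add(rep.i)
    for result, ids in voters.items():
        if len(ids) >= f + 1:    # at least one voter is non-faulty
            return result
    return None
```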

Normal Case

Replica state

  • message log
  • current view

Three-phase atomic broadcast

Pre-prepare
  • Primary multicasts to all backups and writes to its log ⟨⟨PRE-PREPARE, v, n, d⟩σ_p, m⟩
    • m: the client’s full request, piggybacked
      • not part of the signed content; d stands in for it, so large requests can be sent separately
    • d: m’s digest D(m)
    • n: sequence number assigned by the primary
  • Backup accepts if (see the sketch after this list)
    • signature matches and d = D(m)
    • it is in view v
    • it has not accepted a PRE-PREPARE for the same v, n with a different digest
    • h < n < H
      • prevents a faulty primary from exhausting the seq-number space by choosing a huge one
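
A sketch of the backup's acceptance check under the conditions above (field names hypothetical; `digest` as in the earlier sketch):

```python
def accept_preprepare(state, v: int, n: int, d: str, m: bytes, sig_ok: bool) -> bool:
    """Backup-side checks before logging a PRE-PREPARE for (v, n, d, m)."""
    if not sig_ok or d != digest(m):
        return False                  # bad signature or digest mismatch
    if v != state.view:               # backup must be in view v
        return False
    if state.preprepares.get((v, n), d) != d:
        return False                  # (v, n) already bound to a different digest
    return state.h < n < state.H      # seq number within the water marks
```
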
Prepare
  • If the backup accepts, it multicasts and writes to its log
  • ⟨PREPARE, v, n, d, i⟩σ_i
  • Replica accepts a PREPARE if
    • signature matches
    • it is in view v
    • h < n < H
  • prepared(m, v, n, i) = true iff all of the following are in i’s log
    • the request m
    • a PRE-PREPARE for m in view v with seq num n
    • 2f matching PREPAREs from different backups
Commit
  • when prepared(m, v, n, i) becomes true
  • multicast ⟨COMMIT, v, n, D(m), i⟩σ_i
  • replicas accept a COMMIT and insert it into the log (same conditions as for PREPARE)
  • execute m after accepting 2f+1 matching COMMITs (see the sketch after this section)
Invariant
  • if prepared(m, v, n, i) is true at a non-faulty i, then prepared(m′, v, n, j) is false at every non-faulty j for any m′ ≠ m
    • no two different requests can prepare with the same seq num in the same view
  • if a request is executed locally, at least f+1 non-faulty replicas have prepared it, so the decision survives view changes
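
A sketch of the two quorum predicates, assuming the log stores accepted messages as tuples (all names hypothetical):

```python
def prepared(log, m_digest: str, v: int, n: int, f: int) -> bool:
    """prepared(m, v, n, i): the matching PRE-PREPARE is logged, plus
    2f matching PREPAREs from distinct backups."""
    if log.preprepares.get((v, n)) != m_digest:
        return False
    senders = {i for (vv, nn, d, i) in log.prepares
               if (vv, nn, d) == (v, n, m_digest)}
    return len(senders) >= 2 * f

def committed_local(log, m_digest: str, v: int, n: int, f: int) -> bool:
    """Execute only once prepared holds and 2f+1 matching COMMITs
    (possibly including this replica's own) are logged."""
    if not prepared(log, m_digest, v, n, f):
        return False
    senders = {i for (vv, nn, d, i) in log.commits
               if (vv, nn, d) == (v, n, m_digest)}
    return len(senders) >= 2 * f + 1
```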

Checkpoint

  • multicast ⟨CHECKPOINT, n, d, i⟩σ_i
    • every k executions (i.e., whenever n mod k = 0)
    • d: digest of the state at checkpoint n
    • a checkpoint is stable once 2f+1 matching CHECKPOINT msgs are collected (its proof); older log entries can then be discarded
    • then update the low/high water marks h, H (see the sketch below)
    • h = seq num of the last stable checkpoint
    • H = h + K for a big-enough constant K (e.g., a small multiple of k)
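
A sketch of checkpoint stabilization and water-mark advancement under the assumptions above (names, including the truncation helper, hypothetical):

```python
def on_checkpoint(state, n: int, d: str, i: int, f: int, K: int) -> None:
    """Collect CHECKPOINT votes; once 2f+1 replicas agree on (n, d),
    the checkpoint is stable and the log can be truncated."""
    state.ckpt_votes.setdefault((n, d), set()).add(i)
    if len(state.ckpt_votes[(n, d)]) >= 2 * f + 1 and n > state.h:
        state.h = n                      # last stable checkpoint
        state.H = n + K                  # new high water mark
        state.discard_log_before(n)      # hypothetical truncation helper
```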

View-Change

  • A backup starts a timer when it receives a request it has not executed yet; on timeout it suspects the primary
  • Multicasts ⟨VIEW-CHANGE, v+1, n, C, P, i⟩σ_i
    • n: seq num of i’s last stable checkpoint
    • C: 2f+1 valid CHECKPOINT msgs proving that checkpoint
    • P: for each request prepared at i but not yet executed, its PRE-PREPARE plus 2f valid PREPARE msgs
  • When the primary of view v+1 receives 2f valid VIEW-CHANGE msgs from other replicas (2f+1 in total with its own), it multicasts ⟨NEW-VIEW, v+1, V, O⟩σ_p (see the sketch below)
    • V: the valid VIEW-CHANGE msgs
    • O: PRE-PREPAREs re-issued by the new primary for the requests proven in the P sets
      • gaps in the seq numbers are filled with PRE-PREPAREs for a null request
  • A backup enters the new view once it receives a valid NEW-VIEW msg (and then runs prepare/commit for everything in O)
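
A sketch of how the new primary might assemble O from the VIEW-CHANGE messages (names hypothetical; `vc.prepared` stands for the (n, d) pairs proven by the P sets):

```python
NULL_DIGEST = "null-request"   # digest of the agreed no-op request

def build_new_view_O(view_changes: list, new_v: int) -> list[tuple[int, int, str]]:
    """O: (view, seq, digest) triples re-issued as PRE-PREPAREs in view
    new_v; seq-number gaps are filled with the null request."""
    min_s = max(vc.n for vc in view_changes)    # latest stable checkpoint in V
    by_seq = {n: d for vc in view_changes
              for (n, d) in vc.prepared if n > min_s}
    max_s = max(by_seq, default=min_s)          # highest prepared seq num
    return [(new_v, n, by_seq.get(n, NULL_DIGEST))
            for n in range(min_s + 1, max_s + 1)]
```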

Non-Determinism

  • Primary selects the non-deterministic values (e.g., timestamps) and piggybacks them on the request (sketch below)
  • Or let the backups propose values and combine 2f+1 of them deterministically
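
A sketch of the first option, assuming the non-deterministic value is a wall-clock timestamp and a clock-skew bound (names and the bound are hypothetical):

```python
import time

def primary_pick_timestamp(last_ts: float) -> float:
    """Primary chooses the non-deterministic value and piggybacks it;
    forcing monotonicity keeps replica state machines deterministic."""
    return max(time.time(), last_ts + 1e-6)

def backup_validate_timestamp(ts: float, last_ts: float, skew: float = 1.0) -> bool:
    """Backups accept the primary's choice only if it is increasing and
    within an assumed skew bound of their own clocks."""
    return ts > last_ts and abs(ts - time.time()) <= skew
```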

In Class

Can’t trust a single node

  • Ask all nodes?
    • Liveness: faulty nodes may never reply
    • Latency: waiting for every node is slow
    • Primary can spoof other servers’ responses
      • Solution: digital signatures

Need a primary to order requests

  • What if the primary is faulty?
    • Primary lies to backups
      • Sol: signatures
    • Primary ignores requests
      • Sol: timeout ⇒ client broadcasts to all replicas
  • Can faulty replicas prevent progress?
    • No: quorums need only 2f+1 of the 3f+1 replicas, so f faulty or slow replicas cannot block progress
    • any two quorums of 2f+1 intersect in ≥ f+1 replicas, i.e., in at least one non-faulty replica
  • How to handle the primary sending different ops to different backups? (the prepare phase: two conflicting requests cannot both collect 2f matching PREPAREs)
  • How to handle the primary lying to clients? (client waits for f+1 matching replies)

See also: distributed systems - Why is the commit phase in PBFT necessary? - Computer Science Stack Exchange