  • server reliability
    • dual-ported disk
  • disk reliability
    • mirroring files on different disks (same server)
  • network reliability
    • replication of network components
    • (workload distributed over the networks)

Goal

  • Transparent failure & recovery
  • No performance penalty during normal (fault-free) operation
  • No client modification required

Architecture

  • Each node consists of
    • Two servers
      • Each server has 2 network interfaces & IP addresses
      • The secondary is used when impersonating or re-integrating
    • A number of SCSI buses
      • Each disk has one primary server
  • Normal operation
    • Both servers exchange NFS NULL RPCs (RFS_NULL) as heartbeats
    • If heartbeats fail, probe the partner via network ping and the shared SCSI bus
    • Take over if both probes fail (see the monitor sketch after this list)
  • Take-over
    • The surviving server restores the failed server's file system
    • It changes its secondary interface's MAC/IP to the failed server's (impersonation)
  • Re-Integration
    • The failed server turns off its primary interface and sends a re-integration request to the backup server
    • The backup unmounts the volumes and resets its secondary interface
    • The failed server then restores its state and resumes service
  • Network Failure
    • When normal
      • The two servers in the same node use different networks as their primary
      • (Load balancing)
      • Servers broadcast heartbeats on the networks
    • When network fail
      • The client daemon times out and reroutes to an alternative path
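
A minimal sketch (in Python, not from the paper) of the liveness-check / take-over loop described above. The helper names (nfs_null_rpc, icmp_ping, scsi_probe, take_over) and the timing constants are illustrative assumptions, not HA-NFS's actual values.

    import time

    HEARTBEAT_INTERVAL = 5   # assumed seconds between NULL-RPC liveness checks
    MISS_LIMIT = 2           # assumed consecutive misses before deeper probing

    def monitor_partner(peer, nfs_null_rpc, icmp_ping, scsi_probe, take_over):
        """Run on each server of a node to watch its partner (hypothetical helpers)."""
        missed = 0
        while True:
            missed = 0 if nfs_null_rpc(peer) else missed + 1
            if missed >= MISS_LIMIT:
                # NFS layer is silent: confirm over two independent paths before acting
                net_alive = icmp_ping(peer)     # plain network-level ping
                scsi_alive = scsi_probe(peer)   # probe over the shared SCSI bus (target mode)
                if not net_alive and not scsi_alive:
                    take_over(peer)             # restore fs state, impersonate peer's IP/MAC
                    return
                missed = 0                      # partner is alive; only the NFS service lags
            time.sleep(HEARTBEAT_INTERVAL)

Only when both the network and the SCSI probes fail does the partner take over; a reachable but slow server is left alone, which avoids taking over a live primary.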

In-class

  • Goal: Replicated, distributed service
  • Why Hard?
    • Must agree in presence of failures + concurrency while being efficient
  • Simplest case: 2 nodes
    • If one fails, use other
      • detect failure, fail-stop, fail-recovery
    • Challenge: how to keep both copies up-to-date given updates
    • Assume strictest consistency
      • One-copy semantics
  • Approach 1: Symmetric Replicas
    • Send request to all replicas
    • Issue 1: Order of operations
      • R1: create(a), delete(a)
      • R2: delete(a), create(a)
      • replica divergence
      • agree on the order of inputs (atomic broadcast; see the sketch below)
    • Issue 2: Determinism
      • Server_1 and Server_2 must come to the same result given the same input
      • No race condition
      • No randomness
      • No dependence on exact time
    • Issue 3: Failure + Recovery
      • What happens to failed node?
        • Nothing
        • Works on other node
      • What happens when recovers?
        • Catch up with work
        • S2 ships its current state or a diff (log) to the recovered node
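
A minimal ordering sketch for Issue 1: a single sequencer assigns a global sequence number to every operation, and each replica applies operations strictly in that order. The class names and queue-based delivery are illustrative assumptions, not a full atomic-broadcast protocol (the sequencer itself is a single point of failure here).

    import itertools

    class Sequencer:
        """Assigns one global order to all client operations (sketch)."""
        def __init__(self, replica_queues):
            self.counter = itertools.count()
            self.replicas = replica_queues     # one FIFO queue per replica

        def broadcast(self, op):
            seq = next(self.counter)
            for q in self.replicas:
                q.put((seq, op))               # every replica sees the same (seq, op) pairs

    def apply_in_order(inbox, state):
        """Replica loop: apply ops strictly by sequence number, so replicas cannot diverge."""
        expected, pending = 0, {}
        while True:
            seq, op = inbox.get()
            pending[seq] = op
            while expected in pending:
                pending.pop(expected)(state)   # op must be deterministic: no randomness, no clock
                expected += 1

With both replicas applying create(a) and delete(a) in the same global order, the divergence in Issue 1 cannot happen; Issue 2 (determinism inside each op) is still the replicas' responsibility.
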
  • Approach 2: Primary-Backup Rep.
    • Client communicates only with primary
    • Primary handles request, updates self, sends state change to backup
    • Handled problems
      • ordering, determinism
    • Failure?
      • Backup failure?
        • Nothing
        • When backup returns
          • logging (diff)
          • whole state shipping
      • Primary failure?
        • Detect failure
        • State catch up
        • Clients may notice a performance hiccup during takeover
    • Issue
      • Failure detection
        • Heartbeats
          • Too short
            • backup takes over while the primary is still alive (split brain)
          • Too long
            • longer unavailability after a real failure
      • When to ack client?
        1. ack after the backup has finished applying the update (sketched after this list)
          • safe
        2. ack immediately while the backup applies the update concurrently
          • better performance
      • If there are multiple backups, which one takes over?
        • deterministic known order
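
A minimal primary-backup sketch under option 1 above (ack the client only after the backup has applied the state change). The classes and the dict-based state are illustrative assumptions, not code from the lecture.

    class Backup:
        def __init__(self):
            self.state = {}

        def apply(self, delta):
            self.state.update(delta)           # backup stays in lock-step with the primary

    class Primary:
        def __init__(self, backup):
            self.state = {}
            self.backup = backup               # assumed reliable channel to the backup

        def handle(self, key, value):
            self.state[key] = value            # 1. primary updates itself
            self.backup.apply({key: value})    # 2. ship the state change; returns once applied
            return "OK"                        # 3. only now ack the client (the "safe" choice)

    # usage: p = Primary(Backup()); p.handle("a", 1)

Because only the primary chooses an order and ships the resulting state changes (not the raw requests), the ordering and determinism issues of symmetric replicas disappear; acking before backup.apply returns would be option 2, trading safety for latency.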

HA-NFS

  • Properties
    1. Each server in a node acts as both primary and backup
    2. Handles server, disk, and network failures
    3. How local behavior impacts the distributed system
    4. Clients unchanged
      • Failure, recovery are transparent
    5. Good common-case (fault-free) performance
  • Crash consistency on 1 machine
    • Implemented on AIXv3
      • Journaling file system
    • Meta-data
      • Inodes, Directories, Bitmap, indirect blocks
    • What gets changed on a file append()
      • Inode, Bitmap
    • Goal after a crash
      • ensure all metadata is intact (see the journaling sketch below)
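
A minimal write-ahead journaling sketch for the append() case above: the record describing the inode and bitmap changes is made durable before the in-place structures are touched, so a crash can be repaired by replaying the journal. The file name, record format, and set-based metadata are illustrative assumptions, not AIXv3's JFS format.

    import json, os

    JOURNAL = "fs.journal"                     # assumed journal file name

    def journal_append(record):
        """Make the intended metadata change durable before applying it."""
        with open(JOURNAL, "a") as j:
            j.write(json.dumps(record) + "\n")
            j.flush()
            os.fsync(j.fileno())               # the record hits disk before we proceed

    def append_block(inodes, bitmap, inode_no, block_no):
        # 1. log the metadata change (inode + bitmap) as a single record
        journal_append({"op": "append", "inode": inode_no, "block": block_no})
        # 2. only then update the metadata structures themselves
        bitmap.add(block_no)                              # mark the new block allocated
        inodes.setdefault(inode_no, set()).add(block_no)  # point the inode at it

    def recover(inodes, bitmap):
        """After a crash, replay the journal so all metadata is intact (replay is idempotent)."""
        if os.path.exists(JOURNAL):
            with open(JOURNAL) as j:
                for line in j:
                    rec = json.loads(line)
                    if rec.get("op") == "append":
                        bitmap.add(rec["block"])
                        inodes.setdefault(rec["inode"], set()).add(rec["block"])
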
  • Journal
    • use the journal for the reply cache as well (sketch below)
    • (for non-idempotent operations)
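
The journal can double as the duplicate-request (reply) cache: if a retransmitted non-idempotent request arrives after a crash or take-over, the saved reply is returned instead of re-executing the operation. Keying on (client, xid) and reusing journal_append from the previous sketch are assumptions.

    class ReplyCache:
        """Sketch of a reply cache for non-idempotent NFS operations (e.g. create, remove)."""
        def __init__(self, journal_append):
            self.cache = {}                    # (client, xid) -> saved reply
            self.journal_append = journal_append

        def handle(self, client, xid, execute):
            key = (client, xid)
            if key in self.cache:              # retransmission: do NOT re-execute
                return self.cache[key]
            reply = execute()                  # run the non-idempotent operation exactly once
            self.cache[key] = reply
            self.journal_append({"op": "reply", "key": [client, xid], "reply": reply})
            return reply

After a take-over, the surviving server rebuilds this cache from the journal, so a client's retransmitted request gets the old reply rather than a second execution.
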
  • Architecture
    • Normal op
      • Update meta-data, track the non-idempotent reply cache
    • Failure detection
      • check liveness with heartbeats
      • No response
        • ping via ICMP
        • probe via SCSI bus target mode
    • Take over
      • bring the file system to a consistent state
      • replay journal for fs meta-data + reply cache
    • Re-integration
      • S1 reboots and sends a re-integration request to S2
      • S2 unmounts Volume 1 and switches Network Iface 2 back to its secondary IP
      • S2 notifies S1
      • S1 reclaims its disks, replays the log to reconstruct the reply cache, and switches Network Iface 1 back on (see the hand-back sketch below)
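
A sketch of the re-integration hand-back above, written as three cooperating steps. All helper names (unmount, reclaim_disks, replay_journal, the iface objects) are placeholders for the real operations listed in the bullets, not an actual API.

    def s1_request_reintegration(s1, s2):
        """S1, freshly rebooted, asks the backup for its resources back."""
        s1.primary_iface.disable()        # stay quiet on the old address until hand-back is done
        s2_handle_reintegration(s2, s1)

    def s2_handle_reintegration(s2, s1):
        """S2 stops impersonating S1."""
        s2.unmount(s1.volumes)            # stop serving S1's file systems
        s2.secondary_iface.reset()        # drop S1's IP/MAC, back to S2's own secondary address
        s1_finish_reintegration(s1)       # notify S1 that the hand-back is complete

    def s1_finish_reintegration(s1):
        """S1 reclaims its disks, repairs its state, and goes live again."""
        s1.reclaim_disks()
        s1.replay_journal()               # restores fs metadata and the reply cache
        s1.primary_iface.enable()         # resume serving clients at the original address
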
  • Network Failures
    • network failures are not transparent to the client host
      • observed by a client daemon process
      • which reroutes to an operational network (see the sketch after this list)
    • Heartbeats check whether a path is down
    • When impersonating the other server, a server must also send that server's heartbeats
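
A client-side sketch of the rerouting daemon: it tracks the last heartbeat seen on each network and switches paths when the current one has been silent too long. The timeout value and the PathMonitor interface are assumptions.

    import time

    HEARTBEAT_TIMEOUT = 10                     # assumed seconds of silence before a path is down

    class PathMonitor:
        def __init__(self, paths):
            self.paths = paths                 # e.g. ["net0", "net1"]
            self.last_seen = {p: time.time() for p in paths}
            self.current = paths[0]

        def on_heartbeat(self, path):
            self.last_seen[path] = time.time() # servers broadcast heartbeats on each network

        def route(self):
            now = time.time()
            if now - self.last_seen[self.current] > HEARTBEAT_TIMEOUT:
                alive = [p for p in self.paths if now - self.last_seen[p] <= HEARTBEAT_TIMEOUT]
                if alive:
                    self.current = alive[0]    # reroute NFS traffic to a working network
            return self.current
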
  • Performance
    • HA-NFS is faster than NFS
      • for anything that modifies meta-data, because of the journal
      • (caveat: not a fair comparison to add hardware to one system and not the other)
    • Same
      • Read
    • Writes are slower
      • mirroring (RAID 1): worst-case seek + rotation time of the two disks
    • Failover
      • 10 sec time-out
      • 5 sec tests for liveness
      • 15 secs for backup to take-over
      • => 30 secs of unavailability
    • Re-integration
      • 60s
      • Must wait for the backup to finish on-going NFS RPCs
  • Summary
    • Each server in a node acts as both primary and backup (no wasted resources)
    • Handles server, disk, and network failures
    • Local behavior impacts distributed system (Journal)