  • server reliability
    • dual-ported disk
  • disk reliability
    • mirroring files on different disks (same server)
  • network reliability
    • replication of network components
    • (workload distributed over the networks)

Goal

  • Transparent failure & recovery
  • No performance penalty during normal (fault-free) operation
  • No client modification required

Architecture

  • Each node consists of
    • Two servers
      • Each server has 2 network interfaces & IP addresses
      • The secondary is used when impersonating or re-integrating
    • A number of SCSI buses
      • Each disk has one primary server
  • Normal operation
    • Both servers exchange NFS NULL RPCs (RFS_NULL) as heartbeats
    • If heartbeats fail, probe the partner via network ping and the shared SCSI bus
    • Take over if both probes fail (see the monitor sketch after this list)
  • Take-over
    • The surviving server restores the failed server's file system
    • It changes its secondary interface's MAC/IP to the failed server's (impersonation)
  • Re-Integration
    • The failed server turns off its primary interface and sends a re-integration request to the backup server
    • The backup unmounts the volumes and resets its secondary interface
    • The failed server then restores its state and resumes service
  • Network Failure
    • When normal
      • The two servers in the same node use different networks as their primary
      • (Load balancing)
      • Servers broadcast heartbeats on the networks
    • When network fail
      • The client daemon times out and reroutes to an alternative path
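
A minimal sketch (in Python, not from the paper) of the liveness-check / take-over loop described above. The helper names (nfs_null_rpc, icmp_ping, scsi_probe, take_over) and the timing constants are illustrative assumptions, not HA-NFS's actual values.

    import time

    HEARTBEAT_INTERVAL = 5   # assumed seconds between NULL-RPC liveness checks
    MISS_LIMIT = 2           # assumed consecutive misses before deeper probing

    def monitor_partner(peer, nfs_null_rpc, icmp_ping, scsi_probe, take_over):
        """Run on each server of a node to watch its partner (hypothetical helpers)."""
        missed = 0
        while True:
            missed = 0 if nfs_null_rpc(peer) else missed + 1
            if missed >= MISS_LIMIT:
                # NFS layer is silent: confirm over two independent paths before acting
                net_alive = icmp_ping(peer)     # plain network-level ping
                scsi_alive = scsi_probe(peer)   # probe over the shared SCSI bus (target mode)
                if not net_alive and not scsi_alive:
                    take_over(peer)             # restore fs state, impersonate peer's IP/MAC
                    return
                missed = 0                      # partner is alive; only the NFS service lags
            time.sleep(HEARTBEAT_INTERVAL)

Only when both the network and the SCSI probes fail does the partner take over; a reachable but slow server is left alone, which avoids taking over a live primary.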

In-class

  • Goal: Replicated, distributed service
  • Why Hard?
    • Must agree in presence of failures + concurrency while being efficient
  • Simplest case: 2 nodes
    • If one fails, use other
      • detect failure, fail-stop, fail-recovery
    • Challenge: how to keep both copies up-to-date given updates
    • Assume strictest consistency
      • One-copy semantics
  • Approach 1: Symmetric Replicas
    • Send request to all replicas
    • Issue 1: Order of operations
      • R1: create(a), delete(a)
      • R2: delete(a), create(a)
      • replica divergence
      • agree on the order of inputs (atomic broadcast; see the sketch below)
    • Issue 2: Determinism
      • Server_1 and Server_2 must come to the same result given the same input
      • No race condition
      • No randomness
      • No dependence on exact time
    • Issue 3: Failure + Recovery
      • What happens to failed node?
        • Nothing
        • Works on other node
      • What happens when recovers?
        • Catch up with work
        • S2 ships its current state or a diff (log) to the recovered node
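
A minimal ordering sketch for Issue 1: a single sequencer assigns a global sequence number to every operation, and each replica applies operations strictly in that order. The class names and queue-based delivery are illustrative assumptions, not a full atomic-broadcast protocol (the sequencer itself is a single point of failure here).

    import itertools

    class Sequencer:
        """Assigns one global order to all client operations (sketch)."""
        def __init__(self, replica_queues):
            self.counter = itertools.count()
            self.replicas = replica_queues     # one FIFO queue per replica

        def broadcast(self, op):
            seq = next(self.counter)
            for q in self.replicas:
                q.put((seq, op))               # every replica sees the same (seq, op) pairs

    def apply_in_order(inbox, state):
        """Replica loop: apply ops strictly by sequence number, so replicas cannot diverge."""
        expected, pending = 0, {}
        while True:
            seq, op = inbox.get()
            pending[seq] = op
            while expected in pending:
                pending.pop(expected)(state)   # op must be deterministic: no randomness, no clock
                expected += 1

With both replicas applying create(a) and delete(a) in the same global order, the divergence in Issue 1 cannot happen; Issue 2 (determinism inside each op) is still the replicas' responsibility.
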
  • Approach 2: Primary-Backup Rep.
    • Client communicates only with primary
    • Primary handles request, updates self, sends state change to backup
    • Handled problems
      • ordering, determinism
    • Failure?
      • Backup failure?
        • Nothing
        • When backup returns
          • logging (diff)
          • whole state shipping
      • Primary failure?
        • Detect failure
        • State catch up
        • Clients may notice a performance hiccup during takeover
    • Issue
      • Failure detection
        • Heartbeats
          • Too short
            • backup takes over while the primary is still alive (split brain)
          • Too long
            • longer unavailability after a real failure
      • When to ack client?
        1. ack after the backup has finished applying the update (sketched after this list)
          • safe
        2. ack immediately while the backup applies the update concurrently
          • better performance
      • If there are multiple backups, which one takes over?
        • deterministic known order
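
A minimal primary-backup sketch under option 1 above (ack the client only after the backup has applied the state change). The classes and the dict-based state are illustrative assumptions, not code from the lecture.

    class Backup:
        def __init__(self):
            self.state = {}

        def apply(self, delta):
            self.state.update(delta)           # backup stays in lock-step with the primary

    class Primary:
        def __init__(self, backup):
            self.state = {}
            self.backup = backup               # assumed reliable channel to the backup

        def handle(self, key, value):
            self.state[key] = value            # 1. primary updates itself
            self.backup.apply({key: value})    # 2. ship the state change; returns once applied
            return "OK"                        # 3. only now ack the client (the "safe" choice)

    # usage: p = Primary(Backup()); p.handle("a", 1)

Because only the primary chooses an order and ships the resulting state changes (not the raw requests), the ordering and determinism issues of symmetric replicas disappear; acking before backup.apply returns would be option 2, trading safety for latency.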

HA-NFS

  • Properties
    1. Each server in a node acts as both primary and backup
    2. Handles server, disk, and network failures
    3. How local behavior impacts the distributed system
    4. Clients unchanged
      • Failure, recovery are transparent
    5. Good common-case (fault-free) performance
  • Crash consistency on 1 machine
    • Implemented on AIXv3
      • Journaling file system
    • Meta-data
      • Inodes, Directories, Bitmap, indirect blocks
    • What gets changed on a file append()
      • Inode, Bitmap
    • Goal after a crash
      • ensure all metadata is intact (see the journaling sketch below)
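
A minimal write-ahead journaling sketch for the append() case above: the record describing the inode and bitmap changes is made durable before the in-place structures are touched, so a crash can be repaired by replaying the journal. The file name, record format, and set-based metadata are illustrative assumptions, not AIXv3's JFS format.

    import json, os

    JOURNAL = "fs.journal"                     # assumed journal file name

    def journal_append(record):
        """Make the intended metadata change durable before applying it."""
        with open(JOURNAL, "a") as j:
            j.write(json.dumps(record) + "\n")
            j.flush()
            os.fsync(j.fileno())               # the record hits disk before we proceed

    def append_block(inodes, bitmap, inode_no, block_no):
        # 1. log the metadata change (inode + bitmap) as a single record
        journal_append({"op": "append", "inode": inode_no, "block": block_no})
        # 2. only then update the metadata structures themselves
        bitmap.add(block_no)                              # mark the new block allocated
        inodes.setdefault(inode_no, set()).add(block_no)  # point the inode at it

    def recover(inodes, bitmap):
        """After a crash, replay the journal so all metadata is intact (replay is idempotent)."""
        if os.path.exists(JOURNAL):
            with open(JOURNAL) as j:
                for line in j:
                    rec = json.loads(line)
                    if rec.get("op") == "append":
                        bitmap.add(rec["block"])
                        inodes.setdefault(rec["inode"], set()).add(rec["block"])
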
  • Journal
    • use the journal for the reply cache as well (sketch below)
    • (for non-idempotent operations)
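
The journal can double as the duplicate-request (reply) cache: if a retransmitted non-idempotent request arrives after a crash or take-over, the saved reply is returned instead of re-executing the operation. Keying on (client, xid) and reusing journal_append from the previous sketch are assumptions.

    class ReplyCache:
        """Sketch of a reply cache for non-idempotent NFS operations (e.g. create, remove)."""
        def __init__(self, journal_append):
            self.cache = {}                    # (client, xid) -> saved reply
            self.journal_append = journal_append

        def handle(self, client, xid, execute):
            key = (client, xid)
            if key in self.cache:              # retransmission: do NOT re-execute
                return self.cache[key]
            reply = execute()                  # run the non-idempotent operation exactly once
            self.cache[key] = reply
            self.journal_append({"op": "reply", "key": [client, xid], "reply": reply})
            return reply

After a take-over, the surviving server rebuilds this cache from the journal, so a client's retransmitted request gets the old reply rather than a second execution.
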
  • Architecture
    • Normal op
      • Update meta-data, track the non-idempotent reply cache
    • Failure detection
      • check liveness with heartbeats
      • No response
        • ping via ICMP
        • probe via SCSI bus target mode
    • Take over
      • bring the file system to a consistent state
      • replay journal for fs meta-data + reply cache
    • Re-integration
      • S1 reboots and sends a re-integration request to S2
      • S2 unmounts Volume 1 and switches Network Iface 2 back to its secondary IP
      • S2 notifies S1
      • S1 reclaims its disks, replays the log to reconstruct the reply cache, and switches Network Iface 1 back on (see the hand-back sketch below)
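
A sketch of the re-integration hand-back above, written as three cooperating steps. All helper names (unmount, reclaim_disks, replay_journal, the iface objects) are placeholders for the real operations listed in the bullets, not an actual API.

    def s1_request_reintegration(s1, s2):
        """S1, freshly rebooted, asks the backup for its resources back."""
        s1.primary_iface.disable()        # stay quiet on the old address until hand-back is done
        s2_handle_reintegration(s2, s1)

    def s2_handle_reintegration(s2, s1):
        """S2 stops impersonating S1."""
        s2.unmount(s1.volumes)            # stop serving S1's file systems
        s2.secondary_iface.reset()        # drop S1's IP/MAC, back to S2's own secondary address
        s1_finish_reintegration(s1)       # notify S1 that the hand-back is complete

    def s1_finish_reintegration(s1):
        """S1 reclaims its disks, repairs its state, and goes live again."""
        s1.reclaim_disks()
        s1.replay_journal()               # restores fs metadata and the reply cache
        s1.primary_iface.enable()         # resume serving clients at the original address
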
  • Network Failures
    • network failures are not transparent to the client host
      • observed by a client daemon process
      • which reroutes to an operational network (see the sketch after this list)
    • Heartbeats check whether a path is down
    • When impersonating the other server, a server must also send that server's heartbeats
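
A client-side sketch of the rerouting daemon: it tracks the last heartbeat seen on each network and switches paths when the current one has been silent too long. The timeout value and the PathMonitor interface are assumptions.

    import time

    HEARTBEAT_TIMEOUT = 10                     # assumed seconds of silence before a path is down

    class PathMonitor:
        def __init__(self, paths):
            self.paths = paths                 # e.g. ["net0", "net1"]
            self.last_seen = {p: time.time() for p in paths}
            self.current = paths[0]

        def on_heartbeat(self, path):
            self.last_seen[path] = time.time() # servers broadcast heartbeats on each network

        def route(self):
            now = time.time()
            if now - self.last_seen[self.current] > HEARTBEAT_TIMEOUT:
                alive = [p for p in self.paths if now - self.last_seen[p] <= HEARTBEAT_TIMEOUT]
                if alive:
                    self.current = alive[0]    # reroute NFS traffic to a working network
            return self.current
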
  • Performance
    • HA-NFS is faster than NFS
      • for anything that modifies meta-data, because of the journal
      • (caveat: not a fair comparison to add hardware to one system and not the other)
    • Same
      • Read
    • Writes are slower
      • mirroring (RAID 1): worst-case seek + rotation time of the two disks
    • Failover
      • 10 sec time-out
      • 5 sec tests for liveness
      • 15 secs for backup to take-over
      • => 30 secs of unavailability
    • Re-integration
      • 60s
      • Must wait for the backup to finish on-going NFS RPCs
  • Summary
    • Each server in a node acts as both primary and backup (no wasted resources)
    • Handles server, disk, and network failures
    • Local behavior impacts distributed system (Journal)