- server reliability
- disk reliability
- mirroring files on different disks (same server)
- network reliability
- replication of network components
- (workload distributed over the networks)
Goal #
- Transparent failure & recovery
- No performance penalty during normal operation
- No client modification
Architecture #
- Each node consists of:
- Two servers
- Each server has 2 network interfaces & IPs
- The secondary is used when impersonating or re-integrating
- A number of SCSI buses
- Each disk has one primary server
- Normal operation
- Both servers exchange NFS NULL RPCs as heartbeats
- If no response, ping via the network (ICMP) and via the SCSI bus
- Take over if both probes fail (see the sketch after this section)
- Take-over
- The other server restores the failed server's file system
- It changes its secondary interface's MAC/IP to those of the failed server
- Re-Integration
- Failed server turns off its primary interface and sends a request to the backup server
- Backup unmounts the volume and resets its secondary interface
- Failed server restores its state and resumes service
- Network Failure
- During normal operation
- The two servers in the same node use different networks as their primary
- (Load balancing)
- Servers broadcast heartbeats on their networks
- When a network fails
- A client-side daemon times out and reroutes requests to an alternative path
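A minimal sketch of the failure-detection/take-over loop described above, under the assumption that the three probes (NFS NULL RPC, ICMP ping, SCSI target-mode probe) and the take-over action are passed in as plain functions; all names and interval values here are illustrative, not taken from the paper.

```python
import time
from typing import Callable

def partner_alive(null_rpc: Callable[[], bool],
                  icmp_ping: Callable[[], bool],
                  scsi_probe: Callable[[], bool]) -> bool:
    """Liveness check: NULL RPC first, then fall back to an ICMP ping and a
    probe over the shared SCSI bus to tell a dead server from a dead network."""
    if null_rpc():
        return True
    return icmp_ping() or scsi_probe()

def monitor(null_rpc, icmp_ping, scsi_probe, take_over: Callable[[], None],
            interval: float = 5.0) -> None:
    """Periodically check the partner; impersonate it only when every path fails."""
    while True:
        if not partner_alive(null_rpc, icmp_ping, scsi_probe):
            take_over()   # restore its file systems, assume its MAC/IP
            return
        time.sleep(interval)

# Example: simulate a partner that stops answering all probes.
if __name__ == "__main__":
    dead = lambda: False
    monitor(dead, dead, dead,
            take_over=lambda: print("taking over partner"),
            interval=0.1)
```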
In-class #
- Goal: Replicated, distributed service
- Why Hard?
- Replicas must agree in the presence of failures and concurrency, while remaining efficient
- Simplest case: 2 nodes
- If one fails, use other
- Need failure detection; fail-stop vs. fail-recovery models
- Challenge: how to keep both copies up-to-date given updates
- Assume strictest consistency
- Approach 1: Symmetric Replicas
- Send request to all replicas
- Issue 1: Order of operations
- R1: create(a), delete(a)
- R2: delete(a), create(a)
- replica divergence
- Fix: agree on the order of inputs (atomic broadcast; see the sketches after this section)
- Issue 2: Determinism
- Server_1 and Server_2 must come to the same result given the same input
- No race condition
- No randomness
- No dependence on exact time
- Issue 3: Failure + Recovery
- What happens to failed node?
- Nothing
- Work continues on the other node
- What happens when it recovers?
- It must catch up on the work it missed
- S2 ships its current state or a diff of missed updates
- Approach 2: Primary-Backup Rep.
- Client communicates only with primary
- Primary handles request, updates self, sends state change to backup
- Handled problems
- Failure?
- Backup failure?
- Nothing immediately; the primary keeps serving
- When the backup returns, it catches up via
- logging (ship a diff of missed updates), or
- whole-state shipping
- Primary failure?
- Detect the failure
- Backup brings its state up to date and takes over
- Clients notice only a temporary performance issue
- Issues
- Failure detection
- Heartbeats
- Too short
- backup may take over while the primary is still alive (split brain)
- Too long
- real failures go undetected for longer, extending unavailability
- When to ack the client? (see the sketches after this section)
- ack after the backup has finished applying the update (safe, but slower)
- ack immediately while the backup applies it concurrently (faster, but an update can be lost if the primary then crashes)
- If there are multiple backups, which one takes over?
- Use a deterministic, known order
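A small sketch of Approach 1's ordering problem: the same two operations applied in different orders leave the replicas diverged, so the inputs have to be pushed through one agreed total order (a stand-in for atomic broadcast). The state model, a set of names, is illustrative.

```python
# A replica's state is just the set of names that exist; ops are create/delete.
def apply_ops(ops):
    state = set()
    for op, name in ops:
        if op == "create":
            state.add(name)
        elif op == "delete":
            state.discard(name)
    return state

# Issue 1: same operations, different order -> replica divergence.
r1 = apply_ops([("create", "a"), ("delete", "a")])   # R1 ends with set()
r2 = apply_ops([("delete", "a"), ("create", "a")])   # R2 ends with {"a"}
assert r1 != r2

# Fix (stand-in for atomic broadcast): every replica applies the agreed order,
# so identical, deterministic replicas converge to the same state.
agreed = [("create", "a"), ("delete", "a")]
assert apply_ops(agreed) == apply_ops(agreed)
```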
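A minimal sketch of Approach 2's update path (primary-backup), assuming a simple key-value state; the `ack_after_backup` flag models the "when to ack the client" choice. All class and method names are illustrative.

```python
from dataclasses import dataclass, field

@dataclass
class Replica:
    state: dict = field(default_factory=dict)

    def apply(self, op) -> None:
        key, value = op
        self.state[key] = value

@dataclass
class Primary(Replica):
    backup: Replica = None
    ack_after_backup: bool = True   # the "when to ack the client" design choice

    def handle(self, op) -> str:
        self.apply(op)                # 1. primary updates its own state
        if self.ack_after_backup:
            self.backup.apply(op)     # 2a. ship the state change and wait
            return "ack"              #     safe: backup has the update before the ack
        # 2b. ack first; in a real system the backup update below would run
        #     asynchronously, so a primary crash here could lose the update.
        reply = "ack"
        self.backup.apply(op)
        return reply

# Usage: the client talks only to the primary.
backup = Replica()
primary = Primary(backup=backup)
print(primary.handle(("a", 1)))       # -> "ack"
assert primary.state == backup.state == {"a": 1}
```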
HA-NFS #
- Properties
- Each server is both a primary (for its own disks) and a backup (for its partner's)
- Handles server, disk, and network failures
- How local behavior impacts the distributed system
- Clients unchanged
- Failure and recovery are transparent
- Good common-case (fault-free) performance
- Crash consistency on 1 machine
- Implemented on AIXv3
- Meta-data
- Inodes, Directories, Bitmap, indirect blocks
- What gets changed on a file append()
- Goal after a crash
- ensure all metadata is intact
- Journal
- Use the journal for the reply cache as well
- (for non-idempotent operations)
- Architecture
- Normal operation
- Journal meta-data updates, track non-idempotent requests in the reply cache
- Failure detection
- Check liveness with heartbeats
- If no response
- ping via ICMP
- probe via the SCSI bus (target mode)
- Take over
- Bring the file system to a consistent state
- Replay the journal to recover fs meta-data + the reply cache (see the replay sketch at the end of these notes)
- Re-integration
- S1 reboots and sends a re-integration request to S2
- S2 unmounts Volume 1 and switches Network Interface 2 back to its secondary IP
- S2 notifies S1
- S1 reclaims the disks, replays the log to reconstruct the reply cache, and switches Network Interface 1 back to its primary address
- Network Failures
- Network failures are not transparent to the client
- A daemon process on the client observes the failure
- and reroutes requests to an operational network
- Heartbeats are used to check whether a path is down
- When impersonating the other server, a server must also send that server's heartbeats
- Performance
- HA-NFS faster than NFS
- for anything that modifies meta-data, because of the journal
- (Not entirely fair: hardware was added to one system and not the other)
- Otherwise about the same
- Writes are slower
- RAID-1 mirroring: worst-case seek + rotation time across the two disks
- Failover
- 10 sec time-out
- 5 sec of liveness tests
- 15 sec for the backup to take over
- => ~30 sec of unavailability
- Re-integration
- ~60 sec
- Must wait for the backup to finish on-going NFS RPCs
- Summary
- Each server is both a primary and a backup (no wasted resources)
- Handles server, disk, and network failures
- Local behavior impacts the distributed system (the journal)
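A minimal sketch of the take-over replay step referenced above, under the assumption that the journal is a list of records carrying a meta-data update plus, for non-idempotent operations, the saved reply. The record layout and all names here are illustrative, not HA-NFS's actual on-disk format.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Record:
    txn_id: int
    committed: bool                       # only committed transactions are redone
    metadata_update: tuple                # e.g. ("inode", 42, {"size": 8192})
    saved_reply: Optional[tuple] = None   # (client_xid, reply) for non-idempotent ops

def replay(journal):
    """Bring meta-data to a consistent state and rebuild the reply cache,
    as the backup does when it takes over a failed server's disks."""
    metadata = {}      # stand-in for the on-disk meta-data structures
    reply_cache = {}   # client_xid -> previously sent reply
    for rec in journal:
        if not rec.committed:
            continue                       # incomplete transactions are discarded
        kind, obj_id, fields = rec.metadata_update
        metadata[(kind, obj_id)] = fields  # redo the meta-data change
        if rec.saved_reply is not None:
            xid, reply = rec.saved_reply
            reply_cache[xid] = reply       # a retransmitted request gets the old reply
    return metadata, reply_cache

# Example: one committed append and one transaction torn by the crash.
journal = [
    Record(1, True, ("inode", 42, {"size": 8192}), saved_reply=(1001, "OK")),
    Record(2, False, ("inode", 43, {"size": 0})),
]
meta, cache = replay(journal)
assert ("inode", 42) in meta and ("inode", 43) not in meta
assert cache == {1001: "OK"}
```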