Why Do Computers Stop and What Can Be Done About It?
- Availability: doing the right thing within the specified response time
- Reliability: Not doing the wrong thing
- Infant mortality: Bugs in new hardware/software
- Bohrbug vs. Heisenbug
- Heisenbug: non-deterministic; does not manifest on every run
Fault-tolerant Execution
- Process isolation
- Processes communicate via kernel messages and do not share state
Process-pairs (Spawn a Backup process)
- Lockstep:
- 2 processes doing exactly same thing
- Cons: Can’t tolerate Heisenbugs
- State Checkpointing
- Send full state to backup after every move
- Cons: Hard to program
- Automatic Checkpointing
- Save all messages and replay them to the backup when the primary fails
- Cons: Send a lot of data
- Delta Checkpointing:
- Send state delta to backup
- Cons: hard to program
- Persistence:
- Backup restarts in a null state; all state at the time of failure is lost (transactions can restore consistency)
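As a concrete illustration of the state-checkpointing variant above, here is a minimal sketch (class and method names are hypothetical, not from Gray's paper): the primary ships its full state to the backup after every operation, so the backup can take over from the latest checkpoint.

```python
# Minimal sketch of full-state checkpointing in a process pair.
# Names (Primary, Backup, checkpoint, take_over) are illustrative.

class Backup:
    def __init__(self):
        self.state = {}

    def checkpoint(self, state):
        # Receive a full-state checkpoint from the primary.
        self.state = dict(state)

    def take_over(self):
        # Resume from the most recent checkpoint.
        return dict(self.state)

class Primary:
    def __init__(self, backup):
        self.state = {}
        self.backup = backup

    def update(self, key, value):
        self.state[key] = value
        # Full-state checkpointing: send the entire state after every move.
        self.backup.checkpoint(self.state)

backup = Backup()
primary = Primary(backup)
primary.update("x", 1)
primary.update("y", 2)
# Primary "fails"; backup resumes with the checkpointed state.
assert backup.take_over() == {"x": 1, "y": 2}
```

Delta checkpointing would send only the changed `(key, value)` pair per update instead of the whole dictionary, trading less data on the wire for more complex programming.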
Questions
- What is the motivation for this work?
- Fault tolerance is important for computer systems
- Need to understand the sources of faults
- Techniques for handling faults
- What is the difference between availability and reliability?
- Availability: doing the right thing with in the specified response time
- Reliability: Not doing the wrong thing
- In their measurements, what is the likelihood that different types of failures cause system outages? What is an obvious observation and implication? What are “infant mortality” failures? What is a more interesting observation?
- Admin: 42%, Software: 25%, Hardware: 18%, Environment: 14%
- Infant mortality: new hw or sw with problems and bugs that are still being uncovered + fixed
- Software: don't apply fixes unless the bug is actually causing problems
- Hardware: upgrade quickly to get past infant mortality
- What are Gray’s principles of fault-tolerant execution? Why is a process a good abstraction? Why are messages good for fault tolerance? What is an alternative communication mechanism and why is it not as appropriate?
- Software modularity through processes & messages
- Process: modularity
- Fault boundaries
- Protection Boundaries
- Messages: well-defined interaction points
- Check content/data of messages
- Alternative communication? Shared memory
- What is fail-fast? How can one detect that a component has crashed (or halted)? How does one make a component fail-fast? What is a common approach for checking data is as expected?
- Fail-fast: operates correctly or detects fault, signals failure and stops
- Checking inputs, structures (checksum)
- Detection
- component: heart-beat / reset watchdog timer
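The heartbeat/watchdog detection mentioned above can be sketched as follows (the `Watchdog` class and timeout value are illustrative): the component resets a timer on each heartbeat, and the monitor declares it failed once the timer expires.

```python
import time

# Sketch of heartbeat-based crash detection. The monitored component
# periodically calls heartbeat(); the monitor treats a missed deadline
# as a crash (fail-fast: silence is interpreted as failure).

class Watchdog:
    def __init__(self, timeout):
        self.timeout = timeout
        self.last_beat = time.monotonic()

    def heartbeat(self):
        # Component resets the watchdog timer on each beat.
        self.last_beat = time.monotonic()

    def is_alive(self):
        return (time.monotonic() - self.last_beat) < self.timeout

wd = Watchdog(timeout=0.05)
wd.heartbeat()
assert wd.is_alive()       # a beat just arrived
time.sleep(0.1)            # component goes silent
assert not wd.is_alive()   # deadline missed -> treat as crashed
```

Checksums play the complementary role for data rather than liveness: a component verifies the checksum of its inputs and structures, and signals failure instead of proceeding on bad data.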
- What are Bohr bugs versus Heisenbugs? How can each be handled?
- Bohr: simple, deterministic (same input -> same bug)
- Heisenbugs: non-deterministic (timing/ordering dependent); not repeatable
- What is the basic idea of a process pair? What is the problem with a lock step approach? How would you compare full state checkpoints, saving all messages, or sending deltas?
- Primary & backup: the primary does the work; the backup takes over on failure
- What are transactions and what do they ensure? How do persistent process pairs work with transactions?
- `begin Tx` update stuff... `end Tx`: the updates commit or abort as a unit
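The all-or-nothing semantics of `begin Tx ... end Tx` can be demonstrated with SQLite (table and column names are illustrative): if a failure hits before the transaction ends, every update inside it is rolled back together.

```python
import sqlite3

# Sketch of transaction atomicity: a group of updates either all commit
# ("end Tx") or all roll back if a failure occurs mid-transaction.

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INT)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("a", 100), ("b", 0)])
conn.commit()

try:
    with conn:  # begin Tx; commits on success, rolls back on exception
        conn.execute("UPDATE accounts SET balance = balance - 50 WHERE name = 'a'")
        conn.execute("UPDATE accounts SET balance = balance + 50 WHERE name = 'b'")
        raise RuntimeError("simulated crash before end Tx")
except RuntimeError:
    pass  # both updates were rolled back together

# Atomicity: neither the debit nor the credit survived the simulated crash.
assert conn.execute("SELECT balance FROM accounts WHERE name = 'a'").fetchone()[0] == 100
assert conn.execute("SELECT balance FROM accounts WHERE name = 'b'").fetchone()[0] == 0
```

This is what lets persistent process pairs work despite losing in-memory state at takeover: the backup restarts in a null state, and any transaction the primary had in flight simply aborts, leaving the durable state consistent.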
Redundancy Does Not Imply Fault-Tolerance
Motivation
- File systems will fault
- Corruption
- Read/Write Error
- How do distributed storage systems handle them?
Fault Model
- Inject a single fault into a single block on a single node
Methodology
- errfs
- A FUSE file system that injects errors/corruptions
- errbench
- Read/insert/update an existing data item, check for errors
- Committed data should not be lost
- Queries should not silently return corrupted data
- Cluster should be available for reads and writes
- Queries should not fail after retries
Observations
- Faults often go undetected; even when detected, the system may simply crash
- Single fault can take down the whole cluster
- Confuse crash and corruption handling
- Spread corruption or data loss
End Notes
- Systems expect underlying storage stack to be reliable
- Recovery code is not rigorously tested
- Can’t tackle partial faults
- Underutilized redundancy
Questions
What is the motivation for this work?
- Do distributed storage systems use redundancy to recover?
- What faults do they investigate?
- File-system fault
- Are these faults fail-stop?
- No
- How do these faults happen in the real world?
- Realistic
How do they inject these faults?
- errfs
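errfs itself is a FUSE file system, but its fault model can be illustrated with a toy stand-in (the helper name and 4 KB block size are assumptions for the sketch): corrupt exactly one block of a file on read, mirroring the paper's single-fault, single-block model.

```python
import os
import tempfile

# Toy stand-in for an errfs-style fault injector: corrupt exactly one
# block of a file on the read path. Block size and names are illustrative.

BLOCK = 4096

def read_with_corruption(path, bad_block):
    with open(path, "rb") as f:
        data = bytearray(f.read())
    start = bad_block * BLOCK
    for i in range(start, min(start + BLOCK, len(data))):
        data[i] ^= 0xFF  # flip every byte in the targeted block
    return bytes(data)

# Two-block file of 'A's; inject corruption into block 0 only.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"A" * (2 * BLOCK))
    path = f.name

corrupted = read_with_corruption(path, bad_block=0)
os.unlink(path)

assert corrupted[:BLOCK] != b"A" * BLOCK   # block 0 corrupted
assert corrupted[BLOCK:] == b"A" * BLOCK   # block 1 untouched
```

The real errfs interposes below the storage system under test, so the system sees the corrupted (or error-returning) block through ordinary file-system calls, with no other changes to its environment.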
Which distributed systems do they investigate?
- 8 systems (Redis, MongoDB, …)
What are their expectations for how these systems will react to their injected faults?
- See above
What is the difference between a global effect and a local behavior? What are some of the local behaviors that are observed?
- Log, retry, crash, fix …
Examining Figure 1 in Detail: what is shown in different rows? different columns? across columns of tables? How do the global effects in the table map to their stated expectations? Are there any global effects not mentioned previously? How might you more easily summarize the overall results per system?
- Reduce to 1 row?
How is data corruption often detected by applications? In Redis, what happens if user meta-data is corrupted on the leader – locally and then globally? What happens if user data is corrupted on the leader – locally and then globally? What if this happens on a follower?
In general, what often happens if faults are not detected locally?
- Return corrupted data
What is the most common local reaction to a file system fault? What global effects does this lead to?
- Crashes -> reduced redundancy
How are checksums used to determine if updates have been persisted to local disk? How can a local node determine if a checksum mismatch is due to corruption or crash? What problems can this lead to?
- Data loss
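The checksum mechanism behind these questions can be sketched as follows (record layout and function names are illustrative): a CRC is stored alongside the data, and a mismatch on read signals a problem — but, as the paper observes, the mismatch alone cannot say whether the cause was corruption or a partial write from a crash.

```python
import zlib

# Sketch of checksum-protected records: store a CRC with the data and
# verify it on read. A mismatch detects the fault but is ambiguous
# between corruption and a torn/partial write after a crash.

def write_record(data: bytes) -> bytes:
    return zlib.crc32(data).to_bytes(4, "big") + data

def read_record(record: bytes) -> bytes:
    stored = int.from_bytes(record[:4], "big")
    data = record[4:]
    if zlib.crc32(data) != stored:
        raise ValueError("checksum mismatch: corruption or partial write?")
    return data

rec = write_record(b"committed data")
assert read_record(rec) == b"committed data"

# Flip one byte of the payload: the mismatch is caught on read.
bad = bytearray(rec)
bad[7] ^= 0x01
try:
    read_record(bytes(bad))
    raise AssertionError("corruption went undetected")
except ValueError:
    pass
```

Treating every mismatch as a crash-time partial write (and truncating the record) is exactly the conflation the paper flags: it silently turns corruption into data loss.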
What conclusions would you make from this paper?