Why Do Computers Stop and What Can Be Done About It?
- Availability: doing the right thing within the specified response time
- Reliability: Not doing the wrong thing
- Infant mortality: Bugs in new hardware/software
- Bohrbug vs. Heisenbug
- Heisenbug: non-deterministic; does not manifest on every run
Fault-tolerant Execution
- Process isolation
- Processes communicate via kernel messages and do not share state
Process-pairs (Spawn a Backup process)
- Lockstep:
- 2 processes doing exactly same thing
- Cons: Can’t tolerate Heisenbugs
- State Checkpointing
- Send full state to backup after every move
- Cons: Hard to program
- Automatic Checkpointing
- Save all messages and replay them to the backup when the primary fails
- Cons: Send a lot of data
- Delta Checkpointing:
- Send state delta to backup
- Cons: hard to program
- Persistence:
- Backup restarts in a null state; all state at the time of failure is lost (transactions can restore consistency)
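As a concrete illustration of the state-checkpointing variant above, here is a minimal sketch (class and method names are hypothetical, not from Gray's paper): the primary ships its full state to the backup after every operation, so the backup can take over from the latest checkpoint.

```python
# Minimal sketch of full-state checkpointing in a process pair.
# Names (Primary, Backup, checkpoint, take_over) are illustrative.

class Backup:
    def __init__(self):
        self.state = {}

    def checkpoint(self, state):
        # Receive a full-state checkpoint from the primary.
        self.state = dict(state)

    def take_over(self):
        # Resume from the most recent checkpoint.
        return dict(self.state)

class Primary:
    def __init__(self, backup):
        self.state = {}
        self.backup = backup

    def update(self, key, value):
        self.state[key] = value
        # Full-state checkpointing: send the entire state after every move.
        self.backup.checkpoint(self.state)

backup = Backup()
primary = Primary(backup)
primary.update("x", 1)
primary.update("y", 2)
# Primary "fails"; backup resumes with the checkpointed state.
assert backup.take_over() == {"x": 1, "y": 2}
```

Delta checkpointing would send only the changed `(key, value)` pair per update instead of the whole dictionary, trading less data on the wire for more complex programming.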
Questions
- What is the motivation for this work?
- Fault tolerance is important for computer systems
- Need to understand the sources of faults
- Techniques for handling faults
- What is the difference between availability and reliability?
- Availability: doing the right thing with in the specified response time
- Reliability: Not doing the wrong thing
- In their measurements, what is the likelihood that different types of failures cause system outages? What is an obvious observation and implication? What are “infant mortality” failures? What is a more interesting observation?
- Admin: 42%, Software: 25%, Hardware: 18%, Environment: 14%
- Infant mortality: new hw or sw with problems and bugs that are still being uncovered + fixed
- Software: don't apply fixes unless the bug is actually causing problems
- Hardware: upgrade quickly to get past infant mortality
- What are Gray’s principles of fault-tolerant execution? Why is a process a good abstraction? Why are messages good for fault tolerance? What is an alternative communication mechanism and why is it not as appropriate?
- Software modularity through processes & messages
- Process: modularity
- Fault boundaries
- Protection Boundaries
- Messages: well-defined interaction points
- Check content/data of messages
- Alternative communication? Shared memory
- What is fail-fast? How can one detect that a component has crashed (or halted)? How does one make a component fail-fast? What is a common approach for checking data is as expected?
- Fail-fast: operates correctly or detects fault, signals failure and stops
- Checking inputs, structures (checksum)
- Detection
- component: heart-beat / reset watchdog timer
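The heartbeat/watchdog detection mentioned above can be sketched as follows (the `Watchdog` class and timeout value are illustrative): the component resets a timer on each heartbeat, and the monitor declares it failed once the timer expires.

```python
import time

# Sketch of heartbeat-based crash detection. The monitored component
# periodically calls heartbeat(); the monitor treats a missed deadline
# as a crash (fail-fast: silence is interpreted as failure).

class Watchdog:
    def __init__(self, timeout):
        self.timeout = timeout
        self.last_beat = time.monotonic()

    def heartbeat(self):
        # Component resets the watchdog timer on each beat.
        self.last_beat = time.monotonic()

    def is_alive(self):
        return (time.monotonic() - self.last_beat) < self.timeout

wd = Watchdog(timeout=0.05)
wd.heartbeat()
assert wd.is_alive()       # a beat just arrived
time.sleep(0.1)            # component goes silent
assert not wd.is_alive()   # deadline missed -> treat as crashed
```

Checksums play the complementary role for data rather than liveness: a component verifies the checksum of its inputs and structures, and signals failure instead of proceeding on bad data.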
- What are Bohr bugs versus Heisenbugs? How can each be handled?
- Bohr: simple, deterministic (same input -> same bug)
- Heisenbugs: non-deterministic (timing/ordering dependent); not repeatable
- What is the basic idea of a process pair? What is the problem with a lock step approach? How would you compare full state checkpoints, saving all messages, or sending deltas?
- Primary & backup: the primary does the work; the backup takes over on failure
- What are transactions and what do they ensure? How do persistent process pairs work with transactions?
- `begin Tx` update stuff... `end Tx`: the updates commit or abort as a unit
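The all-or-nothing semantics of `begin Tx ... end Tx` can be demonstrated with SQLite (table and column names are illustrative): if a failure hits before the transaction ends, every update inside it is rolled back together.

```python
import sqlite3

# Sketch of transaction atomicity: a group of updates either all commit
# ("end Tx") or all roll back if a failure occurs mid-transaction.

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INT)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("a", 100), ("b", 0)])
conn.commit()

try:
    with conn:  # begin Tx; commits on success, rolls back on exception
        conn.execute("UPDATE accounts SET balance = balance - 50 WHERE name = 'a'")
        conn.execute("UPDATE accounts SET balance = balance + 50 WHERE name = 'b'")
        raise RuntimeError("simulated crash before end Tx")
except RuntimeError:
    pass  # both updates were rolled back together

# Atomicity: neither the debit nor the credit survived the simulated crash.
assert conn.execute("SELECT balance FROM accounts WHERE name = 'a'").fetchone()[0] == 100
assert conn.execute("SELECT balance FROM accounts WHERE name = 'b'").fetchone()[0] == 0
```

This is what lets persistent process pairs work despite losing in-memory state at takeover: the backup restarts in a null state, and any transaction the primary had in flight simply aborts, leaving the durable state consistent.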
Redundancy Does Not Imply Fault-Tolerance
Motivation
- File systems will fault
- Corruption
- Read/Write Error
- How do distributed storage systems handle them?
Fault Model
- Inject a single fault into a single block on a single node
Methodology
- errfs
- A FUSE file system that injects errors/corruptions
- errbench
- Read/insert/update an existing data item, check for errors
- Committed data should not be lost
- Queries should not silently return corrupted data
- Cluster should be available for reads and writes
- Queries should not fail after retries
Observations
- Faults often go undetected; even when detected, the system may simply crash
- Single fault can take down the whole cluster
- Confuse crash and corruption handling
- Spread corruption or data loss
End Notes
- Systems expect underlying storage stack to be reliable
- Recovery code is not rigorously tested
- Can’t tackle partial faults
- Underutilized redundancy
Questions
What is the motivation for this work?
- Do distributed storage systems use redundancy to recover?
- What faults do they investigate?
- File-system fault
- Are these faults fail-stop?
- No
- How do these faults happen in the real world?
- Realistic
How do they inject these faults?
- errfs
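errfs itself is a FUSE file system, but its fault model can be illustrated with a toy stand-in (the helper name and 4 KB block size are assumptions for the sketch): corrupt exactly one block of a file on read, mirroring the paper's single-fault, single-block model.

```python
import os
import tempfile

# Toy stand-in for an errfs-style fault injector: corrupt exactly one
# block of a file on the read path. Block size and names are illustrative.

BLOCK = 4096

def read_with_corruption(path, bad_block):
    with open(path, "rb") as f:
        data = bytearray(f.read())
    start = bad_block * BLOCK
    for i in range(start, min(start + BLOCK, len(data))):
        data[i] ^= 0xFF  # flip every byte in the targeted block
    return bytes(data)

# Two-block file of 'A's; inject corruption into block 0 only.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"A" * (2 * BLOCK))
    path = f.name

corrupted = read_with_corruption(path, bad_block=0)
os.unlink(path)

assert corrupted[:BLOCK] != b"A" * BLOCK   # block 0 corrupted
assert corrupted[BLOCK:] == b"A" * BLOCK   # block 1 untouched
```

The real errfs interposes below the storage system under test, so the system sees the corrupted (or error-returning) block through ordinary file-system calls, with no other changes to its environment.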
Which distributed systems do they investigate?
- 8 systems (Redis, MongoDB, …)
What are their expectations for how these systems will react to their injected faults?
- See above
What is the difference between a global effect and a local behavior? What are some of the local behaviors that are observed?
- Log, retry, crash, fix …
Examining Figure 1 in Detail: what is shown in different rows? different columns? across columns of tables? How do the global effects in the table map to their stated expectations? Are there any global effects not mentioned previously? How might you more easily summarize the overall results per system?
- Reduce to 1 row?
How is data corruption often detected by applications? In Redis, what happens if user meta-data is corrupted on the leader – locally and then globally? What happens if user data is corrupted on the leader – locally and then globally? What if this happens on a follower?
In general, what often happens if faults are not detected locally?
- Return corrupted data
What is the most common local reaction to a file system fault? What global effects does this lead to?
- Crashes -> reduced redundancy
How are checksums used to determine if updates have been persisted to local disk? How can a local node determine if a checksum mismatch is due to corruption or crash? What problems can this lead to?
- Data loss
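The checksum mechanism behind these questions can be sketched as follows (record layout and function names are illustrative): a CRC is stored alongside the data, and a mismatch on read signals a problem — but, as the paper observes, the mismatch alone cannot say whether the cause was corruption or a partial write from a crash.

```python
import zlib

# Sketch of checksum-protected records: store a CRC with the data and
# verify it on read. A mismatch detects the fault but is ambiguous
# between corruption and a torn/partial write after a crash.

def write_record(data: bytes) -> bytes:
    return zlib.crc32(data).to_bytes(4, "big") + data

def read_record(record: bytes) -> bytes:
    stored = int.from_bytes(record[:4], "big")
    data = record[4:]
    if zlib.crc32(data) != stored:
        raise ValueError("checksum mismatch: corruption or partial write?")
    return data

rec = write_record(b"committed data")
assert read_record(rec) == b"committed data"

# Flip one byte of the payload: the mismatch is caught on read.
bad = bytearray(rec)
bad[7] ^= 0x01
try:
    read_record(bytes(bad))
    raise AssertionError("corruption went undetected")
except ValueError:
    pass
```

Treating every mismatch as a crash-time partial write (and truncating the record) is exactly the conflation the paper flags: it silently turns corruption into data loss.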
What conclusions would you make from this paper?