  • Why build distributed systems?
    • Performance: a single computer can’t handle the load
      • throughput, latency
      • cost/performance
      • commodity components
      • elasticity
      • incremental scalability
    • Fault-tolerance
      • availability
      • reliability: don’t do something wrong
      • data sharing
  • Why study distributed systems?
    • Important
    • Practical
    • Challenging / Interesting
  • Why is it challenging?
    • Faults: fail-stop, crash
      • slow nodes, misbehaving nodes
      • nodes disagree, who to trust?
    • Interoperation: lack of global state
      • File A has different content on different nodes
      • how many jobs are on node B?
      • which nodes are part of the system?
      • node delays, message ordering, lost messages
    • Obtaining high performance
      • network communication adds overhead
      • extra work in fault handling
    • Lowering costs
      • identify bottlenecks
      • determine the amount of redundancy needed for the desired reliability/availability
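
The redundancy bullet above can be made concrete with a back-of-the-envelope calculation. The sketch below is not from the notes: it assumes each replica is up independently with the same probability, and that the system is available as long as at least one replica is up, so n replicas give availability 1 - (1 - a)^n. Real failures are often correlated, so treat the result as an optimistic lower bound on the replica count. The function names are hypothetical.

```python
def system_availability(node_availability: float, replicas: int) -> float:
    """Availability of `replicas` independent copies (assumed model):
    the system is down only if every replica is down at once."""
    return 1.0 - (1.0 - node_availability) ** replicas


def replicas_needed(node_availability: float, target: float,
                    max_replicas: int = 1000) -> int:
    """Smallest replica count whose modeled availability reaches `target`."""
    if not 0.0 < node_availability < 1.0:
        raise ValueError("node_availability must be strictly between 0 and 1")
    if not 0.0 < target < 1.0:
        raise ValueError("target must be strictly between 0 and 1")
    # Search upward instead of using logarithms, which keeps the
    # arithmetic easy to follow and avoids rounding surprises.
    for n in range(1, max_replicas + 1):
        if system_availability(node_availability, n) >= target:
            return n
    raise ValueError("target not reachable within max_replicas")


if __name__ == "__main__":
    # Nodes that are each up 95% of the time, aiming for 99.9% overall.
    print(replicas_needed(0.95, 0.999))
```

Under this model, each extra replica multiplies the downtime probability by (1 - a), so the cost of redundancy grows linearly while unavailability shrinks geometrically. That trade-off is what the cost bullets are pointing at.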

What Types of Failures?

  • Halting failures
    • Crash & no notification
  • Fail-stop
    • Halting failure + notification
  • Omission failures
  • Network failure
  • Network partition failure
  • Timing failures
  • Byzantine failures
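
The difference between a halting failure and a fail-stop failure shows up in what an observer can know. The sketch below is illustrative and not part of the notes: a hypothetical heartbeat-based `FailureDetector` that takes the current time as an argument so the example stays deterministic. A node that announces its failure (fail-stop) can be marked failed with certainty. A node that goes silent can only be suspected, because a crash, a slow node, a lost message, and a network partition all look the same from the outside.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, Set


class Status(Enum):
    ALIVE = "alive"
    SUSPECTED = "suspected"  # silent past the timeout: crashed, slow, or partitioned
    FAILED = "failed"        # the failure was announced (fail-stop)


@dataclass
class FailureDetector:
    timeout: float  # seconds of silence tolerated before suspecting a node
    last_heartbeat: Dict[str, float] = field(default_factory=dict)
    announced_failed: Set[str] = field(default_factory=set)

    def heartbeat(self, node: str, now: float) -> None:
        """Record that a heartbeat from `node` arrived at time `now`."""
        self.last_heartbeat[node] = now

    def notify_failure(self, node: str) -> None:
        """Record an explicit failure notification (the fail-stop case)."""
        self.announced_failed.add(node)

    def status(self, node: str, now: float) -> Status:
        if node in self.announced_failed:
            return Status.FAILED
        last = self.last_heartbeat.get(node)
        if last is None or now - last > self.timeout:
            # Silence alone cannot distinguish a crash from a timing,
            # omission, or partition failure.
            return Status.SUSPECTED
        return Status.ALIVE


if __name__ == "__main__":
    fd = FailureDetector(timeout=3.0)
    fd.heartbeat("A", now=0.0)
    print(fd.status("A", now=2.0))  # within the timeout
    print(fd.status("A", now=5.0))  # silent for too long
    fd.heartbeat("A", now=6.0)      # the node was only slow
    print(fd.status("A", now=7.0))
```

A suspicion can be withdrawn when a late heartbeat arrives, which is why timeout-based detection trades false positives (short timeout) against slow detection (long timeout). Byzantine failures are outside what this sketch can detect, since a misbehaving node may keep sending well-formed heartbeats.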