Assumptions

  • KV Store
  • Small objects (< 1 MB)
  • Weak consistency (eventual)

Considerations

  • Always writeable
    • Conflict resolution during read
  • Application can implement its own merge policy
    • Or use a simple policy the data store provides (last write wins)
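
A minimal sketch of both policies, assuming each conflicting version is a dict carrying a wall-clock timestamp and a list of items (names are illustrative, not the store's real API):

```python
# Store-provided policy: keep only the version with the newest timestamp.
def last_write_wins(versions):
    return max(versions, key=lambda v: v["timestamp"])

# Application-provided policy: merge by unioning the items of all versions
# (the classic shopping-cart example).
def merge_items(versions):
    merged = set()
    for v in versions:
        merged |= set(v["items"])
    return {"items": sorted(merged)}

conflicts = [
    {"timestamp": 10, "items": ["book"]},
    {"timestamp": 12, "items": ["pen"]},
]
print(last_write_wins(conflicts))  # {'timestamp': 12, 'items': ['pen']}
print(merge_items(conflicts))      # {'items': ['book', 'pen']}
```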

Architecture

Interface

  • get(key) -> (Array[object], context)
    • 1 object or many conflicting objects
  • put(key, context, object)
    • context: vector clock and other metadata
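
A rough sketch of that interface in Python, assuming the context is an opaque object that carries the vector clock back from get() to put() (class and field names are assumptions):

```python
from dataclasses import dataclass, field

@dataclass
class Context:
    vector_clock: dict = field(default_factory=dict)  # node id -> counter
    metadata: dict = field(default_factory=dict)

class KVStore:
    def get(self, key: str) -> tuple[list[object], Context]:
        """Return one object, or several causally-unrelated (conflicting) ones,
        plus the context the caller must pass back on the next put()."""
        raise NotImplementedError

    def put(self, key: str, context: Context, obj: object) -> None:
        """Write obj; the context tells the store which versions it supersedes."""
        raise NotImplementedError
```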

Partitioning

  • Each node hashes to a location on a ring
  • Each key hashes to another location on the ring
    • walk clockwise and store it on the nearest node
  • Improvement: each node hashes to many virtual nodes on the ring (better load spreading)
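
A small consistent-hashing sketch: node names and keys hash onto the same ring, a key is stored on the first virtual node found clockwise, and each physical node owns several virtual nodes (vnode count and hash choice here are arbitrary):

```python
import bisect
import hashlib

def ring_hash(s: str) -> int:
    """Hash a string to a position on the ring."""
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, nodes, vnodes=8):
        # Each physical node is hashed onto the ring at `vnodes` positions.
        self.ring = sorted((ring_hash(f"{n}#{i}"), n)
                           for n in nodes for i in range(vnodes))
        self.points = [p for p, _ in self.ring]

    def node_for(self, key: str) -> str:
        """Walk clockwise from the key's position to the nearest virtual node."""
        i = bisect.bisect_right(self.points, ring_hash(key)) % len(self.ring)
        return self.ring[i][1]

ring = Ring(["A", "B", "C"])
print(ring.node_for("user:42"))
```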

Replication

  • Each key is replicated at the N-1 clockwise successor nodes (N copies in total)
  • preference list - the list of nodes responsible for storing a particular key
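
Building on the Ring sketch above, the preference list can be derived by continuing the clockwise walk until N distinct physical nodes have been collected, so several virtual nodes of the same machine are not counted twice:

```python
import bisect

def preference_list(ring: "Ring", key: str, n: int = 3) -> list[str]:
    """First n distinct physical nodes found walking clockwise from the key."""
    start = bisect.bisect_right(ring.points, ring_hash(key))
    prefs = []
    for i in range(len(ring.ring)):
        node = ring.ring[(start + i) % len(ring.ring)][1]
        if node not in prefs:
            prefs.append(node)
        if len(prefs) == n:
            break
    return prefs

print(preference_list(ring, "user:42", n=2))
```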

Versioning

  • Vector clock compression
    • Remove the oldest entry when the clock size exceeds a threshold
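
A sketch of a vector clock with the truncation described above, assuming each entry also records when it was last updated so the least recently updated entry can be dropped once the clock grows past a threshold:

```python
# clock: node id -> (counter, last_update_timestamp)

def descends(a: dict, b: dict) -> bool:
    """True if clock a is a descendant of clock b (a has seen everything b has)."""
    return all(a.get(node, (0, 0))[0] >= counter
               for node, (counter, _) in b.items())

def truncate(clock: dict, threshold: int = 10) -> dict:
    """Keep at most `threshold` entries, dropping the least recently updated ones."""
    if len(clock) <= threshold:
        return clock
    newest = sorted(clock.items(), key=lambda kv: kv[1][1], reverse=True)
    return dict(newest[:threshold])

clock = {"A": (3, 100), "B": (1, 50), "C": (2, 200)}
print(truncate(clock, threshold=2))     # "B" is dropped as the oldest entry
print(descends(clock, {"B": (1, 50)}))  # True
```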

get()/put()

  • Configure quorum sizes $R$ and $W$
  • Write goes to a coordinator (the nearest node in the preference list, or one chosen by a load balancer)
    • coordinator sends the write to the top $N$ nodes in the preference list
    • waits for $W-1$ replies ($W$ in total, counting its own local write)
    • returns
  • Read is sent to $N$ nodes; the coordinator waits for $R$ responses
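
A sketch of the coordinator's quorum logic under these assumptions: `prefs` is the key's preference list, the first entry acts as the coordinator, and `send_write`/`send_read` are hypothetical blocking RPC helpers, so replies are gathered sequentially rather than in parallel:

```python
def coordinated_put(prefs, key, context, obj, w, send_write) -> bool:
    """Forward the write to the N replicas; succeed once W of them acknowledge."""
    acks = 0
    for node in prefs:
        if send_write(node, key, context, obj):
            acks += 1
        if acks >= w:
            return True
    return False

def coordinated_get(prefs, key, r, send_read) -> list:
    """Read from the replicas; return the versions seen once R of them reply.
    The caller reconciles any causally-unrelated versions."""
    replies = []
    for node in prefs:
        version = send_read(node, key)
        if version is not None:
            replies.append(version)
        if len(replies) >= r:
            break
    return replies
```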

Failures

  • When a replica in the preference list is down (hinted handoff)
  • Send the data to the next healthy node, tagged with a hint naming the intended node
  • The stand-in node hands the data back once the previously failed node recovers
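
A hinted-handoff sketch: writes meant for a down node go to a stand-in node along with a hint naming the intended owner, and a background loop replays them once the owner is healthy again (`alive` and `deliver` are assumed helpers):

```python
hints = []  # (intended_node, key, obj) kept by the stand-in node

def write_with_hint(prefs, extra_nodes, key, obj, alive, deliver):
    """Write to every replica; route around dead ones and remember a hint."""
    for node in prefs:
        if alive(node):
            deliver(node, key, obj)
        else:
            stand_in = next(n for n in extra_nodes if alive(n))
            deliver(stand_in, key, obj)
            hints.append((node, key, obj))

def handoff_once(alive, deliver):
    """Retry hinted writes; hand data back to owners that have recovered."""
    global hints
    still_pending = []
    for node, key, obj in hints:
        if alive(node):
            deliver(node, key, obj)
        else:
            still_pending.append((node, key, obj))
    hints = still_pending
```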

Membership change

  • Adding or removing nodes requires an explicit manual action
  • Each node contacts a peer at random every second to reconcile membership/partition changes (gossip)
  • Failures are detected locally (a peer that stops responding is treated as unreachable)
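
A gossip sketch: each node keeps a timestamped membership view and, once per round, reconciles it with a randomly chosen peer so the freshest entry for every member wins (all names here are illustrative):

```python
import random
import time

class Node:
    def __init__(self, name: str):
        self.name = name
        self.membership = {name: time.time()}  # member -> last-updated timestamp

    def gossip_with(self, peer: "Node") -> None:
        """Both nodes end up with the newest timestamp seen for every member."""
        members = set(self.membership) | set(peer.membership)
        merged = {m: max(self.membership.get(m, 0.0), peer.membership.get(m, 0.0))
                  for m in members}
        self.membership = dict(merged)
        peer.membership = dict(merged)

    def gossip_round(self, peers: list) -> None:
        self.gossip_with(random.choice(peers))

a, b = Node("A"), Node("B")
a.gossip_round([b])
print(sorted(a.membership))  # ['A', 'B'] after one exchange
```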

Misc

  • To account for node failures, the preference list contains more than N nodes
  • External discovery: seed nodes known to all members keep the gossiped ring from becoming logically partitioned