Chapter 6: Design a Key-Value Store

Problem & Scope

  • APIs: put(key, value), get(key) (and often delete(key)).

  • Assumptions/targets:

    • Small item size (≤ ~10 KB), huge dataset, low latency.
    • High availability, horizontal scalability, automatic scaling, tunable consistency.
  • Trade-offs: read vs write cost, memory vs disk, consistency vs availability (CAP).

CAP Theorem (framing)

  • Consistency: all clients see the same data at the same time.

  • Availability: every request receives a response (even during failures).

  • Partition tolerance: system continues despite network partitions.

  • In real distributed systems, P is non-negotiable → choose CP (favor C) or AP (favor A).

    • CP example: banking balance during partition → may reject requests to keep C.
    • AP example: social counters → serve stale data; converge later (eventual consistency).

Data Partitioning

Consistent Hashing (ring)

  • Place nodes on a hash ring; place keys on the same ring.
  • A key lives on the first node clockwise from its hash position.
  • Benefits: minimal key movement when nodes join/leave; near-even key spread once virtual nodes are added (see the sketch below).
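
To make the ring concrete, here is a minimal Python sketch, assuming MD5 as the ring hash; the node names and key are illustrative. `HashRing.lookup` returns the first node clockwise from the key's position:

```python
import hashlib
from bisect import bisect_right

def ring_hash(value: str) -> int:
    """Map a string to a position in the 128-bit MD5 ring space."""
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

class HashRing:
    def __init__(self, nodes):
        # Sorted (position, node) pairs form the ring.
        self._ring = sorted((ring_hash(n), n) for n in nodes)
        self._positions = [pos for pos, _ in self._ring]

    def lookup(self, key: str) -> str:
        """Return the first node clockwise from the key's ring position."""
        idx = bisect_right(self._positions, ring_hash(key)) % len(self._ring)
        return self._ring[idx][1]

ring = HashRing(["node-a", "node-b", "node-c"])
print(ring.lookup("user:42"))   # one of the three nodes, fixed by the hash positions
```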

Virtual Nodes (vnodes) & Heterogeneity

  • Map each physical server to many ring positions.
  • Give more vnodes to larger machines → proportional load.
  • Smooths hot spots; enables automatic scaling (add/remove nodes with minimal reshuffle).
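
A sketch of weighted vnodes (the capacity weights, `VNODES_PER_UNIT`, and node names are illustrative): each server is hashed onto the ring many times, and key ownership then lands roughly in proportion to capacity:

```python
import hashlib
from bisect import bisect_right
from collections import Counter

def ring_hash(value: str) -> int:
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

# Illustrative capacity weights: bigger machines get proportionally more vnodes.
servers = {"node-a": 1, "node-b": 2, "node-c": 4}
VNODES_PER_UNIT = 64

ring = sorted(
    (ring_hash(f"{server}#vn{i}"), server)
    for server, weight in servers.items()
    for i in range(weight * VNODES_PER_UNIT)
)
positions = [pos for pos, _ in ring]

def owner(key: str) -> str:
    idx = bisect_right(positions, ring_hash(key)) % len(ring)
    return ring[idx][1]

# Key ownership lands roughly in the 1 : 2 : 4 capacity ratio.
print(Counter(owner(f"key-{i}") for i in range(10_000)))
```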

Data Replication

  • Replication factor N: store each item on N distinct physical servers.
  • Placement: from key’s ring position, walk clockwise, pick the first N unique servers.
  • Purpose: high availability & durability (survive node/zone failures).
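
A sketch of replica placement on a vnode ring, assuming the same MD5 ring hash as above; `preference_list` walks clockwise and skips positions whose physical server has already been chosen:

```python
import hashlib
from bisect import bisect_right

def ring_hash(value: str) -> int:
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

servers = ["node-a", "node-b", "node-c", "node-d"]
# Vnode ring: many positions map back to the same physical server.
ring = sorted((ring_hash(f"{s}#vn{i}"), s) for s in servers for i in range(16))
positions = [pos for pos, _ in ring]

def preference_list(key: str, n: int = 3) -> list[str]:
    """Walk clockwise from the key and collect the first n distinct physical servers."""
    start = bisect_right(positions, ring_hash(key))
    replicas: list[str] = []
    for step in range(len(ring)):
        server = ring[(start + step) % len(ring)][1]
        if server not in replicas:
            replicas.append(server)
        if len(replicas) == n:
            break
    return replicas

print(preference_list("user:42", n=3))   # three distinct servers from the ring
```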

Quorum Consensus

  • Parameters: N replicas, W acks to commit a write, R replies to satisfy a read.

  • Rule: if W + R > N, every read quorum overlaps every write quorum, so each read contacts at least one replica holding the latest committed write.

  • Examples (N=3):

    • W=2, R=2 → balanced latency & consistency.
    • W=1, R=3 → fast writes, heavier reads.
    • W=3, R=1 → fast reads, slow writes; lower availability during failures.
  • Add-ons: read repair (fix stale replicas during reads) and background anti-entropy repair (covered below).
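
A tiny sanity check of the quorum rule (the helper name is mine), run against the N=3 configurations listed above:

```python
def quorum_overlaps(n: int, w: int, r: int) -> bool:
    """W + R > N means any read set of R replicas shares at least one member
    with any write set of W replicas, so reads see the latest committed write."""
    return w + r > n

for w, r in [(2, 2), (1, 3), (3, 1), (1, 1)]:
    print(f"N=3, W={w}, R={r} -> overlap guaranteed: {quorum_overlaps(3, w, r)}")
# N=3, W=1, R=1 is fastest but can return stale data (1 + 1 <= 3).
```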

Coordinator Role

  • Any node receiving a client request temporarily acts as coordinator:

    1. Hash key → determine replica set.
    2. Writes: forward to replicas, wait for W ACKs, return success.
    3. Reads: query R replicas, reconcile (newest/timestamp or vector clock), return value.
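
A minimal in-process sketch of that coordinator flow, with plain dicts standing in for remote replicas and a timestamp standing in for a vector clock:

```python
import time

class Coordinator:
    """Sketch only: replicas are local dicts rather than remote nodes."""

    def __init__(self, replicas, n=3, w=2, r=2):
        self.replicas = replicas          # name -> dict acting as that replica's store
        self.n, self.w, self.r = n, w, r

    def put(self, key, value):
        version = time.time_ns()          # stand-in for a real vector clock
        acks = 0
        for store in self.replicas.values():
            store[key] = (version, value) # forward the write to each replica
            acks += 1
        return acks >= self.w             # success once at least W replicas acked

    def get(self, key):
        # Query R replicas and reconcile: newest version wins in this sketch.
        replies = [s[key] for s in list(self.replicas.values())[: self.r] if key in s]
        if not replies:
            return None
        return max(replies, key=lambda vv: vv[0])[1]

stores = {"node-a": {}, "node-b": {}, "node-c": {}}
coord = Coordinator(stores)
coord.put("cart:7", ["milk", "eggs"])
print(coord.get("cart:7"))                # ['milk', 'eggs']
```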

Consistency Models

  • Strong consistency: reads return the latest write.
  • Weak consistency: recent writes may not be visible.
  • Eventual consistency: a weak form where replicas converge given time.

Inconsistency Resolution (Versioning)

  • Why needed: With replication + AP preference, different replicas may temporarily hold conflicting versions.

  • Naive option: Last-write-wins via timestamp → simple, but may lose concurrent updates.

  • Versioning: Use vector clocks to track causal history of updates.

    • Each object carries a vector clock: (node_id, counter) pairs.
    • When updating, the coordinator increments its own counter and merges clocks.
    • On read, if one version causally descends from another → discard the older one.
    • If concurrent (no causal order) → return multiple versions to client → client/app decides merge (e.g., cart union).
  • Trade-off: vector clocks prevent silent overwrites but grow per-object metadata (in practice the clock is truncated once it exceeds a size threshold).
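
A sketch of vector-clock reconciliation (function names are mine): a clock dominates another if every counter is at least as large; dominated versions are dropped, and true concurrent siblings are returned to the client:

```python
def vc_increment(clock: dict, node: str) -> dict:
    """Coordinator bumps its own counter before writing."""
    out = dict(clock)
    out[node] = out.get(node, 0) + 1
    return out

def vc_descends(a: dict, b: dict) -> bool:
    """True if clock `a` causally descends from (or equals) clock `b`."""
    return all(a.get(node, 0) >= count for node, count in b.items())

def reconcile(versions):
    """Drop versions dominated by another; what remains is a set of concurrent siblings."""
    return [
        (clock, value) for clock, value in versions
        if not any(vc_descends(other, clock) and other != clock
                   for other, _ in versions)
    ]

v1 = ({"node-a": 1}, ["milk"])                           # original write via node-a
v2 = (vc_increment(v1[0], "node-b"), ["milk", "eggs"])   # later update via node-b
v3 = (vc_increment(v1[0], "node-c"), ["milk", "jam"])    # concurrent update via node-c

print(reconcile([v1, v2, v3]))  # v1 is discarded; v2 and v3 are siblings for the app to merge
```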

Failure Detection: Gossip

  • Each node keeps a membership list with heartbeats.
  • Periodically increment and gossip heartbeats to random peers; they relay further.
  • If a node’s heartbeat doesn’t advance for a grace period → suspect/down.
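
A sketch of that gossip loop, assuming in-process `Member` objects instead of real network peers; the fan-out of 2 and the grace period are illustrative:

```python
import random
import time

class Member:
    """Per-node membership view: heartbeat counters plus when each last advanced locally."""

    def __init__(self, node_id, peer_ids):
        self.node_id = node_id
        self.heartbeats = {nid: 0 for nid in set(peer_ids) | {node_id}}
        self.last_updated = {nid: time.monotonic() for nid in self.heartbeats}

    def tick_and_gossip(self, cluster):
        # 1. Bump our own heartbeat counter.
        self.heartbeats[self.node_id] += 1
        self.last_updated[self.node_id] = time.monotonic()
        # 2. Push the whole list to a couple of random peers, who relay it onward.
        peers = [m for nid, m in cluster.items() if nid != self.node_id]
        for peer in random.sample(peers, k=min(2, len(peers))):
            peer.merge(self.heartbeats)

    def merge(self, remote_heartbeats):
        # Adopt any counter that has advanced beyond what we already know.
        for nid, count in remote_heartbeats.items():
            if count > self.heartbeats.get(nid, -1):
                self.heartbeats[nid] = count
                self.last_updated[nid] = time.monotonic()

    def suspected_down(self, grace_seconds=10.0):
        now = time.monotonic()
        return [nid for nid, t in self.last_updated.items()
                if nid != self.node_id and now - t > grace_seconds]

ids = ["node-a", "node-b", "node-c"]
cluster = {nid: Member(nid, ids) for nid in ids}
cluster["node-a"].tick_and_gossip(cluster)   # node-a's heartbeat spreads to random peers
```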

Handling Temporary Failures

Sloppy Quorum & Hinted Handoff

  • If a replica in the key’s preference list is down, write to the next healthy node(s) (sloppy quorum) and mark entries with hints.
  • When the original node recovers, peers hand off hinted updates to it, restoring placement.
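
A sketch of sloppy-quorum writes with hinted handoff, using toy `Node` objects; each hint records which replica the write was originally intended for:

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.up = True
        self.data = {}        # key -> value
        self.hints = []       # (intended_owner, key, value) held for a down replica

def sloppy_write(key, value, preference_list, fallback_nodes):
    """Write to healthy preferred replicas; divert to a fallback with a hint otherwise."""
    fallbacks = iter(fallback_nodes)
    for replica in preference_list:
        if replica.up:
            replica.data[key] = value
        else:
            stand_in = next(fallbacks)
            stand_in.hints.append((replica.name, key, value))

def handoff(stand_in, recovered):
    """When the original replica recovers, push its hinted writes over and drop them."""
    remaining = []
    for owner, key, value in stand_in.hints:
        if owner == recovered.name and recovered.up:
            recovered.data[key] = value
        else:
            remaining.append((owner, key, value))
    stand_in.hints = remaining

a, b, c, d = Node("a"), Node("b"), Node("c"), Node("d")
b.up = False
sloppy_write("user:42", "v1", preference_list=[a, b, c], fallback_nodes=[d])
b.up = True
handoff(d, b)
print(b.data)   # {'user:42': 'v1'} -- placement restored
```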

Handling Permanent Failures

Anti-Entropy with Merkle Trees

  • Background process compares replicas using Merkle trees:

    • Internal nodes store hashes of children; compare roots to detect divergence.
    • Drill down only into mismatched subtrees → minimize data transfer.
  • Out-of-date ranges are synchronized to the newest version.
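
A sketch of the comparison step, assuming a fixed power-of-two number of key-range buckets; only buckets whose leaf hashes differ would need to be streamed between replicas:

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def bucket_index(key: str, num_buckets: int) -> int:
    # Stand-in for real key-range partitioning of the ring.
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % num_buckets

def leaf_hashes(store: dict, num_buckets: int = 8) -> list[bytes]:
    """Leaf level: hash the sorted key/value pairs that land in each bucket."""
    chunks = [b""] * num_buckets
    for key in sorted(store):
        chunks[bucket_index(key, num_buckets)] += f"{key}={store[key]};".encode()
    return [h(c) for c in chunks]

def merkle_root(level: list[bytes]) -> bytes:
    """Internal nodes hash their children pairwise (assumes a power-of-two leaf count)."""
    while len(level) > 1:
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

def diverging_buckets(replica_a: dict, replica_b: dict) -> list[int]:
    la, lb = leaf_hashes(replica_a), leaf_hashes(replica_b)
    if merkle_root(la) == merkle_root(lb):
        return []                       # roots match: nothing to transfer
    return [i for i, (x, y) in enumerate(zip(la, lb)) if x != y]

a = {"k1": "v1", "k2": "v2", "k3": "v3"}
b = {"k1": "v1", "k2": "STALE", "k3": "v3"}
print(diverging_buckets(a, b))          # only mismatched bucket(s) need syncing
```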

Write Path (LSM-Tree)

  1. Append to Commit Log (sequential disk write; crash-replay for durability).
  2. Write to in-memory Memtable (sorted structure).
  3. When Memtable is full → flush to immutable, sorted on-disk file: SSTable.
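
A sketch of that write path, assuming a local directory for the commit log and JSON files as stand-ins for SSTables (the paths and the 4-entry memtable limit are illustrative):

```python
import json
import os

class LSMWriter:
    """Sketch of the write path: append to a commit log, update the memtable, flush SSTables."""

    def __init__(self, data_dir, memtable_limit=4):
        self.data_dir = data_dir
        self.memtable_limit = memtable_limit
        self.memtable = {}                    # in-memory; sorted at flush time
        self.sstable_seq = 0
        os.makedirs(data_dir, exist_ok=True)
        self.commit_log = open(os.path.join(data_dir, "commit.log"), "a")

    def put(self, key, value):
        # 1. Durable, sequential append to the commit log (replayed after a crash).
        self.commit_log.write(json.dumps({"k": key, "v": value}) + "\n")
        self.commit_log.flush()
        # 2. Update the in-memory memtable.
        self.memtable[key] = value
        # 3. Flush to an immutable, sorted on-disk file once the memtable is full.
        if len(self.memtable) >= self.memtable_limit:
            self._flush()

    def _flush(self):
        path = os.path.join(self.data_dir, f"sstable-{self.sstable_seq:05d}.json")
        with open(path, "w") as f:
            json.dump(dict(sorted(self.memtable.items())), f)
        self.sstable_seq += 1
        self.memtable.clear()
        # A real store would also rotate or truncate the commit log here.

store = LSMWriter("/tmp/kv-demo")
for i in range(5):
    store.put(f"key-{i}", f"value-{i}")   # the fourth put fills the memtable and triggers a flush
```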

Read Path

  1. Check Memtable / row cache.
  2. If miss, consult Bloom filters to skip SSTables that cannot contain the key.
  3. Read from candidate SSTables (and/or page cache), merge results (newest wins).
  4. Optionally perform read repair to update stale replicas.
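
A sketch of the read path with a toy Bloom filter (the bit-array size and hash count are illustrative); a Bloom filter can report false positives but never false negatives, so skipping an SSTable on a negative answer is safe:

```python
import hashlib

class BloomFilter:
    """Tiny Bloom filter: may answer "maybe present", never misses a key that was added."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size, self.k = size_bits, num_hashes
        self.bits = 0                     # integer used as a bit array

    def _positions(self, key):
        for i in range(self.k):
            digest = hashlib.md5(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= (1 << pos)

    def might_contain(self, key):
        return all(self.bits & (1 << pos) for pos in self._positions(key))

def read(key, memtable, sstables):
    """sstables: newest-first list of (bloom_filter, table_dict) pairs."""
    if key in memtable:                       # 1. memtable / row cache
        return memtable[key]
    for bloom, table in sstables:             # newest first, so the first hit wins
        if not bloom.might_contain(key):      # 2. Bloom filter skips impossible tables
            continue
        if key in table:                      # 3. read the candidate SSTable
            return table[key]
    return None

mem = {"hot-key": "in-memtable"}
bloom = BloomFilter()
bloom.add("cold-key")
sstables = [(bloom, {"cold-key": "on-disk"})]
print(read("hot-key", mem, sstables), read("cold-key", mem, sstables), read("missing", mem, sstables))
```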

Summary Table — Goal ↔︎ Technique

Goal                            Technique(s)
Even data & easy scaling        Consistent hashing, vnodes
High availability               Replication (N), sloppy quorum, hinted handoff
Tunable consistency             R/W/N quorums, per-op consistency levels
Detect/repair divergence        Gossip, read repair, anti-entropy (Merkle trees)
Durable & fast writes           Commit log + memtable (LSM)
Efficient reads                 Bloom filters, compaction, caching
Handle heterogeneous servers    Vnodes proportional to capacity