Chapter 6: Design a Key-Value Store
Problem & Scope
- APIs: `put(key, value)`, `get(key)` (and often `delete(key)`).
- Assumptions/targets:
  - Small item size (≤ ~10 KB), huge dataset, low latency.
  - High availability, horizontal scalability, automatic scaling, tunable consistency.
- Trade-offs: read vs write cost, memory vs disk, consistency vs availability (CAP).
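A minimal sketch of this API as a single-node, in-memory class; the class name and the use of `bytes` values are illustrative assumptions, not part of the chapter:

```python
class KeyValueStore:
    """Illustrative single-node view of the client-facing API."""

    def __init__(self) -> None:
        self._data: dict[str, bytes] = {}

    def put(self, key: str, value: bytes) -> None:
        """Insert or overwrite the value for key."""
        self._data[key] = value

    def get(self, key: str) -> bytes | None:
        """Return the value for key, or None if absent."""
        return self._data.get(key)

    def delete(self, key: str) -> None:
        """Remove key if present."""
        self._data.pop(key, None)
```

The rest of the chapter is about making this simple interface distributed, available, and tunable.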
CAP Theorem (framing)
- Consistency: all clients see the same data at the same time.
- Availability: every request receives a response (even during failures).
- Partition tolerance: the system continues to operate despite network partitions.
- In real distributed systems, P is non-negotiable → choose CP (favor C) or AP (favor A).
  - CP example: a bank balance during a partition → may reject requests to keep C.
  - AP example: social counters → serve stale data; converge later (eventual consistency).
Data Partitioning
Consistent Hashing (ring)
- Place nodes on a hash ring; place keys on the same ring.
- A key lives on the first node clockwise from its hash position.
- Benefits: minimal key movement when nodes join/leave; even spread with tuning.
Virtual Nodes (vnodes) & Heterogeneity
- Map each physical server to many ring positions.
- Give more vnodes to larger machines → proportional load.
- Smooths hot spots; enables automatic scaling (add/remove nodes with minimal reshuffle).
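A possible sketch of a consistent-hash ring with virtual nodes, assuming MD5-derived positions and an arbitrary default of 100 vnodes per server (both choices are illustrative, not prescribed by the chapter):

```python
import bisect
import hashlib

def _hash(key: str) -> int:
    """Map a string to a 64-bit position on the ring."""
    return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")

class ConsistentHashRing:
    """Hash ring with virtual nodes; vnode count can reflect server capacity."""

    def __init__(self) -> None:
        self._ring: list[tuple[int, str]] = []  # sorted (position, server) pairs

    def add_server(self, server: str, vnodes: int = 100) -> None:
        # A bigger machine gets more vnodes and hence a larger share of keys.
        for i in range(vnodes):
            bisect.insort(self._ring, (_hash(f"{server}#{i}"), server))

    def remove_server(self, server: str) -> None:
        self._ring = [(pos, s) for pos, s in self._ring if s != server]

    def owner(self, key: str) -> str:
        """Return the first server clockwise from the key's hash position."""
        if not self._ring:
            raise RuntimeError("no servers on the ring")
        positions = [pos for pos, _ in self._ring]
        idx = bisect.bisect_right(positions, _hash(key)) % len(self._ring)
        return self._ring[idx][1]
```

Adding or removing a server only remaps keys whose nearest clockwise vnode changed, which is the "minimal key movement" property noted above.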
Data Replication
- Replication factor N: store each item on N distinct physical servers.
- Placement: from key’s ring position, walk clockwise, pick the first N unique servers.
- Purpose: high availability & durability (survive node/zone failures).
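Building on the ring sketch above (reusing its `_hash` helper and `bisect` import), one way to derive the replica set is to keep walking clockwise past vnodes until N distinct physical servers have been collected; the subclass and method name are illustrative:

```python
class ReplicatedRing(ConsistentHashRing):
    """Extends the previous sketch with replica placement."""

    def preference_list(self, key: str, n: int) -> list[str]:
        """Walk clockwise from the key's position until N distinct servers are found."""
        if not self._ring:
            return []
        positions = [pos for pos, _ in self._ring]
        start = bisect.bisect_right(positions, _hash(key))
        servers: list[str] = []
        for step in range(len(self._ring)):
            _, server = self._ring[(start + step) % len(self._ring)]
            if server not in servers:   # skip vnodes of servers already chosen
                servers.append(server)
            if len(servers) == n:
                break
        return servers
```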
Quorum Consensus
- Parameters: N replicas, W acknowledgements to commit a write, R replies to satisfy a read.
- Rule: if W + R > N, a read is guaranteed to intersect the latest write.
- Examples (N = 3):
  - W=2, R=2 → balanced latency and consistency.
  - W=1, R=3 → fast writes, heavier reads.
  - W=3, R=1 → fast reads, slow writes; lower write availability during failures.
- Add-ons: read repair (fix stale replicas on reads), background write-back/repair.
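The overlap rule is just arithmetic; a tiny helper (function name is illustrative) makes the trade-offs above concrete:

```python
def quorum_overlaps(n: int, w: int, r: int) -> bool:
    """True if every read quorum must intersect the most recent write quorum."""
    return w + r > n

# With N = 3 replicas:
assert quorum_overlaps(3, 2, 2)      # balanced: reads see the latest committed write
assert quorum_overlaps(3, 1, 3)      # fast writes, but every read must reach all replicas
assert quorum_overlaps(3, 3, 1)      # fast reads, but one down replica blocks writes
assert not quorum_overlaps(3, 1, 1)  # W + R <= N: a read may miss the latest write
```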
Coordinator Role
- Any node receiving a client request temporarily acts as the coordinator:
  - Hash the key → determine the replica set (preference list).
  - Writes: forward to replicas, wait for W ACKs, return success.
  - Reads: query R replicas, reconcile (newest timestamp or vector clock), return the value.
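A simplified sketch of the coordinator flow, assuming in-process "replicas", sequential calls instead of parallel RPCs, and last-write-wins timestamps for reconciliation (vector clocks are covered below); all class names are illustrative:

```python
import time
from dataclasses import dataclass

@dataclass
class Versioned:
    value: bytes
    timestamp: float  # last-write-wins; a vector clock would go here instead

class Replica:
    """Stand-in for a remote storage node; a real system would use RPCs."""

    def __init__(self) -> None:
        self._store: dict[str, Versioned] = {}

    def write(self, key: str, item: Versioned) -> bool:
        self._store[key] = item
        return True  # ACK

    def read(self, key: str) -> Versioned | None:
        return self._store.get(key)

class Coordinator:
    """Any node can play this role for a given request."""

    def __init__(self, replicas: list[Replica], w: int, r: int) -> None:
        self.replicas, self.w, self.r = replicas, w, r

    def put(self, key: str, value: bytes) -> bool:
        item = Versioned(value, time.time())
        acks = sum(1 for rep in self.replicas if rep.write(key, item))
        return acks >= self.w  # success once at least W replicas acknowledge

    def get(self, key: str) -> bytes | None:
        replies = [rep.read(key) for rep in self.replicas[: self.r]]
        versions = [v for v in replies if v is not None]
        if not versions:
            return None
        return max(versions, key=lambda v: v.timestamp).value  # newest wins
```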
Consistency Models
- Strong consistency: reads return the latest write.
- Weak consistency: recent writes may not be visible.
- Eventual consistency: a specific form of weak consistency where, given enough time, all replicas converge.
Inconsistency Resolution (Versioning)
- Why needed: with replication and an AP preference, different replicas may temporarily hold conflicting versions.
- Naive option: last-write-wins via timestamp → simple, but may silently lose concurrent updates.
- Versioning: use vector clocks to track the causal history of updates.
  - Each object carries a vector clock: a set of (node_id, counter) pairs.
  - When handling an update, the coordinator increments its own counter and merges clocks.
  - On read, if versions are causally related → discard the older one.
  - If versions are concurrent (no causal order) → return multiple versions to the client → the client/app decides how to merge (e.g., cart union), as in the sketch below.
- Trade-off: vector clocks prevent silent overwrites but increase metadata size (bounded in practice).
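A small sketch of vector-clock bookkeeping, representing a clock as a dict of node_id → counter; the function names and node names (Sx, Sy, Sz) are illustrative:

```python
def vc_increment(clock: dict[str, int], node_id: str) -> dict[str, int]:
    """The coordinator bumps its own counter when it handles a write."""
    updated = dict(clock)
    updated[node_id] = updated.get(node_id, 0) + 1
    return updated

def vc_descends(a: dict[str, int], b: dict[str, int]) -> bool:
    """True if version a causally descends from (or equals) version b."""
    return all(a.get(node, 0) >= count for node, count in b.items())

def vc_resolve(a: dict[str, int], b: dict[str, int]) -> str:
    if vc_descends(a, b):
        return "keep a"       # b is an ancestor and can be discarded
    if vc_descends(b, a):
        return "keep b"       # a is an ancestor and can be discarded
    return "conflict"         # concurrent: return both versions to the client

# Two coordinators update the same object independently from a common ancestor.
base = {"Sx": 1}
v1 = vc_increment(base, "Sy")             # {"Sx": 1, "Sy": 1}
v2 = vc_increment(base, "Sz")             # {"Sx": 1, "Sz": 1}
assert vc_resolve(v1, v2) == "conflict"   # client merges, e.g. union of cart items
```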
Failure Detection: Gossip
- Each node keeps a membership list with heartbeats.
- Periodically increment and gossip heartbeats to random peers; they relay further.
- If a node’s heartbeat doesn’t advance for a grace period → suspect/down.
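A rough sketch of the per-node gossip state, assuming an arbitrary 10-second grace period and a fanout of three random peers per round (both numbers are illustrative):

```python
import random
import time

class Member:
    """One node's view of a peer: heartbeat counter and when it last advanced."""

    def __init__(self) -> None:
        self.heartbeat = 0
        self.last_updated = time.monotonic()

class GossipNode:
    FAIL_AFTER = 10.0  # grace period without heartbeat progress before suspecting a peer

    def __init__(self, node_id: str, peers: list[str]) -> None:
        self.node_id = node_id
        self.members = {p: Member() for p in peers + [node_id]}

    def tick(self) -> dict[str, int]:
        """Bump own heartbeat and return a digest to gossip to random peers."""
        self.members[self.node_id].heartbeat += 1
        self.members[self.node_id].last_updated = time.monotonic()
        return {p: m.heartbeat for p, m in self.members.items()}

    def merge(self, digest: dict[str, int]) -> None:
        """Apply a digest received from a peer; higher heartbeats win."""
        for peer, hb in digest.items():
            member = self.members.setdefault(peer, Member())
            if hb > member.heartbeat:
                member.heartbeat = hb
                member.last_updated = time.monotonic()

    def suspected_down(self) -> list[str]:
        now = time.monotonic()
        return [p for p, m in self.members.items()
                if p != self.node_id and now - m.last_updated > self.FAIL_AFTER]

    def gossip_targets(self, fanout: int = 3) -> list[str]:
        peers = [p for p in self.members if p != self.node_id]
        return random.sample(peers, min(fanout, len(peers)))
```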
Handling Temporary Failures
Sloppy Quorum & Hinted Handoff
- If a replica in the key’s preference list is down, write to the next healthy node(s) (sloppy quorum) and mark entries with hints.
- When the original node recovers, peers hand off hinted updates to it, restoring placement.
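A minimal sketch of hinted handoff: the stand-in node buffers writes keyed by their intended home and replays them on recovery. The `send` callback stands in for an RPC and is an assumption of this sketch:

```python
from collections import defaultdict
from collections.abc import Callable

class HintedHandoff:
    """Buffer of writes accepted on behalf of a temporarily unreachable replica."""

    def __init__(self) -> None:
        # hints[intended_node] -> writes stored here as a stand-in for that node
        self.hints: dict[str, list[tuple[str, bytes]]] = defaultdict(list)

    def store_with_hint(self, intended_node: str, key: str, value: bytes) -> None:
        """Sloppy quorum: accept the write locally, remembering its real home."""
        self.hints[intended_node].append((key, value))

    def handoff(self, recovered_node: str,
                send: Callable[[str, str, bytes], None]) -> None:
        """When the intended replica is healthy again, replay its hinted writes."""
        for key, value in self.hints.pop(recovered_node, []):
            send(recovered_node, key, value)
```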
Handling Permanent Failures
Anti-Entropy with Merkle Trees
- A background process compares replicas using Merkle trees:
  - Internal nodes store hashes of their children; compare roots to detect divergence.
  - Drill down only into mismatched subtrees → minimize data transfer.
- Out-of-date key ranges are then synchronized to the newest version (see the sketch below).
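A compact sketch of Merkle-tree comparison, assuming both replicas split the key space into the same buckets in the same order; it returns the bucket indices whose key ranges need syncing (function names are illustrative):

```python
import hashlib

def _h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def build_merkle(buckets: list[bytes]) -> list[list[bytes]]:
    """Return tree levels bottom-up: level 0 = bucket hashes, last level = [root]."""
    level = [_h(b) for b in buckets]
    levels = [level]
    while len(level) > 1:
        pairs = [level[i:i + 2] for i in range(0, len(level), 2)]
        level = [_h(b"".join(pair)) for pair in pairs]
        levels.append(level)
    return levels

def mismatched_buckets(a: list[list[bytes]], b: list[list[bytes]]) -> list[int]:
    """Compare two trees top-down, descending only into mismatched subtrees."""
    if a[-1] == b[-1]:
        return []                 # roots match: replicas agree, nothing to sync
    suspects = [0]                # indices of differing nodes at the current level
    for depth in range(len(a) - 1, 0, -1):
        children = []
        for idx in suspects:
            for child in (2 * idx, 2 * idx + 1):
                if child < len(a[depth - 1]) and a[depth - 1][child] != b[depth - 1][child]:
                    children.append(child)
        suspects = children
    return suspects               # bucket indices whose key ranges need syncing

# Example: replicas agree on every bucket except index 2.
replica_a = build_merkle([b"k1:v1", b"k2:v2", b"k3:v3", b"k4:v4"])
replica_b = build_merkle([b"k1:v1", b"k2:v2", b"k3:STALE", b"k4:v4"])
assert mismatched_buckets(replica_a, replica_b) == [2]
```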
Write Path (LSM-Tree)
- Append to Commit Log (sequential disk write; crash-replay for durability).
- Write to in-memory Memtable (sorted structure).
- When Memtable is full → flush to immutable, sorted on-disk file: SSTable.
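A toy sketch of this write path, using JSON files for the commit log and SSTables and a plain dict for the memtable (a real memtable keeps keys sorted, e.g. via a skip list); the tiny flush threshold is only to make the flush easy to exercise:

```python
import json
import os

class LSMStore:
    """Write path sketch: commit log first, then memtable, flush to an SSTable."""

    MEMTABLE_LIMIT = 4  # illustrative threshold; real systems flush at tens of MB

    def __init__(self, data_dir: str) -> None:
        self.data_dir = data_dir
        os.makedirs(data_dir, exist_ok=True)
        self.commit_log = open(os.path.join(data_dir, "commit.log"), "a")
        self.memtable: dict[str, str] = {}
        self.sstable_count = 0

    def put(self, key: str, value: str) -> None:
        # 1. Append to the commit log (sequential write; replayed after a crash).
        self.commit_log.write(json.dumps({"k": key, "v": value}) + "\n")
        self.commit_log.flush()
        # 2. Apply to the in-memory memtable.
        self.memtable[key] = value
        # 3. When the memtable is full, flush it to an immutable, sorted SSTable.
        if len(self.memtable) >= self.MEMTABLE_LIMIT:
            self._flush()

    def _flush(self) -> None:
        path = os.path.join(self.data_dir, f"sstable-{self.sstable_count:06d}.json")
        with open(path, "w") as f:
            json.dump(dict(sorted(self.memtable.items())), f)  # sorted by key
        self.sstable_count += 1
        self.memtable.clear()
```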
Read Path
- Check Memtable / row cache.
- If miss, consult Bloom filters to skip SSTables that cannot contain the key.
- Read from candidate SSTables (and/or page cache), merge results (newest wins).
- Optionally perform read repair to update stale replicas.
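A sketch of this read path with a simple Bloom filter (bit-set size and hash count are arbitrary); an SSTable is modeled here as a (filter, dict) pair, and read repair is omitted:

```python
import hashlib

class BloomFilter:
    """Probabilistic membership test: no false negatives, occasional false positives."""

    def __init__(self, size_bits: int = 1024, num_hashes: int = 3) -> None:
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = 0  # bit set packed into a Python int

    def _positions(self, key: str):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key: str) -> bool:
        return all(self.bits & (1 << pos) for pos in self._positions(key))

def read(key: str, memtable: dict, sstables: list) -> str | None:
    """Memtable first; then only the SSTables whose Bloom filter may contain the key."""
    if key in memtable:
        return memtable[key]
    for bloom, table in reversed(sstables):  # newest SSTable first, so latest value wins
        if bloom.might_contain(key) and key in table:
            return table[key]
    return None
```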
Summary Table — Goal ↔ Technique

| Goal | Technique(s) |
|---|---|
| Even data distribution & easy scaling | Consistent hashing, vnodes |
| High availability | Replication (N), sloppy quorum, hinted handoff |
| Tunable consistency | R/W/N quorums, per-operation consistency levels |
| Detect/repair divergence | Gossip, read repair, anti-entropy (Merkle trees) |
| Durable & fast writes | Commit log + memtable (LSM) |
| Efficient reads | Bloom filters, compaction, caching |
| Handle heterogeneous servers | Vnodes proportional to capacity |