Chapter 6: Design a Key-Value Store
Problem & Scope
- APIs: `put(key, value)`, `get(key)` (and often `delete(key)`).
- Assumptions/targets:
  - Small item size (≤ ~10 KB), huge dataset, low latency.
  - High availability, horizontal scalability, automatic scaling, tunable consistency.
- Trade-offs: read vs write cost, memory vs disk, consistency vs availability (CAP).
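A minimal sketch of this API as a single-node, in-memory class; the class name and the use of `bytes` values are illustrative assumptions, not part of the chapter:

```python
class KeyValueStore:
    """Illustrative single-node view of the client-facing API."""

    def __init__(self) -> None:
        self._data: dict[str, bytes] = {}

    def put(self, key: str, value: bytes) -> None:
        """Insert or overwrite the value for key."""
        self._data[key] = value

    def get(self, key: str) -> bytes | None:
        """Return the value for key, or None if absent."""
        return self._data.get(key)

    def delete(self, key: str) -> None:
        """Remove key if present."""
        self._data.pop(key, None)
```

The rest of the chapter is about making this simple interface distributed, available, and tunable.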
CAP Theorem (framing)
- Consistency: all clients see the same data at the same time.
- Availability: every request receives a response (even during failures).
- Partition tolerance: the system continues to operate despite network partitions.
- In real distributed systems, P is non-negotiable → choose CP (favor C) or AP (favor A).
  - CP example: a bank balance during a partition → may reject requests to keep C.
  - AP example: social counters → serve stale data; converge later (eventual consistency).
Data Partitioning
Consistent Hashing (ring)
- Place nodes on a hash ring; place keys on the same ring.
- A key lives on the first node clockwise from its hash position.
- Benefits: minimal key movement when nodes join/leave; even spread with tuning.
Virtual Nodes (vnodes) & Heterogeneity
- Map each physical server to many ring positions.
- Give more vnodes to larger machines → proportional load.
- Smooths hot spots; enables automatic scaling (add/remove nodes with minimal reshuffle).
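A possible sketch of a consistent-hash ring with virtual nodes, assuming MD5-derived positions and an arbitrary default of 100 vnodes per server (both choices are illustrative, not prescribed by the chapter):

```python
import bisect
import hashlib

def _hash(key: str) -> int:
    """Map a string to a 64-bit position on the ring."""
    return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")

class ConsistentHashRing:
    """Hash ring with virtual nodes; vnode count can reflect server capacity."""

    def __init__(self) -> None:
        self._ring: list[tuple[int, str]] = []  # sorted (position, server) pairs

    def add_server(self, server: str, vnodes: int = 100) -> None:
        # A bigger machine gets more vnodes and hence a larger share of keys.
        for i in range(vnodes):
            bisect.insort(self._ring, (_hash(f"{server}#{i}"), server))

    def remove_server(self, server: str) -> None:
        self._ring = [(pos, s) for pos, s in self._ring if s != server]

    def owner(self, key: str) -> str:
        """Return the first server clockwise from the key's hash position."""
        if not self._ring:
            raise RuntimeError("no servers on the ring")
        positions = [pos for pos, _ in self._ring]
        idx = bisect.bisect_right(positions, _hash(key)) % len(self._ring)
        return self._ring[idx][1]
```

Adding or removing a server only remaps keys whose nearest clockwise vnode changed, which is the "minimal key movement" property noted above.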
Data Replication
- Replication factor N: store each item on N distinct physical servers.
- Placement: from key’s ring position, walk clockwise, pick the first N unique servers.
- Purpose: high availability & durability (survive node/zone failures).
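Building on the ring sketch above (reusing its `_hash` helper and `bisect` import), one way to derive the replica set is to keep walking clockwise past vnodes until N distinct physical servers have been collected; the subclass and method name are illustrative:

```python
class ReplicatedRing(ConsistentHashRing):
    """Extends the previous sketch with replica placement."""

    def preference_list(self, key: str, n: int) -> list[str]:
        """Walk clockwise from the key's position until N distinct servers are found."""
        if not self._ring:
            return []
        positions = [pos for pos, _ in self._ring]
        start = bisect.bisect_right(positions, _hash(key))
        servers: list[str] = []
        for step in range(len(self._ring)):
            _, server = self._ring[(start + step) % len(self._ring)]
            if server not in servers:   # skip vnodes of servers already chosen
                servers.append(server)
            if len(servers) == n:
                break
        return servers
```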
Quorum Consensus
- Parameters: N replicas, W acknowledgements to commit a write, R replies to satisfy a read.
- Rule: if W + R > N, a read is guaranteed to intersect the latest write.
- Examples (N = 3):
  - W=2, R=2 → balanced latency and consistency.
  - W=1, R=3 → fast writes, heavier reads.
  - W=3, R=1 → fast reads, slow writes; lower write availability during failures.
- Add-ons: read repair (fix stale replicas on reads), background write-back/repair.
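The overlap rule is just arithmetic; a tiny helper (function name is illustrative) makes the trade-offs above concrete:

```python
def quorum_overlaps(n: int, w: int, r: int) -> bool:
    """True if every read quorum must intersect the most recent write quorum."""
    return w + r > n

# With N = 3 replicas:
assert quorum_overlaps(3, 2, 2)      # balanced: reads see the latest committed write
assert quorum_overlaps(3, 1, 3)      # fast writes, but every read must reach all replicas
assert quorum_overlaps(3, 3, 1)      # fast reads, but one down replica blocks writes
assert not quorum_overlaps(3, 1, 1)  # W + R <= N: a read may miss the latest write
```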
Coordinator Role
- Any node receiving a client request temporarily acts as the coordinator:
  - Hash the key → determine the replica set (preference list).
  - Writes: forward to replicas, wait for W ACKs, return success.
  - Reads: query R replicas, reconcile (newest timestamp or vector clock), return the value.
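A simplified sketch of the coordinator flow, assuming in-process "replicas", sequential calls instead of parallel RPCs, and last-write-wins timestamps for reconciliation (vector clocks are covered below); all class names are illustrative:

```python
import time
from dataclasses import dataclass

@dataclass
class Versioned:
    value: bytes
    timestamp: float  # last-write-wins; a vector clock would go here instead

class Replica:
    """Stand-in for a remote storage node; a real system would use RPCs."""

    def __init__(self) -> None:
        self._store: dict[str, Versioned] = {}

    def write(self, key: str, item: Versioned) -> bool:
        self._store[key] = item
        return True  # ACK

    def read(self, key: str) -> Versioned | None:
        return self._store.get(key)

class Coordinator:
    """Any node can play this role for a given request."""

    def __init__(self, replicas: list[Replica], w: int, r: int) -> None:
        self.replicas, self.w, self.r = replicas, w, r

    def put(self, key: str, value: bytes) -> bool:
        item = Versioned(value, time.time())
        acks = sum(1 for rep in self.replicas if rep.write(key, item))
        return acks >= self.w  # success once at least W replicas acknowledge

    def get(self, key: str) -> bytes | None:
        replies = [rep.read(key) for rep in self.replicas[: self.r]]
        versions = [v for v in replies if v is not None]
        if not versions:
            return None
        return max(versions, key=lambda v: v.timestamp).value  # newest wins
```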
Consistency Models
- Strong consistency: reads return the latest write.
- Weak consistency: recent writes may not be visible.
- Eventual consistency: a specific form of weak consistency where, given enough time, all replicas converge.
Inconsistency Resolution (Versioning)
- Why needed: with replication and an AP preference, different replicas may temporarily hold conflicting versions.
- Naive option: last-write-wins via timestamp → simple, but may silently lose concurrent updates.
- Versioning: use vector clocks to track the causal history of updates.
  - Each object carries a vector clock: a set of (node_id, counter) pairs.
  - When handling an update, the coordinator increments its own counter and merges clocks.
  - On read, if versions are causally related → discard the older one.
  - If versions are concurrent (no causal order) → return multiple versions to the client → the client/app decides how to merge (e.g., cart union), as in the sketch below.
- Trade-off: vector clocks prevent silent overwrites but increase metadata size (bounded in practice).
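A small sketch of vector-clock bookkeeping, representing a clock as a dict of node_id → counter; the function names and node names (Sx, Sy, Sz) are illustrative:

```python
def vc_increment(clock: dict[str, int], node_id: str) -> dict[str, int]:
    """The coordinator bumps its own counter when it handles a write."""
    updated = dict(clock)
    updated[node_id] = updated.get(node_id, 0) + 1
    return updated

def vc_descends(a: dict[str, int], b: dict[str, int]) -> bool:
    """True if version a causally descends from (or equals) version b."""
    return all(a.get(node, 0) >= count for node, count in b.items())

def vc_resolve(a: dict[str, int], b: dict[str, int]) -> str:
    if vc_descends(a, b):
        return "keep a"       # b is an ancestor and can be discarded
    if vc_descends(b, a):
        return "keep b"       # a is an ancestor and can be discarded
    return "conflict"         # concurrent: return both versions to the client

# Two coordinators update the same object independently from a common ancestor.
base = {"Sx": 1}
v1 = vc_increment(base, "Sy")             # {"Sx": 1, "Sy": 1}
v2 = vc_increment(base, "Sz")             # {"Sx": 1, "Sz": 1}
assert vc_resolve(v1, v2) == "conflict"   # client merges, e.g. union of cart items
```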
Failure Detection: Gossip
- Each node keeps a membership list with heartbeats.
- Periodically increment and gossip heartbeats to random peers; they relay further.
- If a node’s heartbeat doesn’t advance for a grace period → suspect/down.
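A rough sketch of the per-node gossip state, assuming an arbitrary 10-second grace period and a fanout of three random peers per round (both numbers are illustrative):

```python
import random
import time

class Member:
    """One node's view of a peer: heartbeat counter and when it last advanced."""

    def __init__(self) -> None:
        self.heartbeat = 0
        self.last_updated = time.monotonic()

class GossipNode:
    FAIL_AFTER = 10.0  # grace period without heartbeat progress before suspecting a peer

    def __init__(self, node_id: str, peers: list[str]) -> None:
        self.node_id = node_id
        self.members = {p: Member() for p in peers + [node_id]}

    def tick(self) -> dict[str, int]:
        """Bump own heartbeat and return a digest to gossip to random peers."""
        self.members[self.node_id].heartbeat += 1
        self.members[self.node_id].last_updated = time.monotonic()
        return {p: m.heartbeat for p, m in self.members.items()}

    def merge(self, digest: dict[str, int]) -> None:
        """Apply a digest received from a peer; higher heartbeats win."""
        for peer, hb in digest.items():
            member = self.members.setdefault(peer, Member())
            if hb > member.heartbeat:
                member.heartbeat = hb
                member.last_updated = time.monotonic()

    def suspected_down(self) -> list[str]:
        now = time.monotonic()
        return [p for p, m in self.members.items()
                if p != self.node_id and now - m.last_updated > self.FAIL_AFTER]

    def gossip_targets(self, fanout: int = 3) -> list[str]:
        peers = [p for p in self.members if p != self.node_id]
        return random.sample(peers, min(fanout, len(peers)))
```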
Handling Temporary Failures
Sloppy Quorum & Hinted Handoff
- If a replica in the key’s preference list is down, write to the next healthy node(s) (sloppy quorum) and mark entries with hints.
- When the original node recovers, peers hand off hinted updates to it, restoring placement.
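A minimal sketch of hinted handoff: the stand-in node buffers writes keyed by their intended home and replays them on recovery. The `send` callback stands in for an RPC and is an assumption of this sketch:

```python
from collections import defaultdict
from collections.abc import Callable

class HintedHandoff:
    """Buffer of writes accepted on behalf of a temporarily unreachable replica."""

    def __init__(self) -> None:
        # hints[intended_node] -> writes stored here as a stand-in for that node
        self.hints: dict[str, list[tuple[str, bytes]]] = defaultdict(list)

    def store_with_hint(self, intended_node: str, key: str, value: bytes) -> None:
        """Sloppy quorum: accept the write locally, remembering its real home."""
        self.hints[intended_node].append((key, value))

    def handoff(self, recovered_node: str,
                send: Callable[[str, str, bytes], None]) -> None:
        """When the intended replica is healthy again, replay its hinted writes."""
        for key, value in self.hints.pop(recovered_node, []):
            send(recovered_node, key, value)
```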
Handling Permanent Failures
Anti-Entropy with Merkle Trees
- A background process compares replicas using Merkle trees:
  - Internal nodes store hashes of their children; compare roots to detect divergence.
  - Drill down only into mismatched subtrees → minimize data transfer.
- Out-of-date key ranges are then synchronized to the newest version (see the sketch below).
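A compact sketch of Merkle-tree comparison, assuming both replicas split the key space into the same buckets in the same order; it returns the bucket indices whose key ranges need syncing (function names are illustrative):

```python
import hashlib

def _h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def build_merkle(buckets: list[bytes]) -> list[list[bytes]]:
    """Return tree levels bottom-up: level 0 = bucket hashes, last level = [root]."""
    level = [_h(b) for b in buckets]
    levels = [level]
    while len(level) > 1:
        pairs = [level[i:i + 2] for i in range(0, len(level), 2)]
        level = [_h(b"".join(pair)) for pair in pairs]
        levels.append(level)
    return levels

def mismatched_buckets(a: list[list[bytes]], b: list[list[bytes]]) -> list[int]:
    """Compare two trees top-down, descending only into mismatched subtrees."""
    if a[-1] == b[-1]:
        return []                 # roots match: replicas agree, nothing to sync
    suspects = [0]                # indices of differing nodes at the current level
    for depth in range(len(a) - 1, 0, -1):
        children = []
        for idx in suspects:
            for child in (2 * idx, 2 * idx + 1):
                if child < len(a[depth - 1]) and a[depth - 1][child] != b[depth - 1][child]:
                    children.append(child)
        suspects = children
    return suspects               # bucket indices whose key ranges need syncing

# Example: replicas agree on every bucket except index 2.
replica_a = build_merkle([b"k1:v1", b"k2:v2", b"k3:v3", b"k4:v4"])
replica_b = build_merkle([b"k1:v1", b"k2:v2", b"k3:STALE", b"k4:v4"])
assert mismatched_buckets(replica_a, replica_b) == [2]
```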
Write Path (LSM-Tree)
- Append to Commit Log (sequential disk write; crash-replay for durability).
- Write to in-memory Memtable (sorted structure).
- When Memtable is full → flush to immutable, sorted on-disk file: SSTable.
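A toy sketch of this write path, using JSON files for the commit log and SSTables and a plain dict for the memtable (a real memtable keeps keys sorted, e.g. via a skip list); the tiny flush threshold is only to make the flush easy to exercise:

```python
import json
import os

class LSMStore:
    """Write path sketch: commit log first, then memtable, flush to an SSTable."""

    MEMTABLE_LIMIT = 4  # illustrative threshold; real systems flush at tens of MB

    def __init__(self, data_dir: str) -> None:
        self.data_dir = data_dir
        os.makedirs(data_dir, exist_ok=True)
        self.commit_log = open(os.path.join(data_dir, "commit.log"), "a")
        self.memtable: dict[str, str] = {}
        self.sstable_count = 0

    def put(self, key: str, value: str) -> None:
        # 1. Append to the commit log (sequential write; replayed after a crash).
        self.commit_log.write(json.dumps({"k": key, "v": value}) + "\n")
        self.commit_log.flush()
        # 2. Apply to the in-memory memtable.
        self.memtable[key] = value
        # 3. When the memtable is full, flush it to an immutable, sorted SSTable.
        if len(self.memtable) >= self.MEMTABLE_LIMIT:
            self._flush()

    def _flush(self) -> None:
        path = os.path.join(self.data_dir, f"sstable-{self.sstable_count:06d}.json")
        with open(path, "w") as f:
            json.dump(dict(sorted(self.memtable.items())), f)  # sorted by key
        self.sstable_count += 1
        self.memtable.clear()
```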
Read Path
- Check Memtable / row cache.
- If miss, consult Bloom filters to skip SSTables that cannot contain the key.
- Read from candidate SSTables (and/or page cache), merge results (newest wins).
- Optionally perform read repair to update stale replicas.
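A sketch of this read path with a simple Bloom filter (bit-set size and hash count are arbitrary); an SSTable is modeled here as a (filter, dict) pair, and read repair is omitted:

```python
import hashlib

class BloomFilter:
    """Probabilistic membership test: no false negatives, occasional false positives."""

    def __init__(self, size_bits: int = 1024, num_hashes: int = 3) -> None:
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = 0  # bit set packed into a Python int

    def _positions(self, key: str):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key: str) -> bool:
        return all(self.bits & (1 << pos) for pos in self._positions(key))

def read(key: str, memtable: dict, sstables: list) -> str | None:
    """Memtable first; then only the SSTables whose Bloom filter may contain the key."""
    if key in memtable:
        return memtable[key]
    for bloom, table in reversed(sstables):  # newest SSTable first, so latest value wins
        if bloom.might_contain(key) and key in table:
            return table[key]
    return None
```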
Summary Table — Goal ↔ Technique

| Goal | Technique(s) |
|---|---|
| Even data distribution & easy scaling | Consistent hashing, vnodes |
| High availability | Replication (N), sloppy quorum, hinted handoff |
| Tunable consistency | R/W/N quorums, per-operation consistency levels |
| Detect/repair divergence | Gossip, read repair, anti-entropy (Merkle trees) |
| Durable & fast writes | Commit log + memtable (LSM) |
| Efficient reads | Bloom filters, compaction, caching |
| Handle heterogeneous servers | Vnodes proportional to capacity |