The Sunday Commit

On Saturday, we introduced the "Thundering Herd" problem (also known as a Cache Stampede).

The Scenario: You have a high-traffic endpoint (/get-daily-deals) cached in Redis. The cache is set to expire exactly at 12:00:00. At 11:59:59, you have 10,000 users reading the cache. Life is good. Database load is 0%. At 12:00:00, the key vanishes. Suddenly, all 10,000 users get a "MISS." They all turn to the database at the exact same millisecond to re-fetch the data.

Your database cannot handle 10,000 concurrent queries. It crashes. The cache never gets repopulated because the DB is dead. The site stays down.

1. The Solution: Jitter (Randomness)

The problem is synchronization. Everyone agreed to expire the key at the exact same time. To fix this, we need to introduce chaos.

When you set a TTL (Time To Live), never use a round number like "60 seconds." Always add Jitter.

The Wrong Way:

# Everyone expires at exactly 60s
redis.set("deals", data, ex=60)

The Right Way:

import random

# Expire between 55s and 65s
ttl = 60 + random.randint(-5, 5)
redis.set("deals", data, ex=ttl)

Now, if you have 1,000 cache keys, they won't all expire at once. The re-computation load is spread out over 10 seconds instead of 1 millisecond.

2. Advanced Fix: The "Probabilistic Early Expiration"

For single "Hot Keys" (like the homepage), Jitter isn't enough. You need the Locking Pattern.

  1. User A sees the cache is missing.

  2. User A acquires a Redis Lock (SET NX) saying "I am rebuilding this key."

  3. User A hits the DB.

  4. User B, C, and D see the cache is missing, but they see the lock exists.

  5. Instead of hitting the DB, they wait (or serve stale data) until User A finishes.

This ensures only one query hits the database, even if 10,000 users are waiting.

3. The Sunday Whiteboard

Let's look at a new problem for the week ahead.

The Scenario: You have 3 Redis servers (Node A, Node B, Node C). You distribute keys using simple modulo hashing: server = hash(key) % 3.

  • Key 1 goes to Node A.

  • Key 2 goes to Node B.

You add a new server (Node D) to handle more load. The formula becomes hash(key) % 4. Suddenly, 75% of your cache keys are invalid. Because the divisor changed from 3 to 4, the math shifted almost every key to a different server. Your cache hit rate drops to zero. Your database melts down.

The Question: What is the specific algorithm (visualized as a "Ring") that allows you to add/remove servers and only move 1/N keys, rather than reshuffling the entire cluster?

(Reply with the concept!)

4. THE PULSE: Recommended Reading

"Facebook's Memcached Paper" This is one of the most famous papers in distributed systems history. It explains how Facebook handles billions of requests using Memcached, and specifically how they solved the Thundering Herd problem using "Leases" (the locking pattern described above). Link: usenix.org/conference/nsdi13/technical-sessions/presentation/nishtala

5. THE LATENT SPACE

"Perfect synchronization is a bug."

In nature, if everyone steps on the bridge at the same time, the bridge collapses (resonance). In systems, if everyone hits the database at the same time, the server collapses. Randomness (Jitter) is the secret ingredient to stability.

See you on Monday.

See you tomorrow.
Harsh Kathiriya - Query & Context

Keep Reading