The "Username Taken" Problem.

You are building the sign-up page for "Face-o-gram." You have 1 Billion registered users. Every time a new user types a username, you need to check if it's taken.

Approach A: The SQL Query SELECT count(*) FROM users WHERE username = 'cool_guy'

  • Result: You hit the database disk 10,000 times a second. The DB melts. Latency is 200ms.

Approach B: The In-Memory Set Load all 1 billion strings into a Python set().

  • Result: 1 Billion strings * 30 bytes ≈ 30 GB of RAM. Your server crashes with Out Of Memory (OOM).

Approach C: The Bloom Filter You use a probabilistic bit-array.

  • Result: It uses 200 MB of RAM. It answers in 0.001ms. It never hits the database.

1. The Concept: "Definitely No" vs. "Maybe Yes"

A Bloom Filter is a space-efficient data structure that tells you whether an element is present in a set. It has a unique property:

  1. False Positives are possible: It might say "Yes, this username is taken" when it is actually free. (We can tune this error rate to 1%).

  2. False Negatives are impossible: If it says "No," the username is guaranteed to be free.

This "One-Sided Error" is perfect for caching.

  • If the Filter says "No", we let the user sign up immediately. (Zero DB Load).

  • If the Filter says "Maybe", we check the DB to be sure. (1% DB Load).

2. The Mechanics: Bit Flipping

Imagine an array of 10 bits, all set to 0. We have 3 hash functions.

To Insert "Alice":

  1. Hash 1("Alice") -> Index 2. Flip bit 2 to 1.

  2. Hash 2("Alice") -> Index 5. Flip bit 5 to 1.

  3. Hash 3("Alice") -> Index 9. Flip bit 9 to 1.

  • Array: [0, 0, 1, 0, 0, 1, 0, 0, 0, 1]

To Check "Alice": Run the hashes. Are bits 2, 5, and 9 all 1? Yes. Result: Maybe Present.

To Check "Bob": Hashes map to Indices 2, 4, and 8. Bit 4 is 0. Since Bit 4 is 0, "Bob" was never inserted. Result: Definitely Not Present.

You don't need to write the bit-math yourself. Use pybloom_live or Redis.

Python Implementation

from pybloom_live import ScalableBloomFilter

# 1. Create Filter (Targeting 0.1% error rate)
# It grows automatically as you add items.
sbf = ScalableBloomFilter(mode=ScalableBloomFilter.SMALL_SET_GROWTH, error_rate=0.001)

# 2. Add 1 million users
sbf.add("alice")
sbf.add("bob")

# 3. The Fast Check
query = "charlie"

if query in sbf:
    print("Maybe taken. Checking DB...")
    # Expensive DB call here
else:
    print("Available! (100% Certainty)")
    # No DB call needed

3. THE CEREBRAL GYM: Solution & New Puzzle

Yesterday's solution (The Bit Array)

The puzzle was: What data structure uses probability to check set membership without storing the actual data?

The Answer: The Bloom Filter. (As explained above).

Today's puzzle (Database Internals) Tuesday is for Storage Engines.

Standard databases (B-Trees) are slow at writing because they have to jump around the disk to update random pages. Modern "Write-Heavy" databases (like Cassandra and RocksDB) use a different structure. They write everything to a log file sequentially (append-only) in memory (MemTable). When memory is full, they flush it to disk as an immutable file (SSTable). Later, a background process merges these files to delete old data.

The Question: What is the specific name of this "Log-Structured" tree architecture?

(Reply with the acronym!)

4. THE PULSE: Tools of the day

  • RedisBloom Redis doesn't support Bloom Filters out of the box, but the RedisBloom module adds them. It is incredibly fast.

    • BF.ADD my_filter "user1"

    • BF.EXISTS my_filter "user1"

  • "The Case for Learned Index Structures" A Google Research paper that asks: "Can we replace a Bloom Filter with a Neural Network?" It turns out, if your data follows a pattern (e.g., all keys are integers), a Model is smaller and faster than a Hash. Link: arxiv.org/abs/1712.01208

5. THE LATENT SPACE

"Certainty is expensive. Probability is cheap."

Engineers are obsessed with 100% accuracy. But in distributed systems, 99% accuracy often costs 1% of the price. If you can handle a 1% false positive rate (by adding a secondary check), you can save 99% of your RAM.

Embrace the error.

See you tomorrow.
Harsh Kathiriya - Query & Context

Keep Reading