The most profitable query is the one you don't run.

Welcome to day 12/365!

If you look at your production logs, you will notice a pattern. Users are boring. They ask the same things over and over again.

  • "How do I reset my password?"

  • "What is the refund policy?"

  • "Help with error 505."

If 1,000 users ask "How do I reset my password?", most RAG systems will:

  1. Embed the query (Cost).

  2. Search the Vector DB (Latency).

  3. Send the context to the LLM (Massive Cost).

  4. Generate the answer (Latency).

You just paid for the same calculation 1,000 times.

In traditional web apps, we solve this with Caching (Redis). But standard caching requires an exact string match. If User A types "Reset password" and User B types "I want to change my password," a standard cache sees them as different.
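To make the limitation concrete, here is a minimal sketch of an exact-match cache using a plain Python dict. The cached answer text is made up for illustration; the point is only that the paraphrase misses.

```python
# Exact-match cache: the key is the raw query string.
exact_cache = {}

def cached_answer(query):
    """Return a cached answer only if this exact string was stored before."""
    return exact_cache.get(query)

# Hypothetical answer text, stored after User A's request.
exact_cache["Reset password"] = "Go to Settings > Security > Reset password."

hit = cached_answer("Reset password")                 # same string: found
miss = cached_answer("I want to change my password")  # same intent: None
```

User B has the same intent as User A, but the string differs, so the lookup returns nothing and the full pipeline runs again.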

In the AI era, we need Semantic Caching.

1. The concept: Fuzzy caching

Semantic Caching uses the vector embedding itself as the cache key, not the text string.

When a query comes in:

  1. We embed it into a vector.

  2. We search our Cache Vector Store (a lightweight, in-memory index like Redis Stack).

  3. We look for any previous query whose cosine similarity to the new one is above 0.95.

If we find a match (e.g., "Change password" is 0.98 similar to "Reset password"), we return the previously generated LLM response immediately.
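The lookup described above can be sketched in a few lines of plain Python, with no Redis involved. Everything here is illustrative: the SemanticCache class is a hypothetical helper, and the three-dimensional vectors in TOY_EMBEDDINGS are hand-written stand-ins for what a real embedding model would produce.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

class SemanticCache:
    """Hypothetical in-memory cache keyed by embedding instead of text."""

    def __init__(self, embed, threshold=0.95):
        self.embed = embed          # function: query text -> vector
        self.threshold = threshold  # minimum similarity that counts as a hit
        self.entries = []           # list of (vector, response) pairs

    def get(self, query):
        vector = self.embed(query)
        best_score, best_response = 0.0, None
        # Linear scan is fine for a sketch; a real store would use an index.
        for cached_vector, response in self.entries:
            score = cosine_similarity(vector, cached_vector)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score > self.threshold else None

    def put(self, query, response):
        self.entries.append((self.embed(query), response))

# Hand-made toy vectors (an assumption, not real model output).
TOY_EMBEDDINGS = {
    "Reset password": [0.90, 0.10, 0.05],
    "Change password": [0.88, 0.15, 0.05],
    "What is the refund policy?": [0.05, 0.10, 0.95],
}

cache = SemanticCache(embed=TOY_EMBEDDINGS.__getitem__)
cache.put("Reset password", "Use the 'Forgot password' link on the login page.")

paraphrase_hit = cache.get("Change password")             # close vector: hit
unrelated_miss = cache.get("What is the refund policy?")  # far vector: None
```

The threshold is the only knob that matters here. Set it too low and users get answers to questions they did not ask; set it too high and the cache behaves like an exact-match cache again.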

The result?

  • Latency: Drops from 2.5s to 0.05s.

  • Cost: $0.00 in LLM tokens for that request. You still pay for one embedding and one cache lookup.
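A quick back-of-the-envelope sketch of what this means across all your traffic, not just one lucky request. The 2.5s and 0.05s figures are the ones quoted above; the 40% hit rate is purely an assumption for illustration, so plug in the number from your own logs.

```python
def expected_latency(hit_rate, lookup_seconds=0.05, pipeline_seconds=2.5):
    """Average latency per request for a given cache hit rate.

    Every request pays for the cache lookup; only misses also pay
    for the full retrieve-and-generate pipeline.
    """
    return lookup_seconds + (1 - hit_rate) * pipeline_seconds

def llm_calls_needed(total_requests, hit_rate):
    """Number of requests that still reach the LLM."""
    return total_requests * (1 - hit_rate)

ASSUMED_HIT_RATE = 0.4  # assumption for illustration only

avg_latency = expected_latency(ASSUMED_HIT_RATE)      # about 1.55 seconds
llm_calls = llm_calls_needed(1000, ASSUMED_HIT_RATE)  # about 600 of 1,000
```

Note that a miss is slightly slower than before, because it pays for the lookup and then the full pipeline. The cache only earns its keep if the hit rate is meaningfully above zero.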

2. The Code

You don't need a complex new tool for this. You can build it with a simple wrapper around your LLM call.

We define a high threshold (0.95 or 0.98) because we only want to cache if the intent is virtually identical.

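import numpy as np
from redis import Redis
from redis.commands.search.query import Query
from sentence_transformers import SentenceTransformer

# 1. Setup Redis Vector Store (Simplified)
# Assumes an index named "cache_idx" already exists with a FLOAT32 "vector"
# field using the COSINE distance metric, plus a "response" text field.
redis_client = Redis(host='localhost', port=6379)
embedder = SentenceTransformer('all-MiniLM-L6-v2')

def semantic_cached_completion(user_query):
    # Step A: Embed the query
    vector = embedder.encode(user_query).astype(np.float32).tobytes()

    # Step B: Check Cache (KNN Search in Redis)
    # Fetch the single nearest cached query and its distance ("score")
    knn_query = (
        Query("*=>[KNN 1 @vector $blob AS score]")
        .return_fields("response", "score")
        .dialect(2)  # vector queries need query dialect 2 or later
    )
    cached_result = redis_client.ft("cache_idx").search(
        knn_query, query_params={"blob": vector}
    )

    # Step C: Hit or Miss?
    # Cosine distance = 1 - cosine similarity, so < 0.05 means similarity > 0.95
    if cached_result.docs and float(cached_result.docs[0].score) < 0.05:
        print("Cache HIT - Returning saved answer")
        return cached_result.docs[0].response

    # Step D: Cache Miss - Call the expensive API
    # call_openai is a placeholder for your own LLM call
    print("Cache MISS - Calling LLM")
    llm_response = call_openai(user_query)

    # Step E: Save for next time
    # save_to_redis is a placeholder that stores the vector and the response
    # in a hash covered by "cache_idx"
    save_to_redis(vector, llm_response)
    return llm_response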

This is the highest ROI code you can write today. It requires zero changes to your frontend.

3. THE CEREBRAL GYM: Solution & New Puzzle

Yesterday's solution (The Hotel Room)

The puzzle was: How do you prevent Alice and Bob from booking the same room without locking the database row while they read it, which would kill performance?

The Answer: Optimistic Locking (or Optimistic Concurrency Control).

You add a version column to the table.

  1. Alice reads the row. She sees version: 1.

  2. Bob reads the row. He sees version: 1.

  3. Alice tries to update: UPDATE rooms SET status='Booked', version=2 WHERE id=101 AND version=1. Success.

  4. Bob tries to update: UPDATE rooms SET status='Booked', version=2 WHERE id=101 AND version=1. Failure. The database returns "0 rows affected" because the version is now 2, not 1. Bob's application catches this and tells him "Sorry, room just taken."
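Here is a minimal, runnable sketch of those four steps using Python's built-in sqlite3 module and an in-memory database. The rooms table layout, the initial 'Free' status, and the helper names read_version and try_book are assumptions made for the demo.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE rooms (id INTEGER PRIMARY KEY, status TEXT, version INTEGER)"
)
conn.execute("INSERT INTO rooms VALUES (101, 'Free', 1)")
conn.commit()

def read_version(room_id):
    """Read the current version of a room without taking any lock."""
    row = conn.execute(
        "SELECT version FROM rooms WHERE id = ?", (room_id,)
    ).fetchone()
    return row[0]

def try_book(room_id, seen_version):
    """Book the room only if nobody changed it since we read it."""
    cursor = conn.execute(
        "UPDATE rooms SET status = 'Booked', version = ? "
        "WHERE id = ? AND version = ?",
        (seen_version + 1, room_id, seen_version),
    )
    conn.commit()
    # rowcount is 0 when the version no longer matches.
    return cursor.rowcount == 1

alice_version = read_version(101)  # Alice sees version 1
bob_version = read_version(101)    # Bob sees version 1

alice_ok = try_book(101, alice_version)  # True: version moves from 1 to 2
bob_ok = try_book(101, bob_version)      # False: 0 rows affected
```

The check and the write happen in a single UPDATE statement, which is what makes this safe. Reading the version and then updating in two separate steps without the WHERE clause would bring the race condition right back.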

Today's puzzle (Efficiency)

You are building a username registration system. You have 1 billion users. When a new user types a username, you need to check if it's taken. Hitting the database for every check is slow. You want a data structure that sits in memory and can tell you:

  • "No, this username is definitely NOT taken."

  • "Maybe this username is taken (Check the DB)."

It is probabilistic. It is tiny. It never gives a False Negative, but occasionally gives a False Positive. What is this structure called?

(Reply with the name!)

5. THE LATENT SPACE

"Efficiency is not about doing things faster. It is about doing fewer things."

We are obsessed with faster chips and faster models. But the fastest request is the one you never send. Before you buy a bigger GPU, check your logs. How many times did you answer the exact same question today?

Start caching.

See you tomorrow.
Harsh Kathiriya - Query & Context
