Query & Context

Welcome back!!

Quality over Quantity.

On Tuesday, we implemented Hybrid Search to make sure we didn't miss specific keywords (like error codes). Now, our retrieval pipeline is finding everything relevant.

But "everything" is a problem. If you retrieve 20 documents and feed them all into the LLM, two things happen:

Cost: You pay for massive input tokens.
The "Lost in the Middle" Effect: LLMs tend to overlook details buried in the middle of a long context, so the one chunk that actually answers the question can get lost among 19 that only loosely match.
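
To put rough numbers on the cost point, here is a back-of-the-envelope sketch. The chunk size and the price per million input tokens are made-up assumptions for illustration, not quotes from any provider, so plug in your own.

```python
# Back-of-the-envelope input cost. All numbers are illustrative assumptions.
TOKENS_PER_CHUNK = 400          # assumed average chunk size
PRICE_PER_MILLION_INPUT = 3.00  # assumed price in USD, not a real quote

def context_tokens(num_chunks, tokens_per_chunk=TOKENS_PER_CHUNK):
    """Tokens spent on retrieved context for one request."""
    return num_chunks * tokens_per_chunk

def cost_per_1000_requests(num_chunks):
    """Input cost in USD for 1,000 requests, counting context tokens only."""
    total_tokens = context_tokens(num_chunks) * 1000
    return total_tokens / 1_000_000 * PRICE_PER_MILLION_INPUT

for n in (20, 5):
    print(f"{n} chunks: {context_tokens(n)} tokens, "
          f"${cost_per_1000_requests(n):.2f} per 1,000 requests")
```

Under these assumptions, trimming 20 chunks to 5 cuts the context bill to a quarter, before counting any gain in answer quality.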

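You need to sort them. But your Vector DB sorts by Cosine Similarity, which is a "fuzzy" metric: it measures how close two embeddings are, not whether a chunk actually answers the question. It’s fast, but the ranking is often off.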

To fix RAG, we need a second brain: The Reranker.

1. THE CONCEPT: Bi-Encoders vs. Cross-Encoders

Why is your Vector DB so fast?

Because it relies on Bi-Encoders. The embedding model encodes each document into a vector once, ahead of time, and the DB stores it. When you search, it embeds the query and just does a quick math comparison against the stored vectors. It treats the Query and the Document as two separate strangers passing in the night.
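
Here is a minimal sketch of that "quick math comparison", in plain Python. The three-dimensional vectors and the doc IDs are invented for illustration. Real embeddings have hundreds of dimensions, and a real Vector DB uses an approximate index rather than scoring every document.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Pretend these were computed once, at indexing time (hypothetical values).
doc_vectors = {
    "doc_timeout": [0.9, 0.1, 0.0],
    "doc_billing": [0.1, 0.9, 0.2],
    "doc_login": [0.6, 0.3, 0.4],
}

def search(query_vector, top_k=2):
    """Score every stored vector against the query and keep the closest top_k."""
    scored = [
        (doc_id, cosine_similarity(query_vector, vec))
        for doc_id, vec in doc_vectors.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# The query is embedded on its own; it never "sees" the document text.
print(search([1.0, 0.0, 0.1]))
```

Notice that the documents were encoded before the query existed. That is why it is fast, and also why the score cannot capture how a specific question interacts with a specific passage.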

A Cross-Encoder (Reranker) is different.

It takes the Query and the Document and looks at them together at the same time. It asks: "How relevant is this specific sentence to this specific question?"

Bi-Encoder (Vector DB): Fast (around 0.01s per query). Good at finding the "Top 50" candidates.
Cross-Encoder (Reranker): Slow (around 0.5s to score those candidates, depending on model and hardware). Amazing at sorting the "Top 5" winners.

The Architecture:

We don't choose one. We chain them.

User Query → Vector DB (Get Top 50) → Reranker (Pick Top 5) → LLM.
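
The chain above can be sketched as one small function with two pluggable stages. Everything below the function is a toy stand-in so the sketch runs on its own: the corpus is invented, the "first stage" is plain word overlap rather than a vector search, and the "rerank score" is a crude overlap ratio rather than a real cross-encoder.

```python
def two_stage(query, first_stage, rerank_score, fetch_k=50, keep_n=5):
    """Over-fetch with a cheap retriever, then keep the best few after rescoring."""
    candidates = first_stage(query, fetch_k)             # Stage 1: fast and broad
    scored = [(doc, rerank_score(query, doc)) for doc in candidates]  # Stage 2
    scored.sort(key=lambda pair: pair[1], reverse=True)  # best score first
    return [doc for doc, _ in scored[:keep_n]]

# --- Toy stand-ins (hypothetical) so this runs without a DB or a model ---
CORPUS = [
    "billing error on annual plans",
    "upload error limits for free accounts",
    "how to reset a password",
    "error e4012 means the upload timed out",
]

def _words(text):
    return set(text.lower().split())

def toy_first_stage(query, k):
    # Stand-in for vector search: any doc sharing a word with the query,
    # returned in corpus order (so the order is not very meaningful).
    return [doc for doc in CORPUS if _words(query) & _words(doc)][:k]

def toy_rerank_score(query, doc):
    # Stand-in for a cross-encoder: fraction of query words found in the doc.
    query_words = _words(query)
    return len(query_words & _words(doc)) / len(query_words)

top_docs = two_stage("upload error E4012", toy_first_stage, toy_rerank_score,
                     fetch_k=50, keep_n=2)
print(top_docs)
```

The first stage returns the billing doc first. After rescoring, the E4012 doc moves to the top and the billing doc is dropped, which is the whole point of the second stage.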

2. THE CODE: The 2-Stage Retrieval Pattern

We will use the Cohere Rerank API (or the open-source bge-reranker from HuggingFace) to re-sort our results.

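import cohere

co = cohere.Client('YOUR_API_KEY')

# Assumed to exist already: `pinecone_index` (your Pinecone index handle)
# and `embed` (the function that turns text into a query vector).

def two_stage_retrieval(user_query):
    # Stage 1: Fast & Broad (Vector DB)
    # We fetch 50 documents, intentionally over-fetching.
    initial_results = pinecone_index.query(
        vector=embed(user_query),
        top_k=50,
        include_metadata=True  # Needed so each match carries its 'text'
    )
    # The matches live on the response's `.matches` attribute.
    docs = [match.metadata['text'] for match in initial_results.matches]

    # Stage 2: Slow & Precise (Reranker)
    # The Reranker reads all 50 and assigns a "Relevance Score" (0 to 1)
    reranked = co.rerank(
        model='rerank-english-v3.0',
        query=user_query,
        documents=docs,
        top_n=5  # We only keep the absolute best 5
    )

    # Each result carries the index of the document in the list we sent,
    # so we map back to our own text.
    return [docs[hit.index] for hit in reranked.results]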

# Code for using bge-reranker from HuggingFace

from sentence_transformers import CrossEncoder

# Load the open-source Reranker (Small & Fast)
# 'BAAI/bge-reranker-base' is a popular open-source choice for efficient reranking
reranker = CrossEncoder('BAAI/bge-reranker-base', max_length=512)

def rerank_local(user_query, initial_docs, top_n=5):
    # 1. Prepare pairs: [ [Query, Doc1], [Query, Doc2], ... ]
    # The Cross-Encoder needs to see them together to judge relevance.
    pairs = [[user_query, doc] for doc in initial_docs]
    
    # 2. Score them
    scores = reranker.predict(pairs)
    
    # 3. Sort by score (High to Low) and slice top_n
    # We zip them together to keep the score attached to the doc
    scored_docs = sorted(
        zip(initial_docs, scores), 
        key=lambda x: x[1], 
        reverse=True
    )
    
    # Return just the text of the winners
    return [doc for doc, score in scored_docs[:top_n]]

The Result: Cutting 50 candidates down to 5 means you send about 90% less text to the LLM, and the text you do send is the most relevant slice. Your answers get sharper, and your API bill gets smaller.

3. THE CEREBRAL GYM: Solution & New Puzzle

Yesterday's Solution (Python Traps)

The Challenge: Why did event_list=[] persist data between function calls?
The Answer: In Python, default arguments are evaluated only once, when the function is defined, not every time it runs. The list is created when the def statement executes, and every call that omits the argument shares that same list.
The Fix: Use None as the default sentinel.

def log_event(event, event_list=None):
    if event_list is None:
        event_list = []  # New list created every time
    event_list.append(event)
    return event_list
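
To see the trap and the fix side by side, here is a small runnable comparison. The buggy version is a reconstruction based on the event_list=[] signature from the challenge, and the name log_event_buggy is made up for this demo.

```python
def log_event_buggy(event, event_list=[]):
    # The default list is created once, when `def` runs, and then reused.
    event_list.append(event)
    return event_list

def log_event(event, event_list=None):
    if event_list is None:
        event_list = []  # New list created every time
    event_list.append(event)
    return event_list

first = log_event_buggy("login")
second = log_event_buggy("logout")
print(second)                        # ['login', 'logout']
print(first is second)               # True: both names point at the shared default
print(log_event_buggy.__defaults__)  # (['login', 'logout'],)

print(log_event("login"))            # ['login']
print(log_event("logout"))           # ['logout']
```

The __defaults__ tuple is where Python keeps the default values, so you can inspect the shared list directly.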

Today's Puzzle (SQL Optimization)

Back to the database.

The Scenario: You have a massive table called logs (1 billion rows) with a timestamp column. You need to delete all logs older than 1 year. You run: DELETE FROM logs WHERE timestamp < NOW() - INTERVAL '1 year';

The Disaster: The database freezes. The transaction log fills up. The site goes down.
The Question: Why did a simple DELETE crash the DB, and what is the specific technique to delete 100 million rows safely in production without locking the table?

(Reply with the technique!)

4. THE PULSE

Neurable’s gaming headset wants to read your mind: The new model specifically tuned for RAG and Tool Use is topping the leaderboards. It’s explicitly designed to work with their Reranker.
Asus scales up its MacBook Air rival: Asus is going big without getting heavy. Announced at CES 2026, the Zenbook A16 somehow delivers a massive 16-inch 3K OLED touchscreen while staying remarkably calm on the scale at just 2.65 pounds.
Jackery debuts a trio of new products ahead of CES, including a new robot: Jackery is the brand that helped popularize portable power stations, and for CES 2026, the company is introducing three new products — the Jackery Explorer 1500 Ultra, the Jackery Solar Gazebo, and the Jackery Solar Mars Bot.

5. THE LATENT SPACE

"Premature optimization is the root of all evil."

Donald Knuth

But Premature Filtering is the root of all good RAG pipelines. Don't ask the LLM to read garbage. It’s polite; it will try to make sense of it, and it will hallucinate to please you. Be ruthless with your data before it ever hits the prompt context.

Code at dawn.

See you tomorrow.
Harsh Kathiriya - Query & Context

Your Vector DB is fast. It is also dumb.
