Query & Context

Welcome to day 15/365!

More context is not better.

There is a marketing war happening right now. Gemini offers 2 Million tokens. Claude offers 200k. GPT-4o offers 128k. The sales pitch is: "Dump your entire database into the prompt and let the AI figure it out."

Do not do this.

Just because the model accepts the text doesn't mean it understands it. Research shows that LLMs suffer from the "Lost in the Middle" phenomenon:

Recall is strongest for information at the start of the prompt.
Recall is also strong for information at the end of the prompt.
Information buried in the middle (the bottom of the "U-Curve") is the most likely to be forgotten or hallucinated.

If you stuff 50 documents into your context window, the model will likely miss the specific detail in Document #25. The goal of RAG is not to give the model all the data. It is to give the model only the data that matters.
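One way to act on this, as a minimal sketch under stated assumptions: keep only the top-scoring chunks, then order them so the strongest ones sit at the edges of the prompt, where recall is best. The function names and the relevance scores are hypothetical; in a real pipeline the scores would come from your retriever or reranker.

```python
def select_top_k(scored_chunks, k):
    """Keep the k chunks with the highest relevance score, best first."""
    ranked = sorted(scored_chunks, key=lambda pair: pair[1], reverse=True)
    return [chunk for chunk, _score in ranked[:k]]


def reorder_for_edges(chunks_best_first):
    """Place the best chunks at the start and end, the weakest in the middle."""
    front, back = [], []
    for position, chunk in enumerate(chunks_best_first):
        (front if position % 2 == 0 else back).append(chunk)
    return front + back[::-1]


# Hypothetical (chunk, relevance score) pairs from a retriever.
retrieved = [
    ("shipping times", 0.41),
    ("refund window", 0.93),
    ("company history", 0.12),
    ("refund exceptions", 0.88),
    ("return labels", 0.67),
]

context = reorder_for_edges(select_top_k(retrieved, k=3))
print(context)
```

Two of the five chunks never reach the prompt at all, and the weakest survivor lands in the middle, where a miss costs the least.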

1. The Concept: Semantic Chunking

The first step to fixing this is how you cut your data. Most tutorials teach you Recursive Character Splitting: you chop the text into pieces of roughly 500 characters, with no regard for what the text is saying.

This is brutal. You often cut a sentence in half, or separate a question from its answer.

Chunk 1: "...the refund policy is strictly"
Chunk 2: "no refunds after 30 days."

If you search for "refund policy," you might get Chunk 1, but without Chunk 2, the AI doesn't know the actual rule.
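The failure is easy to reproduce. The sketch below uses a hypothetical helper, split_fixed, and a made-up policy sentence, with a chunk size of 30 chosen so the cut lands mid-sentence:

```python
def split_fixed(text, size):
    """Cut text into consecutive pieces of at most `size` characters."""
    return [text[i:i + size] for i in range(0, len(text), size)]


policy = "Our refund policy is strictly no refunds after 30 days."
chunks = split_fixed(policy, 30)

for number, chunk in enumerate(chunks, start=1):
    print(f"Chunk {number}: {chunk!r}")
```

The first chunk ends at "strictly" and the actual rule only appears in the second one, so a search that returns just the first chunk gives the model half a sentence.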

The solution is Semantic Chunking. Instead of splitting by character count, we split by meaning. We use an embedding model to scan the document. When the topic shifts (e.g., from "Refunds" to "Shipping"), the distance score spikes, and we make a cut there.
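The idea can be sketched in a few lines. This is a toy illustration, not how any particular library does it: embed is a hypothetical stand-in that counts words instead of calling a real embedding model, and the fixed threshold of 0.9 is an assumption chosen for this tiny example.

```python
import math
import re
from collections import Counter


def embed(sentence):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", sentence.lower()))


def cosine_distance(a, b):
    """1 - cosine similarity; 1.0 means the two vectors share nothing."""
    dot = sum(a[word] * b[word] for word in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return 1.0 - (dot / norm if norm else 0.0)


def semantic_chunks(sentences, threshold):
    """Start a new chunk wherever neighboring sentences are too far apart."""
    if not sentences:
        return []
    chunks = [[sentences[0]]]
    for previous, current in zip(sentences, sentences[1:]):
        if cosine_distance(embed(previous), embed(current)) > threshold:
            chunks.append([])  # topic shift: open a new chunk
        chunks[-1].append(current)
    return [" ".join(chunk) for chunk in chunks]


sentences = [
    "Refunds are strict.",
    "Refunds are not issued after 30 days.",
    "Shipping uses FedEx and UPS.",
    "Shipping takes five days.",
]
chunks = semantic_chunks(sentences, threshold=0.9)
for chunk in chunks:
    print(chunk)
```

The only cut falls between the refund sentences and the shipping sentences, because that is the one pair of neighbors with no words in common.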

2. The Code: Implementing Semantic Splitters

We can use the SemanticChunker from LangChain's experimental package (langchain_experimental) to do this. It embeds the sentences, measures the cosine distance between neighboring ones, and only breaks the chunk where that distance exceeds a threshold.

from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

# 1. Initialize the embedding model
# This is the "brain" that judges if two sentences are related.
# OpenAIEmbeddings reads the API key from the OPENAI_API_KEY environment variable.
embeddings = OpenAIEmbeddings()

# 2. Create the Splitter
# 'percentile' means we split at the biggest outliers in topic shifts.
splitter = SemanticChunker(
    embeddings, 
    breakpoint_threshold_type="percentile"
)

# 3. The raw text (A messy PDF extraction)
text = "The refund policy is strict. No refunds after 30 days. ... [Topic Change] ... Our shipping partners are FedEx and UPS."

# 4. Create meaningful chunks
docs = splitter.create_documents([text])

# Intended result (exact boundaries depend on the embedding model and threshold):
# Chunk 1: "The refund policy is strict. No refunds after 30 days."
# Chunk 2: "Our shipping partners are FedEx and UPS."

By keeping related ideas together, your Vector Search retrieves the whole context, and the LLM doesn't have to guess what was cut off.
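For intuition about the "percentile" setting in the code above, here is a minimal sketch of how a percentile threshold turns a list of sentence-to-sentence distances into breakpoints. The distances are made up, the function names are hypothetical, and the library's exact percentile calculation may differ from this one.

```python
import math


def percentile(values, pct):
    """Linear-interpolation percentile (pct in 0..100) of a non-empty list."""
    ordered = sorted(values)
    rank = pct * (len(ordered) - 1) / 100
    low, high = math.floor(rank), math.ceil(rank)
    return ordered[low] + (ordered[high] - ordered[low]) * (rank - low)


def find_breakpoints(distances, pct):
    """Indexes of the gaps whose distance exceeds the pct-th percentile."""
    threshold = percentile(distances, pct)
    return [i for i, d in enumerate(distances) if d > threshold]


# Hypothetical distances between consecutive sentences.
# The spike at index 3 is the topic shift.
distances = [0.10, 0.12, 0.08, 0.90, 0.11, 0.09]
breakpoints = find_breakpoints(distances, pct=80)
print(breakpoints)
```

Because the threshold is relative to the document's own distances, the same setting adapts to a tightly focused document and to one that jumps between topics.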

3. THE CEREBRAL GYM: Solution & New Puzzle

Yesterday's solution (Database Indexes)

The puzzle was: Why does a B-Tree index fail on low-cardinality columns (like is_active = true), and what should you use?

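The Answer: A Bitmap Index. A B-Tree shines on high-cardinality columns (IDs, Emails), where each lookup narrows the search to a handful of rows. For a Boolean (only 2 options), each value matches a huge share of the table, so walking the tree saves almost nothing over scanning every row. A Bitmap index instead stores a simple string of bits per value, such as 100101, where each bit says whether that row matches.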

User 1 (Active): 1
User 2 (Inactive): 0
User 3 (Inactive): 0

Database engines can combine these bitmaps using bitwise logical operations (AND/OR) incredibly fast.
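A toy version of the idea, using a Python integer as the bitmap. This is only a sketch of the concept, not how a database engine stores it; the is_admin column and the helper names are hypothetical.

```python
def build_bitmap(rows, predicate):
    """Set bit i when row i satisfies the predicate."""
    bitmap = 0
    for position, row in enumerate(rows):
        if predicate(row):
            bitmap |= 1 << position
    return bitmap


def matching_positions(bitmap):
    """Row positions whose bit is set."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]


users = [
    {"id": 1, "is_active": True, "is_admin": False},
    {"id": 2, "is_active": False, "is_admin": True},
    {"id": 3, "is_active": False, "is_admin": False},
    {"id": 4, "is_active": True, "is_admin": True},
]

active = build_bitmap(users, lambda user: user["is_active"])
admin = build_bitmap(users, lambda user: user["is_admin"])

# WHERE is_active AND is_admin becomes a single bitwise AND.
both = active & admin
print([users[i]["id"] for i in matching_positions(both)])
```

Each extra filter costs one more AND or OR over the bitmaps, regardless of how many rows each individual filter matches.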

Today's puzzle (Distributed Locking)

You have a Cron Job that runs every hour to send emails. You have 3 backend servers. You don't want all 3 servers to send the same emails (triple spam). You need a "Distributed Lock" so that only one server grabs the job. You use Redis. You set a key lock:email_job.

The Failure: Server A acquires the lock. It starts working. Then Server A crashes before it can release the lock. Now the lock exists forever. No other server can ever send emails again.

The Question: What is the specific feature/parameter you must add to the lock key to prevent this "Infinite Deadlock"?

(Reply with the parameter name!)

4. THE PULSE: High-Utility Drops

Stop using bad tools. Here are 3 upgrades for your workflow.

Arize Phoenix (LLM Tracing)
If you are building RAG without tracing, you are flying blind. You don't know why the retrieval failed. Phoenix is an open-source observability tool that visualizes your entire chain. You can see exactly what the Vector DB returned, what the Reranker discarded, and what the LLM saw.

vLLM (Production Serving)
If you are deploying open-source models (Llama 3, Mistral) in production, do not use HuggingFace Accelerate. Use vLLM. It uses a memory technique called PagedAttention (like virtual memory in an OS) to manage the attention cache, and the vLLM team reports up to 24x higher throughput than HuggingFace Transformers in its own benchmarks. It is one of the most widely used options for self-hosting. Link: github.com/vllm-project/vllm

Rye (Python Packaging)
We mentioned uv before, but Rye (built by the creator of Flask) is the all-in-one holy grail. It manages your Python version, your virtual environment, and your dependencies in a single tool. It feels like npm or cargo, but for Python.

5. THE LATENT SPACE

"Information is not knowledge."

Feeding a book to an LLM is like throwing a library at a student. They might catch a few pages, but they won't learn the subject. Your job as an engineer is to be the Librarian. Curate the data. Chunk it wisely. Serve only what is needed.

Focus.

See you tomorrow.
Harsh Kathiriya - Query & Context

The 1 million token lie.
