In traditional software, we write assertions:
assert add(2, 2) == 4. It passes, or it fails. It is binary.
In AI Engineering, we change a prompt and say: "Let me run it a few times and see if the answer feels better."
This is called "Vibe Checking." It is not engineering. It is gambling.
When you tweak a prompt to fix one edge case, you often silently break three others. If you don't have a systematic way to measure quality, you cannot optimize. You need Evals (Evaluations).
1. The Concept: LLM-as-a-Judge
You cannot use string matching (assert result == "Paris") because the model might answer "The capital is Paris." The test would fail, but the answer is correct.
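To make the failure concrete, here is a minimal sketch. The sample answers are made up for illustration. It also shows that loosening the check to a substring match is not a real fix, because a wrong answer that merely mentions the word would pass.

```python
expected = "Paris"
model_answer = "The capital is Paris."

# Exact match rejects a correct answer because of the extra words.
exact_match = model_answer == expected

# A substring check accepts the correct answer...
substring_match = expected in model_answer

# ...but it also accepts a wrong answer that happens to mention the word.
wrong_answer = "It is not Paris, it is Lyon."
false_positive = expected in wrong_answer

print(exact_match, substring_match, false_positive)
```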
The industry standard pattern is LLM-as-a-Judge. You use a strong model (GPT-4) to grade the output of your application model.
You define a Rubric:
Factuality: Is the answer free of contradictions with the provided context?
Tone: Is the answer professional?
Conciseness: Is the answer under 50 words?
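A rubric like this could be sketched in code as follows. This is one possible layout, not a standard: the names Criterion, JUDGED_CRITERIA, is_concise, and build_judge_prompt are hypothetical, and each question is worded so that YES always means "pass". Conciseness is a plain word count, so this sketch checks it in Python instead of spending a judge call on it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    name: str
    question: str  # phrased so that YES always means "pass"


# Criteria that need judgment go to the judge model.
JUDGED_CRITERIA = [
    Criterion("Factuality", "Is the answer free of contradictions with the context?"),
    Criterion("Tone", "Is the answer professional?"),
]


def is_concise(answer: str, max_words: int = 50) -> bool:
    """Conciseness is a plain word count, so no judge call is needed."""
    return len(answer.split()) < max_words


def build_judge_prompt(criterion: Criterion, answer: str, context: str) -> str:
    """Build one grading prompt per criterion (prompt wording is an assumption)."""
    return (
        "You are a grader.\n"
        f"Context: {context}\n"
        f"Answer: {answer}\n"
        f"{criterion.name}: {criterion.question}\n"
        "Reply with 'YES' or 'NO'."
    )


answer = "It closes at 5pm."
context = "The store is open from 9am to 5pm."
print(is_concise(answer))
for criterion in JUDGED_CRITERIA:
    print(build_judge_prompt(criterion, answer, context))
    print("---")
```

Asking the judge one criterion at a time keeps each verdict a single YES or NO, which is easier to assert on than a combined score.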
2. The Code: Writing a Pytest Eval
You can integrate this directly into your existing testing framework (pytest).
import pytest
from openai import OpenAI

import my_rag_app  # your application under test (placeholder name)

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def llm_judge(question, answer, context):
    # We ask GPT-4 to act as the teacher
    prompt = f"""
You are a grader.
Question: {question}
Context: {context}
Student Answer: {answer}
Does the Student Answer directly answer the Question using ONLY the Context?
Reply with 'YES' or 'NO'.
"""
    response = client.chat.completions.create(
        model="gpt-4",
        temperature=0,  # keep grading as repeatable as possible
        messages=[{"role": "user", "content": prompt}],
    )
    # Normalize so replies like "Yes." or "yes" still count as YES
    return response.choices[0].message.content.strip().strip(".'\"").upper()


# The "Unit Test"
def test_rag_hallucination():
    context = "The store is open from 9am to 5pm."
    question = "When does the store close?"

    # 1. Run your App
    my_app_answer = my_rag_app.query(question, context)
    # Output: "It closes at 5pm."

    # 2. Run the Judge
    grade = llm_judge(question, my_app_answer, context)

    # 3. Assert
    assert grade == "YES", f"Hallucination detected! Output: {my_app_answer}"
Now, every time you change your system prompt, you run pytest across your whole set of eval cases, not just one. If the pass rate drops from 95% to 80%, you do not deploy.
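A pass rate only exists once you have more than one case. Here is a rough sketch of such a gate, under stated assumptions: pass_rate, EVAL_CASES, and MIN_PASS_RATE are hypothetical names, and fake_app and fake_judge are offline stand-ins for your real app and judge so the sketch runs without an API key.

```python
def pass_rate(cases, run_app, judge):
    """Return the fraction of cases the judge grades as YES."""
    passed = 0
    for case in cases:
        answer = run_app(case["question"], case["context"])
        if judge(case["question"], answer, case["context"]) == "YES":
            passed += 1
    return passed / len(cases)


# Stand-ins so the sketch runs offline. Swap in your real app and judge.
def fake_app(question, context):
    return "It closes at 5pm." if "close" in question else "I don't know."


def fake_judge(question, answer, context):
    return "NO" if answer == "I don't know." else "YES"


EVAL_CASES = [
    {"question": "When does the store close?", "context": "The store is open from 9am to 5pm."},
    {"question": "When does the store open?", "context": "The store is open from 9am to 5pm."},
]
MIN_PASS_RATE = 0.95  # hypothetical deploy threshold

rate = pass_rate(EVAL_CASES, fake_app, fake_judge)
print(f"pass rate: {rate:.0%}")
print("deploy" if rate >= MIN_PASS_RATE else "do not deploy")
```

With the stand-ins above, one of the two cases fails, so the gate blocks the deploy. In a real suite you would likely want dozens of cases before trusting a percentage.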
3. THE CEREBRAL GYM: Solution & New Puzzle
Yesterday's solution (Database Indexing)
The puzzle was: You have an index on (last_name, first_name).
Query A: WHERE last_name = 'Smith'
Query B: WHERE first_name = 'John'
Which is slow?
The Answer: Query B is slow. This is the Left-Prefix Rule. A composite index is like a phone book. It is sorted by Last Name first. You can easily find "Smith", but you cannot find "John" without reading every single page of the book. The index is useless if you skip the first column.
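You can check this yourself with SQLite, which ships with Python. The sketch below is one way to do it; the table name people, the index name idx_name, and the extra email column are assumptions made up for the demo. It asks SQLite for its query plan for each query. Other databases, or SQLite after collecting statistics, may plan differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE people ("
    "id INTEGER PRIMARY KEY, last_name TEXT, first_name TEXT, email TEXT)"
)
conn.execute("CREATE INDEX idx_name ON people (last_name, first_name)")


def query_plan(sql):
    """Return SQLite's plan description for a query as one string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    # The human-readable detail is the last column of each plan row.
    return " | ".join(row[-1] for row in rows)


plan_a = query_plan("SELECT * FROM people WHERE last_name = 'Smith'")
plan_b = query_plan("SELECT * FROM people WHERE first_name = 'John'")

print("Query A:", plan_a)  # expected to search using idx_name
print("Query B:", plan_b)  # expected to scan the whole table
```

Query A can use the index because it filters on the leading column. Query B skips that column, so the planner falls back to reading every row.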
Today's puzzle (Distributed Systems)
You are designing a distributed database. According to the CAP Theorem, you can only have two of the three properties:
Consistency (Every read receives the most recent write).
Availability (Every request receives a (non-error) response).
Partition Tolerance (The system continues to operate despite network messages being dropped).
The Question: In a real-world distributed system (like the internet) where network failures will happen, which of the three is non-negotiable? (Meaning, you effectively only have a choice between the other two).
(Reply with the non-negotiable one!)
4. THE PULSE: Tools of the Week
Don't build your own eval framework. Use these.
Promptfoo The CLI tool for AI engineering. You define test cases in a simple YAML file (inputs and expected outputs), and promptfoo runs a matrix comparison of different models and prompts against each other. It generates a beautiful viewable report.
DeepEval An open-source evaluation framework specifically for RAG. It comes with pre-built metrics like "Context Recall," "Faithfulness," and "Answer Relevancy" so you don't have to write the Judge prompts yourself.
Braintrust If you are an enterprise team, this is the stack. It combines logging, a prompt playground, and evaluations in one platform. It feels like "DataDog for LLMs."
5. THE LATENT SPACE
"If you can't measure it, you can't improve it."
We spent the last year being amazed that LLMs could talk. Now we are in the phase where we need them to talk correctly. The era of the "Demo" is over. The era of the "Reliable Product" has begun.
Start testing.
See you tomorrow.
Harsh Kathiriya - Query & Context

