Stop waiting for the server.

We start the week by fixing the most common performance bottleneck in Python AI applications: Serial Execution.

Say you are building an agent that needs to:

  1. Search Google (1s)

  2. Query the Vector DB (0.5s)

  3. Call GPT-4 (3s)

In standard, synchronous Python, these steps run one after another, so the total is 1 + 0.5 + 3 = 4.5 seconds. Your code sits idle, doing absolutely nothing while it waits for the network responses to come back. This is called blocking I/O.

Even in 2026, LLM calls are slow. If you run them in a serial chain, your application will feel sluggish. Any tasks that do not depend on each other should be running concurrently.
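Here is a rough sketch of what that buys the three-step agent above. It assumes (this is not stated above) that the GPT-4 call needs both the search and the vector results, so only steps 1 and 2 can overlap. The function names are hypothetical, asyncio.sleep stands in for the real network calls, and the delays are scaled down so the demo finishes quickly. The asyncio pieces are explained in the sections that follow.

```python
import asyncio
import time

SCALE = 0.2  # shrink the delays for the demo; set to 1.0 for the real-size waits

async def search_google(query: str) -> str:
    await asyncio.sleep(1.0 * SCALE)   # stand-in for the ~1s search
    return f"search results for {query!r}"

async def query_vector_db(query: str) -> str:
    await asyncio.sleep(0.5 * SCALE)   # stand-in for the ~0.5s vector lookup
    return f"vector hits for {query!r}"

async def call_llm(prompt: str) -> str:
    await asyncio.sleep(3.0 * SCALE)   # stand-in for the ~3s LLM call
    return f"answer based on: {prompt}"

async def answer(query: str) -> str:
    # Steps 1 and 2 are independent, so they can wait at the same time.
    web, docs = await asyncio.gather(search_google(query), query_vector_db(query))
    # Step 3 is assumed to need both results, so it runs afterwards.
    return await call_llm(f"{web} | {docs}")

start = time.perf_counter()
reply = asyncio.run(answer("async python"))
elapsed = time.perf_counter() - start

print(reply)
print(f"elapsed: {elapsed:.2f}s at SCALE={SCALE}")
```

Under that assumption the wait is about max(1, 0.5) + 3 = 3.5 seconds instead of 4.5. The bigger wins come when more of the slow steps are independent of each other.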

1. The Concept: The Event Loop

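For I/O-bound work like this, you don't need multi-threading (which is hard to get right). You need asyncio: a single thread runs an event loop that switches to another task whenever the current one is waiting on the network.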

Imagine a waiter in a restaurant.

  • Synchronous: The waiter takes Order A, walks to the kitchen, waits there for 20 minutes until the food is cooked, brings it back, and only then takes Order B.

  • Asynchronous: The waiter takes Order A, hands it to the kitchen, and immediately goes to take Order B while the kitchen cooks.

LLM calls are "Kitchen work." Your Python script is the "Waiter." Don't let the waiter stand in the kitchen.
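The waiter analogy can be sketched in a few lines. This is a toy illustration, not production code: handle_order and the cook times are made up, and asyncio.sleep plays the kitchen.

```python
import asyncio

events: list[str] = []

async def handle_order(name: str, cook_time: float) -> None:
    events.append(f"take {name}")
    # While the "kitchen" works, the event loop is free to run other orders.
    await asyncio.sleep(cook_time)
    events.append(f"serve {name}")

async def waiter() -> None:
    await asyncio.gather(
        handle_order("A", 0.2),  # slow dish
        handle_order("B", 0.1),  # fast dish
    )

asyncio.run(waiter())
print(events)
# ['take A', 'take B', 'serve B', 'serve A']
```

Order B is taken before Order A is served, and the faster dish comes out first. Nothing ran on a second thread; the single waiter just never stood idle in the kitchen.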

2. THE FIX: asyncio.gather

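Most tutorials show you requests.get(). The requests library is blocking: each call holds up the whole program until the response arrives. Switch to an async HTTP client such as httpx or aiohttp, and run independent calls together with asyncio.gather.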

The Old Way (Serial - 2s):

import time

def get_google():
    time.sleep(1) # Simulating network lag
    return "Google Result"

def get_vector_db():
    time.sleep(1)
    return "Vector Result"

# Total time: 2 seconds
results = [get_google(), get_vector_db()]

The New Way (Parallel - 1.0s):

import asyncio

async def get_google():
    await asyncio.sleep(1) # Yields control
    return "Google Result"

async def get_vector_db():
    await asyncio.sleep(1) # Yields control
    return "Vector Result"

async def main():
    # gather starts BOTH coroutines and lets them wait concurrently.
    # Total time is determined by the slowest task (1s), not the sum.
    results = await asyncio.gather(get_google(), get_vector_db())
    return results

# Total time: 1 second
results = asyncio.run(main())
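One trap worth checking for yourself: making functions async is not enough. If you await them one by one, they still run serially. A minimal timing sketch, assuming shortened 0.2s delays (the DELAY constant and the timed helper are mine, added for illustration):

```python
import asyncio
import time

DELAY = 0.2  # shortened from the 1s used above so the comparison runs quickly

async def get_google() -> str:
    await asyncio.sleep(DELAY)
    return "Google Result"

async def get_vector_db() -> str:
    await asyncio.sleep(DELAY)
    return "Vector Result"

async def one_after_another() -> list[str]:
    # Still serial: the second call does not start until the first finishes.
    return [await get_google(), await get_vector_db()]

async def together() -> list[str]:
    # Both coroutines are started up front and wait concurrently.
    return list(await asyncio.gather(get_google(), get_vector_db()))

def timed(coro_fn):
    """Run an async function to completion and return (result, seconds)."""
    start = time.perf_counter()
    result = asyncio.run(coro_fn())
    return result, time.perf_counter() - start

serial_result, serial_time = timed(one_after_another)
gather_result, gather_time = timed(together)

print(f"await one by one: {serial_time:.2f}s")
print(f"asyncio.gather:   {gather_time:.2f}s")
```

Both versions return the same results in the same order. Only the gather version overlaps the waiting.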

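The Unlock: If your Agent needs to generate 3 different summaries for 3 different documents, do not loop through them. Gather them. The three calls wait at the same time, so the total is roughly the time of the slowest one: close to 3x the speed, as long as your API rate limits allow it.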
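A minimal sketch of the "gather, don't loop" pattern. The summarize function is a hypothetical stand-in for your real LLM call, and the document names are placeholders.

```python
import asyncio

async def summarize(doc: str) -> str:
    # Hypothetical stand-in for an LLM request; the sleep simulates latency.
    await asyncio.sleep(0.1)
    return f"summary of {doc}"

async def summarize_all(docs: list[str]) -> list[str]:
    # One coroutine per document; gather returns results in input order.
    return await asyncio.gather(*(summarize(doc) for doc in docs))

docs = ["report.pdf", "notes.txt", "spec.md"]
summaries = asyncio.run(summarize_all(docs))

for doc, summary in zip(docs, summaries):
    print(f"{doc}: {summary}")
```

Because gather preserves input order, zipping the results back onto the documents is safe. For long lists you would likely want to cap how many requests run at once, for example with asyncio.Semaphore.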

3. THE CEREBRAL GYM: Solution & New Puzzle

Yesterday's solution (Unique IDs)

The puzzle was: How do Server A and Server B generate unique IDs without talking to each other and without colliding?

The Answer: Snowflake IDs (Twitter Architecture).

You use a 64-bit integer, but you split the bits:

  • 1 bit: Sign bit, left unused (always 0).

  • 41 bits: Timestamp (milliseconds since a custom epoch).

  • 10 bits: Machine ID / Node ID (Server A is 1, Server B is 2).

  • 12 bits: Sequence Number (a counter that resets every millisecond).

Because the Machine ID is baked into every ID, Server A can never generate the same ID as Server B, even in the exact same millisecond, as long as each server is given a different Machine ID.
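The bit layout can be sketched as a small generator. Treat this as an illustration of the idea rather than Twitter's actual implementation: the class name, the epoch value (2020-01-01 UTC), and the way a backwards clock is handled are all my assumptions.

```python
import threading
import time

EPOCH_MS = 1_577_836_800_000  # arbitrary custom epoch (2020-01-01 UTC), an assumption
MACHINE_BITS = 10
SEQUENCE_BITS = 12
MAX_SEQUENCE = (1 << SEQUENCE_BITS) - 1

def _now_ms() -> int:
    return time.time_ns() // 1_000_000

class SnowflakeGenerator:
    def __init__(self, machine_id: int) -> None:
        if not 0 <= machine_id < (1 << MACHINE_BITS):
            raise ValueError("machine_id must fit in 10 bits")
        self.machine_id = machine_id
        self._last_ms = -1
        self._sequence = 0
        self._lock = threading.Lock()

    def next_id(self) -> int:
        with self._lock:
            # Simplification: if the clock steps backwards, reuse the last timestamp.
            now = max(_now_ms(), self._last_ms)
            if now == self._last_ms:
                self._sequence = (self._sequence + 1) & MAX_SEQUENCE
                if self._sequence == 0:
                    # 4096 IDs already issued this millisecond: wait for the next one.
                    while now <= self._last_ms:
                        now = _now_ms()
            else:
                self._sequence = 0
            self._last_ms = now
            return (
                ((now - EPOCH_MS) << (MACHINE_BITS + SEQUENCE_BITS))
                | (self.machine_id << SEQUENCE_BITS)
                | self._sequence
            )

server_a = SnowflakeGenerator(machine_id=1)
server_b = SnowflakeGenerator(machine_id=2)
ids_a = [server_a.next_id() for _ in range(1000)]
ids_b = [server_b.next_id() for _ in range(1000)]
print(ids_a[0], ids_b[0])
```

The two generators never talk to each other, yet their IDs cannot overlap because bits 12 to 21 always differ. As a bonus, IDs from a single server sort by creation time.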

Today's puzzle (Database Indexing)

Monday is for SQL.

You have a table users with columns: last_name, first_name, age. You create a Composite Index on (last_name, first_name).

Query A: SELECT * FROM users WHERE last_name = 'Smith'

Query B: SELECT * FROM users WHERE first_name = 'John'

The Question: One of these queries will be instant (uses the index). The other will be slow (Full Table Scan). Which one is slow, and why?

(Reply with the slow one!)
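If you want to check your answer before replying, here is a small sketch using Python's built-in sqlite3 module. The index name and the sample rows are made up, and SQLite's query planner is only a stand-in: Postgres or MySQL may word their plans differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (last_name TEXT, first_name TEXT, age INTEGER)")
conn.execute("CREATE INDEX idx_users_name ON users (last_name, first_name)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?, ?)",
    [("Smith", "John", 30), ("Smith", "Anna", 41), ("Jones", "John", 25)],
)

def query_plan(sql: str) -> str:
    """Return SQLite's plan for a query; the last column is the readable detail."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return "; ".join(row[-1] for row in rows)

plan_a = query_plan("SELECT * FROM users WHERE last_name = 'Smith'")
plan_b = query_plan("SELECT * FROM users WHERE first_name = 'John'")

print("Query A:", plan_a)
print("Query B:", plan_b)
```

Run it and look for which plan mentions the index and which one scans the whole table.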

4. THE PULSE: Tools of the day

Speed is the theme. Here are 3 tools to make your code fly.

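  • FastAPI (The Async Standard) If you are still using Flask for your AI API, reconsider. Flask is a WSGI framework built around blocking request handlers; it gained async views in 2.0, but each request still ties up a worker while it waits. FastAPI is built on Starlette and supports async/await natively. It has become a go-to choice for serving LLMs.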

  • 🐻 Polars (The Panda Killer) Pandas is great, but it is mostly single-threaded and memory-hungry. Polars is a DataFrame library written in Rust. It uses all cores of your CPU and offers a lazy API that optimizes the whole query before running it. On large datasets it is often many times faster than Pandas, and it can get through workloads that would exhaust Pandas' memory.

  • 🦗 Locust (Load Testing) How do you know if your async code holds up under load? You need to hammer it. Locust lets you define simulated "Users" in Python, spawn thousands of them concurrently against your API, and visualize the latency.

5. THE LATENT SPACE

"Time is the only resource you cannot scale."

We buy more RAM. We rent more GPUs. But we cannot buy more seconds. If your user has to wait 10 seconds for an answer, they will leave. Concurrency isn't just an optimization; it's user retention.

Don't make them wait.

See you tomorrow.
Harsh Kathiriya - Query & Context
