From ETL to EtLT: Why Vectorization Belongs Upstream

Welcome to the Shift.

Happy New Year.

If you are reading this, you are part of the "Day 1" crew. Thank you for trusting me with your inbox.

For the next 365 days, we are going to disassemble the Modern Data Stack and rebuild it for the AI age. We aren't just building pipelines anymore; we are building context engines.

Let’s get to work.

1. THE CORE: From ETL to EtLT

The "Little t" is Vectorization.

For the last five years, the industry standard has been ELT (Extract, Load, Transform) or ETL (Extract, Transform, Load). We dumped raw JSON into Snowflake or Databricks, and let dbt handle the cleaning later. It was efficient. It was scalable.

But for AI applications, ELT is becoming a bottleneck.

Why? Because a RAG system does not search over raw text; it searches over Vectors (embeddings), and the Large Language Model (LLM) only sees what that search returns.

If you follow the traditional ELT pattern for a RAG (Retrieval Augmented Generation) pipeline, your flow looks like this:

Extract: Pull data from API.
Load: Write raw text to Data Warehouse.
Transform (SQL): Clean the data.
Sync (Reverse ETL): Read the data back out, send it to an Embedding Model (OpenAI/Cohere), receive the vectors, and write them to a Vector DB (Pinecone/Weaviate).

The Problem: This introduces massive latency and unnecessary data movement (egress costs). You are moving data three times just to make it searchable.

The New Pattern: EtLT

In 2026, the most efficient architectures are moving the embedding step upstream. This is the "little t", a micro-transformation that happens in-flight.

The Flow:

Extract: Pull data from API.
transform (In-Memory): Chunk the text and hit the Embedding API immediately.
Load: Write the text and Metadata to your Warehouse AND the Vectors to your Vector DB in the same step.
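
The three steps above can be sketched in a few lines. This is a minimal illustration, not a production pipeline: chunk_text, fake_embed, and the two in-memory dicts standing in for the warehouse and the vector DB are all hypothetical placeholders. A real version would call an embedding API and real client libraries instead.

```python
from typing import Dict, Iterable, List


def chunk_text(text: str, max_words: int = 50) -> List[str]:
    """Split text into chunks of at most max_words words (naive, hypothetical)."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def fake_embed(chunks: List[str]) -> List[List[float]]:
    """Stand-in for an embedding API call; returns one toy 2-d vector per chunk."""
    return [[float(len(c)), float(len(c.split()))] for c in chunks]


def etlt(records: Iterable[dict], warehouse: Dict[str, dict],
         vector_db: Dict[str, List[float]]) -> None:
    for record in records:                      # Extract (already pulled from the API)
        chunks = chunk_text(record["content"])  # little t: chunk in memory...
        vectors = fake_embed(chunks)            # ...and embed in flight
        for n, (chunk, vector) in enumerate(zip(chunks, vectors)):
            key = f"{record['id']}#{n}"         # same key in both stores
            warehouse[key] = {"source_id": record["id"], "text": chunk}  # Load 1
            vector_db[key] = vector                                      # Load 2


warehouse: Dict[str, dict] = {}
vector_db: Dict[str, List[float]] = {}
etlt([{"id": "doc-1", "content": "hello vector world"}], warehouse, vector_db)
```

Writing both stores under the same key is the design choice that matters here: it lets you join a vector hit back to its warehouse row later.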

Why this wins:

No Sync Lag: Your data is "AI-Ready" as soon as it lands, with no second pipeline to wait on.
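Tighter Consistency: The text in your Warehouse and the vector in your Pinecone index come from the same process and the same batch, so they drift far less than with a separate sync job. The two writes are still not a single transaction, so a failed write has to be retried.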

2. THE TOOLKIT: Smart Batching for Embeddings

Most engineers write simple loops to process embeddings.

Bad approach: Loop through every record and hit the OpenAI API. (Slow, hits rate limits).
Better approach: Send a giant list at once. (Fewer calls, but fails on per-request size limits and memory overflow).

The Production Approach: Use Python Generators to lazy-load batches. As long as the source itself is streamed rather than loaded into one list, this keeps your memory footprint low even if you are processing 10GB of text.

from itertools import islice
from typing import Iterable, Iterator, List, Tuple

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# 1. The Generator
# Yields fixed-size batches from any iterable (list, file reader, API cursor)
# without materializing the whole input in RAM
def generate_batches(records: Iterable[dict], batch_size: int = 100) -> Iterator[List[dict]]:
    iterator = iter(records)
    while batch := list(islice(iterator, batch_size)):
        yield batch

# 2. The In-Flight Transformation
def vector_etlt_process(
    raw_records: Iterable[dict], batch_size: int = 100
) -> Iterator[List[Tuple[str, List[float]]]]:
    # Batch whole records (not just texts) so IDs and texts cannot drift apart
    for batch in generate_batches(raw_records, batch_size):
        # Extract just the text content we need to embed
        texts = [r['content'] for r in batch]
        try:
            # The "little t": Embedding happens here
            response = client.embeddings.create(
                input=texts,
                model="text-embedding-3-small"
            )
        except Exception as e:
            print(f"Batch failed: {e}")
            # In production, send this batch to a Dead Letter Queue (DLQ)
            continue

        # Each item carries the index of its input; sort so vectors line up with the batch
        ordered = sorted(response.data, key=lambda item: item.index)
        vectors = [item.embedding for item in ordered]

        # Yield the result immediately for the "Load" step
        # We zip the original ID with the new Vector
        yield [(r['id'], v) for r, v in zip(batch, vectors)]

# Usage:
# for vector_batch in vector_etlt_process(large_dataset):
#     pinecone_index.upsert(vectors=vector_batch)
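
The "hits rate limits" failure mode mentioned earlier is usually handled by retrying with exponential backoff before giving up on a batch. The sketch below is one hedged way to do that. with_backoff and its parameters are hypothetical names, and a real pipeline would catch the client's rate-limit exception specifically rather than every Exception.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_backoff(call: Callable[[], T], max_attempts: int = 5,
                 base_delay: float = 1.0,
                 sleep: Callable[[float], None] = time.sleep) -> T:
    """Run call(); on failure wait base_delay * 2**attempt (plus jitter) and retry."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of retries: let the caller route the batch to a DLQ
            delay = base_delay * (2 ** attempt)
            # Up to 10% random jitter so parallel workers do not retry in lockstep
            sleep(delay + random.uniform(0, delay * 0.1))
    raise AssertionError("loop always returns or raises")
```

In the batch loop, the embedding request would be wrapped in a zero-argument lambda and passed to with_backoff, so only batches that fail every attempt reach the DLQ.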

3. THE PULSE: Industry Signals

Vector Database Wars: Pinecone and Weaviate are pushing "Serverless" architectures aggressively. In 2026, paying for idle pods is officially an anti-pattern.
Databricks vs. Snowflake: The battleground has shifted from "Who has the best SQL engine?" to "Who has the best Native Vector integration?" Snowflake's Cortex is making a strong play to keep data inside the perimeter.
OpenAI Costs: While model prices keep dropping, embedding costs are becoming a larger percentage of the bill for RAG-heavy applications. Optimization here (like caching embeddings) is high-ROI work.
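
As a rough illustration of the caching idea in the last item: if the same text gets embedded more than once (re-runs, duplicated chunks), a content-hash lookup can skip the repeat API call. This sketch assumes a hypothetical embed_fn that maps a list of texts to a list of vectors, and uses a plain dict where a real system would use a persistent store.

```python
import hashlib
from typing import Callable, Dict, List

Vector = List[float]


def embed_with_cache(texts: List[str],
                     embed_fn: Callable[[List[str]], List[Vector]],
                     cache: Dict[str, Vector]) -> List[Vector]:
    """Return one vector per text, calling embed_fn only for unseen content."""
    keys = [hashlib.sha256(t.encode("utf-8")).hexdigest() for t in texts]
    # Deduplicate misses while preserving order, so each new text is sent once
    missing = {k: t for k, t in zip(keys, texts) if k not in cache}
    if missing:
        new_vectors = embed_fn(list(missing.values()))
        cache.update(zip(missing.keys(), new_vectors))
    return [cache[k] for k in keys]
```

If you switch embedding models, the cache key should also include the model name, since vectors from different models are not interchangeable.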

4. THE CEREBRAL GYM: SQL Logic

Keep your SQL sharp. Reply with your answer.

The Challenge: You have a streaming table user_clicks that receives duplicate events due to network retries. You need to keep only the latest event for each click_id.

The Table: click_id (varchar), clicked_at (timestamp), metadata (json)

The Question: DELETE with a JOIN is often slow on massive tables. How would you solve this with a window function in a modern warehouse (Snowflake/BigQuery)?

(I'll share the most elegant solution in tomorrow's edition).

5. THE LATENT SPACE

"Data matures like wine, applications like fish."

James Governor

As Data Engineers, we often obsess over the application layer (the fish). We want the newest tool, the newest framework. But our primary responsibility is the data (the wine).

If you structure your data correctly (clean schemas, proper vectorization, good governance), it gets more valuable over time as AI models get smarter. If you build for the "tool of the week," you're just selling expired fish.

See you tomorrow.

— Harsh Kathiriya