Query & Context

The "Copy-Paste" Problem.

Welcome to Part 1 of Data Engineering 101. Before we can analyze data, we have to move it from its source (where it's created) to a destination (where it's analyzed).

This sounds easy. "Just write a script to query the database and save it as a CSV."

Here is what happens when you do that in the real world:

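You write a script that runs SELECT * FROM transactions against the production e-commerce database.
The full-table scan saturates I/O and, depending on the engine and isolation level, holds locks that block writes.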
Customers can't buy anything.
The CTO calls you.

Data Ingestion is the art of moving data from Point A to Point B reliably, efficiently, and, most importantly, without impacting the source system.
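One gentler alternative to a single giant SELECT * is to read the table in small, ordered chunks, so that no one query runs for long. Here is a minimal sketch of that idea (keyset pagination). It uses an in-memory SQLite database as a stand-in for a read replica; the `id` and `amount` columns and the `read_in_chunks` helper are assumptions made for illustration, not part of any real schema.

```python
import sqlite3
from typing import Iterator


def read_in_chunks(conn: sqlite3.Connection, chunk_size: int = 1000) -> Iterator[list]:
    """Yield rows from `transactions` in small, id-ordered chunks (keyset pagination)."""
    last_id = 0  # assumes ids are positive integers
    while True:
        rows = conn.execute(
            "SELECT id, amount FROM transactions WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, chunk_size),
        ).fetchall()
        if not rows:
            return
        yield rows
        last_id = rows[-1][0]  # the next query resumes after the last id we saw


# Demo: an in-memory database standing in for the source system.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transactions (id INTEGER PRIMARY KEY, amount REAL)")
conn.executemany(
    "INSERT INTO transactions (amount) VALUES (?)",
    [(float(i),) for i in range(2500)],
)

chunk_sizes = [len(chunk) for chunk in read_in_chunks(conn, chunk_size=1000)]
print(chunk_sizes)  # [1000, 1000, 500]
```

Each query touches at most `chunk_size` rows and uses the primary key to find its starting point, so the source gets breathing room between reads. Pointing the script at a replica instead of the primary would reduce the impact further.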

To do this, you have two fundamental architectural choices.

The Great Debate: Batch vs. Streaming

Forget the tools for a second. Forget Kafka and Spark. Ingestion comes down to one question: How much latency can you tolerate?

Do you need the data now, or is tomorrow morning okay?

1. The Cargo Ship: Batch Processing

Think of Batch Ingestion like a massive cargo container ship.

How it works: It sits at the dock, waiting for data to pile up. Once a day (or hour), it loads everything onboard and sets sail. It moves a massive amount of stuff all at once.
The Vibe: "Give me a file containing everything that happened yesterday."
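Pros: It's efficient. It's easier to manage. If it fails, you can usually just rerun the job, provided the job is idempotent (a rerun replaces the earlier output instead of duplicating it).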
Cons: High latency. Your dashboard is always looking at yesterday's news.
Use Case: End-of-day financial reconciliation, training large ML models.
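To make "give me everything that happened yesterday" concrete, here is a minimal sketch of a daily batch export that is safe to rerun. The table layout, the `created_on` column, the file naming, and the `export_day` helper are all assumptions made for this example.

```python
import csv
import sqlite3
import tempfile
from pathlib import Path


def export_day(conn: sqlite3.Connection, day: str, out_dir: Path) -> Path:
    """Write all transactions for `day` (YYYY-MM-DD) to one CSV, replacing any earlier run."""
    rows = conn.execute(
        "SELECT id, amount, created_on FROM transactions "
        "WHERE created_on = ? ORDER BY id",
        (day,),
    ).fetchall()
    target = out_dir / f"transactions_{day}.csv"
    tmp = target.with_name(target.name + ".tmp")
    with tmp.open("w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["id", "amount", "created_on"])
        writer.writerows(rows)
    tmp.replace(target)  # swap the finished file into place, overwriting any old copy
    return target


# Demo: a tiny in-memory source table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE transactions (id INTEGER PRIMARY KEY, amount REAL, created_on TEXT)"
)
conn.executemany(
    "INSERT INTO transactions (amount, created_on) VALUES (?, ?)",
    [(10.0, "2024-01-01"), (25.5, "2024-01-01"), (7.0, "2024-01-02")],
)

out_dir = Path(tempfile.mkdtemp())
first = export_day(conn, "2024-01-01", out_dir).read_text()
second = export_day(conn, "2024-01-01", out_dir).read_text()  # rerun after a "failure"
print(first == second)  # True
```

Because the output file is keyed by date and overwritten on each run, rerunning the job for the same day leaves exactly one file with the same contents. That property is what makes "just rerun it" a safe recovery plan.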

2. The Water Pipe: Streaming

Think of Streaming Ingestion like turning on a faucet.

How it works: As soon as a drop of water (a data event) appears at the source, it immediately flows down the pipe to the destination. It never stops.
The Vibe: "Tell me right now the exact second a user clicks 'Buy'."
Pros: Real-time insights. Immediate reaction to fraud or failures.
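Cons: High complexity. If the pipe bursts, water goes everywhere, and you can't always just "rerun the water": replay only works if the source or broker still retains the events. You need robust error handling, such as dead letter queues that set aside events that fail processing.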
Use Case: Fraud detection, live inventory counts, cybersecurity alerting.
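The dead letter queue idea can be sketched in a few lines. This is a toy version under stated assumptions: events arrive as JSON strings, the "queue" is a plain Python list, and the rule that a purchase needs a positive `amount` is made up for illustration. A real pipeline would read from a broker and write failures to a separate topic or table.

```python
import json
from typing import Iterable


def process_stream(raw_events: Iterable[str]) -> tuple:
    """Handle events one at a time; park failures in a dead letter queue instead of stopping."""
    delivered = []
    dead_letters = []
    for raw in raw_events:
        try:
            event = json.loads(raw)
            # Hypothetical validation rule for this sketch.
            if event["amount"] <= 0:
                raise ValueError("amount must be positive")
            delivered.append(event)
        except (json.JSONDecodeError, KeyError, TypeError, ValueError) as exc:
            # Keep the raw payload and the reason so the event can be inspected and replayed.
            dead_letters.append({"raw": raw, "error": repr(exc)})
    return delivered, dead_letters


events = [
    '{"user": "a", "amount": 30}',
    "not json at all",
    '{"user": "b"}',
    '{"user": "c", "amount": 12}',
]
delivered, dead_letters = process_stream(events)
print(len(delivered), len(dead_letters))  # 2 2
```

The point of the design is that one bad event does not stop the flow. The good events keep moving, and the bad ones wait somewhere safe until a human or a retry job deals with them.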

The Backend Developer's Bridge: CDC

If you come from the backend world, you know databases (Postgres, MySQL). You know they keep a transaction log (the WAL in Postgres, the binlog in MySQL) that records every insert, update, and delete for crash recovery and replication.

The smartest way to do ingestion today is Change Data Capture (CDC).

It’s a hybrid. You read that transaction log as a stream. Instead of repeatedly querying the tables, you quietly tail the log in near real time. The impact on production performance is low (though not zero: the database has to retain log segments until your consumer has read them), and you get streaming data.

It's the best of both worlds.
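On the consuming side, CDC boils down to replaying a list of change events in order. Here is a minimal sketch with a made-up event shape (`op`, `id`, `row`); real CDC tools emit richer, tool-specific formats, so treat the field names and the `apply_change` helper as assumptions for illustration only.

```python
from typing import Any


def apply_change(replica: dict, change: dict) -> None:
    """Apply one change event to an in-memory copy of the table, keyed by primary key."""
    op, key = change["op"], change["id"]
    if op in ("insert", "update"):
        # Merge changed columns over whatever we already hold for this row.
        replica[key] = {**replica.get(key, {}), **change["row"]}
    elif op == "delete":
        replica.pop(key, None)
    else:
        raise ValueError(f"unknown op: {op}")


# A hypothetical slice of the change stream, in commit order.
change_log: list = [
    {"op": "insert", "id": 1, "row": {"status": "pending", "amount": 30}},
    {"op": "insert", "id": 2, "row": {"status": "pending", "amount": 12}},
    {"op": "update", "id": 1, "row": {"status": "paid"}},
    {"op": "delete", "id": 2},
]

replica: dict = {}
for change in change_log:
    apply_change(replica, change)

print(replica)  # {1: {'status': 'paid', 'amount': 30}}
```

Notice that the destination never queried the source table. It rebuilt the current state purely from the ordered log of changes, which is why ordering matters so much in CDC pipelines.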

The Library: Required Reading

The Log: What every software engineer should know about real-time data's unifying abstraction
This is a legendary blog post by Jay Kreps (co-creator of Kafka and co-founder of Confluent). It explains why the "commit log" is the heart of streaming. It's long, but essential.
Link: https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying

"Batch vs. Streaming" (Fundamentals of Data Engineering)
Chapter 3 of the book we recommended yesterday goes deep into the trade-offs between these two fundamentally different approaches to time.

The Takeaway

Batch is a cargo ship: Efficient, high volume, slow. Streaming is a water pipe: Immediate, complex, fast.

Don't let anyone tell you one is "better." It depends entirely on how fast the business needs the answer.

Tomorrow, we look at where that ship (or pipe) is heading: Storage.

See you tomorrow.
Harsh Kathiriya - Query & Context

How to extract data without crashing production
