The "Copy-Paste" Problem.

Welcome to Part 1 of Data Engineering 101. Before we can analyze data, we have to move it from its source (where it's created) to a destination (where it's analyzed).

This sounds easy. "Just write a script to query the database and save it as a CSV."

Here is what happens when you do that in the real world:

  1. You write a script to SELECT * FROM transactions on the production e-commerce database.

  2. The full-table scan hogs I/O and, depending on the engine and isolation level, holds locks that block writes.

  3. Customers can't buy anything.

  4. The CTO calls you.

Data Ingestion is the art of moving data from Point A to Point B reliably, efficiently, and—most importantly—without impacting the source system.
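
One gentler alternative to the single giant SELECT is to read the table in small, ordered pages. Here is a minimal sketch, using Python's built-in sqlite3 as a stand-in for the production database. The table layout, the column names, and the helpers read_in_pages and build_demo_db are all made up for illustration, and the code assumes positive integer ids. In practice you would probably also point this at a read replica rather than the primary.

```python
import sqlite3
from typing import Iterator


def read_in_pages(conn: sqlite3.Connection, page_size: int = 1000) -> Iterator[list]:
    """Yield rows from `transactions` in small pages, ordered by primary key.

    Each query is short and bounded, so no single statement has to scan
    the whole table in one go (keyset pagination).
    """
    last_id = 0  # assumption: ids are positive integers
    while True:
        rows = conn.execute(
            "SELECT id, amount FROM transactions WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, page_size),
        ).fetchall()
        if not rows:
            return
        yield rows
        last_id = rows[-1][0]  # resume after the last id we saw


def build_demo_db(n_rows: int) -> sqlite3.Connection:
    """Create an in-memory table with n_rows fake transactions (demo only)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE transactions (id INTEGER PRIMARY KEY, amount REAL)")
    conn.executemany(
        "INSERT INTO transactions (id, amount) VALUES (?, ?)",
        [(i, i * 1.5) for i in range(1, n_rows + 1)],
    )
    conn.commit()
    return conn


if __name__ == "__main__":
    demo = build_demo_db(2500)
    for page in read_in_pages(demo, page_size=1000):
        print(f"fetched {len(page)} rows, last id {page[-1][0]}")
```

Paging by primary key rather than by OFFSET keeps every query cheap, because the database can jump straight to the next id instead of re-counting the rows it already returned.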

To do this, you have two fundamental architectural choices.

The Great Debate: Batch vs. Streaming

Forget the tools for a second. Forget Kafka and Spark. Ingestion comes down to one question: How much latency can you tolerate?

Do you need the data now, or is tomorrow morning okay?

1. The Cargo Ship: Batch Processing

Think of Batch Ingestion like a massive cargo container ship.

  • How it works: It sits at the dock, waiting for data to pile up. Once a day (or hour), it loads everything onboard and sets sail. It moves a massive amount of stuff all at once.

  • The Vibe: "Give me a file containing everything that happened yesterday."

  • Pros: It's efficient. It's easier to manage. If it fails, you can usually just rerun the job (as long as a rerun overwrites its output instead of duplicating it).

  • Cons: High latency. Your dashboard is always looking at yesterday's news.

  • Use Case: End-of-day financial reconciliation, training large ML models.
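
To make the cargo ship concrete, here is a rough sketch of a daily batch export that is safe to rerun. It again uses sqlite3 as a stand-in database; the created_at column (assumed to hold ISO-8601 text), the file naming scheme, and the export_day helper are assumptions for illustration only.

```python
import csv
import os
import sqlite3
from datetime import date, timedelta


def export_day(conn: sqlite3.Connection, day: date, out_dir: str) -> str:
    """Write every transaction created on `day` to one CSV file.

    The output path depends only on the day, and the file is swapped in
    with os.replace, so rerunning the job overwrites the previous attempt
    instead of appending duplicates.
    """
    start = day.isoformat()
    end = (day + timedelta(days=1)).isoformat()
    # Assumption: created_at is ISO-8601 text, so string comparison
    # matches chronological order.
    rows = conn.execute(
        "SELECT id, created_at, amount FROM transactions "
        "WHERE created_at >= ? AND created_at < ? ORDER BY id",
        (start, end),
    ).fetchall()

    final_path = os.path.join(out_dir, f"transactions_{start}.csv")
    tmp_path = final_path + ".tmp"
    with open(tmp_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["id", "created_at", "amount"])
        writer.writerows(rows)
    os.replace(tmp_path, final_path)  # readers never see a half-written file
    return final_path
```

A scheduler would call something like export_day(conn, date.today() - timedelta(days=1), "/some/output/dir") once a day. If the job dies halfway, only the .tmp file is affected, and the next run simply starts over.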

2. The Water Pipe: Streaming

Think of Streaming Ingestion like turning on a faucet.

  • How it works: As soon as a drop of water (a data event) appears at the source, it immediately flows down the pipe to the destination. It never stops.

  • The Vibe: "Tell me right now the exact second a user clicks 'Buy'."

  • Pros: Real-time insights. Immediate reaction to fraud or failures.

  • Cons: High complexity. If the pipe bursts, water goes everywhere, and you can't just "rerun the water." You need robust error handling, such as dead-letter queues that set failed events aside for later inspection.

  • Use Case: Fraud detection, live inventory counts, cybersecurity alerting.
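
The dead-letter idea is easier to see in code. Below is a toy sketch in which a plain Python list plays the role of the stream and a deque plays the role of the dead-letter queue. The consume function and the event fields are invented for illustration; a real pipeline would read from a message broker instead.

```python
import json
from collections import deque
from typing import Callable, Iterable


def consume(
    events: Iterable[str],
    handle: Callable[[dict], None],
    dead_letters: deque,
) -> int:
    """Process events one at a time; park bad ones instead of stopping.

    Returns the number of events handled successfully. Events that fail
    to parse, or that the handler rejects, go to `dead_letters` along
    with the error, so the stream keeps flowing.
    """
    processed = 0
    for raw in events:
        try:
            event = json.loads(raw)  # raises ValueError on malformed JSON
            handle(event)            # may raise KeyError/TypeError on bad shape
            processed += 1
        except (ValueError, KeyError, TypeError) as exc:
            dead_letters.append({"raw": raw, "error": repr(exc)})
    return processed


if __name__ == "__main__":
    totals: dict = {}

    def add_purchase(event: dict) -> None:
        # Hypothetical handler: sum purchase amounts per user.
        totals[event["user"]] = totals.get(event["user"], 0) + event["amount"]

    stream = [
        '{"user": "a", "amount": 5}',
        "not json",
        '{"user": "a"}',
        '{"user": "b", "amount": 2}',
    ]
    dlq: deque = deque()
    print(consume(stream, add_purchase, dlq), "processed,", len(dlq), "dead-lettered")
```

The point of the design is that one poisoned event never blocks the ones behind it. Someone can inspect the dead-letter queue later, fix the cause, and replay just those events.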

The Backend Developer's Bridge: CDC

If you come from the backend world, you know databases (Postgres, MySQL). You know they keep a transaction log (the WAL in Postgres, the binlog in MySQL) that records every insert, update, and delete for crash recovery and replication.

The smartest way to do ingestion today is Change Data Capture (CDC).

It's a hybrid. You read that transaction log as a stream. Instead of repeatedly querying the database and competing with production traffic, you quietly tail the log in real time. The impact on production performance is typically small, yet you still get streaming data.

It's the best of both worlds.
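
To get a feel for what a CDC consumer does with that stream, here is a deliberately simplified sketch that replays change events into a dictionary acting as the downstream copy. The event shape (op, key, row) and the apply_changes function are assumptions made up for this example, not the format of any particular CDC tool.

```python
from typing import Iterable


def apply_changes(replica: dict, changes: Iterable[dict]) -> dict:
    """Replay an ordered list of change events into `replica`.

    Inserts and updates are both treated as upserts, and deleting a
    missing key is a no-op, so replaying the same log twice ends in the
    same state.
    """
    for change in changes:
        op, key = change["op"], change["key"]
        if op in ("insert", "update"):
            replica[key] = change["row"]  # upsert: the last write wins
        elif op == "delete":
            replica.pop(key, None)        # tolerate already-deleted keys
        else:
            raise ValueError(f"unknown op: {op!r}")
    return replica


if __name__ == "__main__":
    log = [
        {"op": "insert", "key": 1, "row": {"status": "pending"}},
        {"op": "insert", "key": 2, "row": {"status": "pending"}},
        {"op": "update", "key": 1, "row": {"status": "paid"}},
        {"op": "delete", "key": 2},
    ]
    print(apply_changes({}, log))
```

Notice that the source database never sees a query here. The consumer only needs the ordered log of changes to rebuild the current state on the other side.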

The Takeaway

Batch is a cargo ship: Efficient, high volume, slow. Streaming is a water pipe: Immediate, complex, fast.

Don't let anyone tell you one is "better." It depends entirely on how fast the business needs the answer.

Tomorrow, we look at where that ship (or pipe) is heading: Storage.

See you tomorrow.
Harsh Kathiriya - Query & Context
