The "Copy-Paste" Problem
Welcome to Part 1 of Data Engineering 101. Before we can analyze data, we have to move it from its source (where it's created) to a destination (where it's analyzed).
This sounds easy. "Just write a script to query the database and save it as a CSV."
Here is what happens when you do that in the real world:
You write a script to run SELECT * FROM transactions on the production e-commerce database.
The query locks the table.
Customers can't buy anything.
The CTO calls you.
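One common way to soften that script is to read in small chunks ordered by primary key (keyset pagination), so no single query holds the database for long. The sketch below is only an illustration: it uses an in-memory SQLite database as a stand-in for production, and the `id` and `amount` columns and the `read_in_chunks` helper are assumptions, not a real schema.

```python
import sqlite3


def read_in_chunks(conn, chunk_size=1000):
    """Yield rows from `transactions` in primary-key order, one small chunk at a time."""
    last_id = 0  # assumes ids are positive integers
    while True:
        rows = conn.execute(
            "SELECT id, amount FROM transactions WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, chunk_size),
        ).fetchall()
        if not rows:
            return
        yield rows
        # Resume after the last id we saw instead of re-scanning from the start.
        last_id = rows[-1][0]


# Stand-in for the production database (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transactions (id INTEGER PRIMARY KEY, amount REAL)")
conn.executemany(
    "INSERT INTO transactions (id, amount) VALUES (?, ?)",
    [(i, i * 1.5) for i in range(1, 26)],
)

chunks = list(read_in_chunks(conn, chunk_size=10))
```

In practice you would also point a script like this at a read replica rather than the primary, and pause between chunks if the source is under load.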
Data Ingestion is the art of moving data from Point A to Point B reliably, efficiently, and—most importantly—without impacting the source system.
To do this, you have two fundamental architectural choices.
The Great Debate: Batch vs. Streaming
Forget the tools for a second. Forget Kafka and Spark. Ingestion comes down to one question: How much latency can you tolerate?
Do you need the data now, or is tomorrow morning okay?
1. The Cargo Ship: Batch Processing
Think of Batch Ingestion like a massive cargo container ship.
How it works: It sits at the dock, waiting for data to pile up. Once a day (or hour), it loads everything onboard and sets sail. It moves a massive amount of stuff all at once.
The Vibe: "Give me a file containing everything that happened yesterday."
Pros: It's efficient. It's easier to manage. If it fails, you just rerun the job.
Cons: High latency. Your dashboard is always looking at yesterday's news.
Use Case: End-of-day financial reconciliation, training large ML models.
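To make the cargo ship concrete, here is a minimal sketch of a daily batch job. It assumes events are plain dictionaries with hypothetical `id`, `ts`, and `amount` fields, and it writes to an in-memory buffer instead of a real file. The point to notice is that the job is defined by a date window, so rerunning it for the same day produces the same output.

```python
import csv
import io
from datetime import date, datetime, timedelta


def extract_batch(events, day):
    """Return every event whose timestamp falls inside `day` (half-open window)."""
    start = datetime.combine(day, datetime.min.time())
    end = start + timedelta(days=1)
    return [e for e in events if start <= e["ts"] < end]


def write_csv(rows, out):
    """Write the batch as CSV with a fixed column order."""
    writer = csv.DictWriter(out, fieldnames=["id", "ts", "amount"])
    writer.writeheader()
    for row in rows:
        writer.writerow({**row, "ts": row["ts"].isoformat()})


def run_job(events, day):
    """One full 'sailing': extract the day's events and serialize them."""
    buffer = io.StringIO()
    write_csv(extract_batch(events, day), buffer)
    return buffer.getvalue()


# Hypothetical source data spanning three days.
events = [
    {"id": 1, "ts": datetime(2023, 12, 31, 23, 59), "amount": 10.0},
    {"id": 2, "ts": datetime(2024, 1, 1, 0, 0), "amount": 20.0},
    {"id": 3, "ts": datetime(2024, 1, 1, 18, 30), "amount": 30.0},
    {"id": 4, "ts": datetime(2024, 1, 2, 0, 0), "amount": 40.0},
]

batch = extract_batch(events, date(2024, 1, 1))
first_run = run_job(events, date(2024, 1, 1))
second_run = run_job(events, date(2024, 1, 1))  # the "just rerun the job" case
```

Because the window is half-open, an event at exactly midnight lands in one day's batch and never in two.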
2. The Water Pipe: Streaming
Think of Streaming Ingestion like turning on a faucet.
How it works: As soon as a drop of water (a data event) appears at the source, it immediately flows down the pipe to the destination. It never stops.
The Vibe: "Tell me right now the exact second a user clicks 'Buy'."
Pros: Real-time insights. Immediate reaction to fraud or failures.
Cons: High complexity. If the pipe bursts, water goes everywhere, and you can't just "rerun the water." You need robust error handling (dead letter queues).
Use Case: Fraud detection, live inventory counts, cybersecurity alerting.
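The dead letter queue idea can be sketched in a few lines. This is a toy version under stated assumptions: a Python list stands in for the stream, another list stands in for the dead letter queue, and the `consume` and `add_to_totals` names and the JSON event shape are made up for illustration. A real system would use a broker and a durable queue.

```python
import json


def consume(stream, handle, dead_letters):
    """Process events one at a time; park failures instead of stopping the stream."""
    processed = 0
    for raw in stream:
        try:
            event = json.loads(raw)
            handle(event)
            processed += 1
        except (json.JSONDecodeError, KeyError, TypeError) as exc:
            # Keep the original payload and the reason so it can be inspected
            # and replayed later.
            dead_letters.append({"raw": raw, "error": repr(exc)})
    return processed


totals = {}


def add_to_totals(event):
    """Hypothetical handler: running spend per user."""
    totals[event["user"]] = totals.get(event["user"], 0) + event["amount"]


incoming = [
    '{"user": "a", "amount": 5}',
    "not json",                    # malformed payload
    '{"user": "b"}',               # missing the amount field
    '{"user": "a", "amount": 7}',
]

dead_letters = []
processed = consume(iter(incoming), add_to_totals, dead_letters)
```

The two bad events end up in `dead_letters` while the good ones keep flowing, which is the behavior you want when you cannot rerun the water.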
The Backend Developer's Bridge: CDC
If you come from the backend world, you know databases (Postgres, MySQL). You know they keep a transaction log (the WAL in Postgres, the binlog in MySQL) that records every insert, update, and delete for crash recovery and replication.
The smartest way to do ingestion today is Change Data Capture (CDC).
It's a hybrid. You read that transaction log as a stream. Instead of querying the database and locking tables, you quietly tail the log in real time. It puts very little load on production, but still gives you streaming data.
It's the best of both worlds.
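Here is a rough sketch of what the consuming side of CDC does: replay an ordered list of change events into a copy of the table, and remember how far it has read. The event shape (`op`, `key`, `row`) is a simplified assumption for illustration and is not the actual WAL or binlog format, and a plain list stands in for the log.

```python
def apply_change(replica, change):
    """Apply a single change event to an in-memory copy of the table."""
    op, key = change["op"], change["key"]
    if op in ("insert", "update"):
        replica[key] = change["row"]
    elif op == "delete":
        replica.pop(key, None)
    else:
        raise ValueError(f"unknown op: {op!r}")


def replay(log, replica, from_position=0):
    """Apply log entries in order; return the next position to read from."""
    position = from_position
    for change in log[from_position:]:
        apply_change(replica, change)
        position += 1
    return position


# Hypothetical change log for an orders table.
log = [
    {"op": "insert", "key": 1, "row": {"status": "pending"}},
    {"op": "insert", "key": 2, "row": {"status": "pending"}},
    {"op": "update", "key": 1, "row": {"status": "paid"}},
]

replica = {}
position = replay(log, replica)            # catch up on everything so far

log.append({"op": "delete", "key": 2})     # the source keeps writing
position = replay(log, replica, position)  # resume from the saved position
```

The saved position is what lets a reader stop and resume without querying the source tables again.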
The Library: Required Reading
The Log: What every software engineer should know about real-time data's unifying abstraction
This is a legendary blog post by Jay Kreps (co-founder of Confluent/Kafka). It explains why the "commit log" is the heart of streaming. It's long, but essential.
Link: https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying
"Batch vs. Streaming" (Fundamentals of Data Engineering)
Chapter 3 of the book we recommended yesterday goes deep into the trade-offs between these two fundamentally different approaches to time.
The Takeaway
Batch is a cargo ship: Efficient, high volume, slow. Streaming is a water pipe: Immediate, complex, fast.
Don't let anyone tell you one is "better." It depends entirely on how fast the business needs the answer.
Tomorrow, we look at where that ship (or pipe) is heading: Storage.
See you tomorrow.
Harsh Kathiriya - Query & Context

