The Geometry of Data.
Welcome to Part 2 of Data Engineering 101.
Yesterday, we moved data from the source to our system (Ingestion). Now, we have to store it.
If you come from a web development background, your muscle memory says: "Spin up a Postgres or MySQL instance."
For Data Engineering, this is often the wrong move. To understand why, we need to look at the physics of the hard drive.
1. The Problem: The Library Scan
Imagine you have a physical library representing your database. Each book is a "User." Inside the book, you have lines for: Name, Email, Address, Last_Purchase_Price.
The Row-Store (Postgres/MySQL) These databases write data "horizontally." It’s like placing the books on the shelf in order of ID. If you want to calculate the Total Revenue (SUM(Price)), you have to:
Pull the first book off the shelf.
Open it.
Scan past the Name, Email, and Address to find the Price at the bottom.
Write it down.
Repeat for 1 Billion Books.
You are reading terabytes of useless data (Names, Emails) just to add up one column of numbers. This is why your analytical queries are slow.
2. The Solution: The Column-Store
Data Warehouses (Snowflake, BigQuery) and modern file formats (Parquet) use Columnar Storage.
Instead of storing "Books" (Users), they store "Lists."
List 1: All Names.
List 2: All Emails.
List 3: All Prices.
The same query (SUM(Price)):
The database ignores List 1 and List 2 completely.
It grabs only List 3 (Prices).
It sums them up in a continuous stream.
The Result:
Speed: Queries run 10x–100x faster because the disk head doesn't have to jump around.
Cost: "Prices" are all numbers. We can compress a list of numbers much better than a mix of text and numbers. This reduces storage costs by ~60%.
3. The Format: CSV vs. Parquet
This brings us to the most common rookie mistake in Data Engineering: Using CSVs for everything.
CSV (Comma Separated Values): This is a Row-Store format. It is text-based, heavy, and slow to parse. It is meant for humans, not computers.
Parquet: This is a Column-Store format. It is binary, compressed, and retains the "Schema" (it knows that
Priceis an Integer, not a String).
The Rule: Use CSVs to look at a sample of 10 rows in Excel. Use Parquet to store 10 million rows in your pipeline.
Hands-On Resource: The "Zoomcamp"
I promised to help you build real skills. The best way to understand this is to do it.
The Resource: DataTalksClub Data Engineering Zoomcamp (Week 1) This is arguably the best free Data Engineering course on the internet.
The Project: In Week 1, they have a Docker setup that ingests NY Taxi data (Postgres) and introduces you to the environment.
The Specific Lab: Look at their section on "Docker + SQL". It forces you to deal with ingestion types.
Challenge: Clone the repo. Try to run the docker-compose.yaml file. Even if you don't finish the code, just getting the environment running is a massive win.
The Takeaway
Standard Databases (Row-Stores) are optimized for writing single records (Transactions). Analytical Databases (Column-Stores) are optimized for reading massive groups of records (Analytics).
Don't use a hammer to drive a screw.
Tomorrow, we zoom out to look at the building where we keep these files: The Data Lake vs. The Warehouse.
See you tomorrow.
Harsh Kathiriya - Query & Context

