The Geometry of Data.

Welcome to Part 2 of Data Engineering 101.

Yesterday, we moved data from the source to our system (Ingestion). Now, we have to store it.

If you come from a web development background, your muscle memory says: "Spin up a Postgres or MySQL instance."

For Data Engineering, this is often the wrong move. To understand why, we need to look at the physics of the hard drive.

1. The Problem: The Library Scan

Imagine you have a physical library representing your database. Each book is a "User." Inside the book, you have lines for: Name, Email, Address, Last_Purchase_Price.

The Row-Store (Postgres/MySQL) These databases write data "horizontally." It’s like placing the books on the shelf in order of ID. If you want to calculate the Total Revenue (SUM(Price)), you have to:

  1. Pull the first book off the shelf.

  2. Open it.

  3. Scan past the Name, Email, and Address to find the Price at the bottom.

  4. Write it down.

  5. Repeat for 1 Billion Books.

You are reading terabytes of useless data (Names, Emails) just to add up one column of numbers. This is why your analytical queries are slow.

2. The Solution: The Column-Store

Data Warehouses (Snowflake, BigQuery) and modern file formats (Parquet) use Columnar Storage.

Instead of storing "Books" (Users), they store "Lists."

  • List 1: All Names.

  • List 2: All Emails.

  • List 3: All Prices.

The same query (SUM(Price)):

  1. The database ignores List 1 and List 2 completely.

  2. It grabs only List 3 (Prices).

  3. It sums them up in a continuous stream.

The Result:

  • Speed: Queries run 10x–100x faster because the disk head doesn't have to jump around.

  • Cost: "Prices" are all numbers. We can compress a list of numbers much better than a mix of text and numbers. This reduces storage costs by ~60%.

3. The Format: CSV vs. Parquet

This brings us to the most common rookie mistake in Data Engineering: Using CSVs for everything.

  • CSV (Comma Separated Values): This is a Row-Store format. It is text-based, heavy, and slow to parse. It is meant for humans, not computers.

  • Parquet: This is a Column-Store format. It is binary, compressed, and retains the "Schema" (it knows that Price is an Integer, not a String).

The Rule: Use CSVs to look at a sample of 10 rows in Excel. Use Parquet to store 10 million rows in your pipeline.

Hands-On Resource: The "Zoomcamp"

I promised to help you build real skills. The best way to understand this is to do it.

The Resource: DataTalksClub Data Engineering Zoomcamp (Week 1) This is arguably the best free Data Engineering course on the internet.

  • The Project: In Week 1, they have a Docker setup that ingests NY Taxi data (Postgres) and introduces you to the environment.

  • The Specific Lab: Look at their section on "Docker + SQL". It forces you to deal with ingestion types.

Challenge: Clone the repo. Try to run the docker-compose.yaml file. Even if you don't finish the code, just getting the environment running is a massive win.

The Takeaway

Standard Databases (Row-Stores) are optimized for writing single records (Transactions). Analytical Databases (Column-Stores) are optimized for reading massive groups of records (Analytics).

Don't use a hammer to drive a screw.

Tomorrow, we zoom out to look at the building where we keep these files: The Data Lake vs. The Warehouse.

See you tomorrow.
Harsh Kathiriya - Query & Context

Keep Reading