Query & Context

The Geometry of Data.

Welcome to Part 2 of Data Engineering 101.

Yesterday, we moved data from the source to our system (Ingestion). Now, we have to store it.

If you come from a web development background, your muscle memory says: "Spin up a Postgres or MySQL instance."

For Data Engineering, this is often the wrong move. To understand why, we need to look at the physics of the hard drive.

1. The Problem: The Library Scan

Imagine you have a physical library representing your database. Each book is a "User." Inside the book, you have lines for: Name, Email, Address, Last_Purchase_Price.

The Row-Store (Postgres/MySQL) These databases write data "horizontally." It’s like placing the books on the shelf in order of ID. If you want to calculate the Total Revenue (SUM(Price)), you have to:

Pull the first book off the shelf.
Open it.
Scan past the Name, Email, and Address to find the Price at the bottom.
Write it down.
Repeat for 1 Billion Books.

You are reading terabytes of useless data (Names, Emails) just to add up one column of numbers. This is why your analytical queries are slow.

2. The Solution: The Column-Store

Data Warehouses (Snowflake, BigQuery) and modern file formats (Parquet) use Columnar Storage.

Instead of storing "Books" (Users), they store "Lists."

List 1: All Names.
List 2: All Emails.
List 3: All Prices.

The same query (SUM(Price)):

The database ignores List 1 and List 2 completely.
It grabs only List 3 (Prices).
It sums them up in a continuous stream.

The Result:

Speed: Queries run 10x–100x faster because the disk head doesn't have to jump around.
Cost: "Prices" are all numbers. We can compress a list of numbers much better than a mix of text and numbers. This reduces storage costs by ~60%.

3. The Format: CSV vs. Parquet

This brings us to the most common rookie mistake in Data Engineering: Using CSVs for everything.

CSV (Comma Separated Values): This is a Row-Store format. It is text-based, heavy, and slow to parse. It is meant for humans, not computers.
Parquet: This is a Column-Store format. It is binary, compressed, and retains the "Schema" (it knows that Price is an Integer, not a String).

The Rule: Use CSVs to look at a sample of 10 rows in Excel. Use Parquet to store 10 million rows in your pipeline.

Hands-On Resource: The "Zoomcamp"

I promised to help you build real skills. The best way to understand this is to do it.

The Resource: DataTalksClub Data Engineering Zoomcamp (Week 1) This is arguably the best free Data Engineering course on the internet.

The Project: In Week 1, they have a Docker setup that ingests NY Taxi data (Postgres) and introduces you to the environment.
The Specific Lab: Look at their section on "Docker + SQL". It forces you to deal with ingestion types.

Repository: github.com/DataTalksClub/data-engineering-zoomcamp

Challenge: Clone the repo. Try to run the docker-compose.yaml file. Even if you don't finish the code, just getting the environment running is a massive win.

The Takeaway

Standard Databases (Row-Stores) are optimized for writing single records (Transactions). Analytical Databases (Column-Stores) are optimized for reading massive groups of records (Analytics).

Don't use a hammer to drive a screw.

Tomorrow, we zoom out to look at the building where we keep these files: The Data Lake vs. The Warehouse.

See you tomorrow.
Harsh Kathiriya - Query & Context

Why you shouldn't use Postgres for analytics

Query & Context

The Geometry of Data.

1. The Problem: The Library Scan

2. The Solution: The Column-Store

3. The Format: CSV vs. Parquet

Hands-On Resource: The "Zoomcamp"

The Takeaway

Keep Reading

Query & Context