Welcome to the next part of Data Engineering 101.

So far, we have built a pipeline that streams data into a Data Lakehouse. Now, we have a choice. The raw data is messy (JSON blobs, missing fields, duplicates). Do we clean it before we save it to the database, or after?

For 20 years, we did it before. We were wrong.

The Old Way: ETL (Extract, Transform, Load)

In the 90s, storage was expensive ($50,000 per TB). You couldn't afford to store "Junk." So you had to clean the data in transit.

  1. Extract: Pull data from Source.

  2. Transform: Clean it in memory (Python/Java script). <-- The Danger Zone

  3. Load: Save the clean result to the Warehouse.

The Fatal Flaw: If your Transformation script has a bug (e.g., it accidentally deletes all users from "Canada"), the data is gone forever. You didn't save the raw input. You only saved the broken output. You cannot "replay" the past.

The Modern Way: ELT (Extract, Load, Transform)

Today, storage is cheap ($20 per TB on S3). We don't care about saving space. We care about saving history.

  1. Extract: Pull data from Source.

  2. Load: Dump the RAW, messy data directly into the Lakehouse.

  3. Transform: Use SQL to clean it inside the database.

The Superpower: If your cleaning script has a bug, who cares? You still have the raw data saved in Step 2. You just fix the script and run it again. ELT turns "Data Engineering" from a high-stakes surgery into a reversible experiment.

The Architecture: Bronze, Silver, Gold

This shift to ELT gave birth to the "Medallion Architecture" (popularized by Databricks), which is the standard for modern pipelines:

  • 🥉 Bronze Layer (Raw): The exact copy of the source data. Messy, duplicated, ugly. We never delete this.

  • 🥈 Silver Layer (Clean): Deduplicated, formatted, and lightly cleaned. (e.g., Timestamp strings converted to Date objects).

  • 🥇 Gold Layer (Aggregated): Business-level metrics ready for dashboards. (e.g., "Daily Active Users").

Data flows from BronzeSilverGold. If you mess up Gold, you just rebuild it from Silver.

The Library: Required Reading

Have a great day!

Thank you,

Harsh Kathiriya

Keep Reading