Welcome to the next part of Data Engineering 101.
So far, we have built a pipeline that streams data into a Data Lakehouse. Now, we have a choice. The raw data is messy (JSON blobs, missing fields, duplicates). Do we clean it before we save it to the database, or after?
For 20 years, we did it before. We were wrong.
The Old Way: ETL (Extract, Transform, Load)
In the 90s, storage was expensive ($50,000 per TB). You couldn't afford to store "Junk." So you had to clean the data in transit.
Extract: Pull data from Source.
Transform: Clean it in memory (Python/Java script). <-- The Danger Zone
Load: Save the clean result to the Warehouse.
The Fatal Flaw: If your Transformation script has a bug (e.g., it accidentally deletes all users from "Canada"), the data is gone forever. You didn't save the raw input. You only saved the broken output. You cannot "replay" the past.
The Modern Way: ELT (Extract, Load, Transform)
Today, storage is cheap ($20 per TB on S3). We don't care about saving space. We care about saving history.
Extract: Pull data from Source.
Load: Dump the RAW, messy data directly into the Lakehouse.
Transform: Use SQL to clean it inside the database.
The Superpower: If your cleaning script has a bug, who cares? You still have the raw data saved in Step 2. You just fix the script and run it again. ELT turns "Data Engineering" from a high-stakes surgery into a reversible experiment.
The Architecture: Bronze, Silver, Gold
This shift to ELT gave birth to the "Medallion Architecture" (popularized by Databricks), which is the standard for modern pipelines:
🥉 Bronze Layer (Raw): The exact copy of the source data. Messy, duplicated, ugly. We never delete this.
🥈 Silver Layer (Clean): Deduplicated, formatted, and lightly cleaned. (e.g., Timestamp strings converted to Date objects).
🥇 Gold Layer (Aggregated): Business-level metrics ready for dashboards. (e.g., "Daily Active Users").
Data flows from Bronze → Silver → Gold. If you mess up Gold, you just rebuild it from Silver.
The Library: Required Reading
dbt (data build tool) dbt is the tool that killed ETL. It allows you to run SQL
SELECTstatements to transform data inside your warehouse. It is the standard for ELT. Link: getdbt.com/product/what-is-dbt"The Medallion Architecture" A short read on how to organize your data quality layers. Link: databricks.com/glossary/medallion-architecture
Have a great day!
Thank you,
Harsh Kathiriya
