ETL vs ELT was just the beginning: how to organize raw, staging, and marts layers
How to organize raw, staging, and marts layers in a real data pipeline. A practical follow-up to the ETL vs ELT article.

A few weeks ago I published an article about the differences between ETL and ELT. The most common response was something like: “Okay, I now know when to transform before or after loading… but how do I organize the layers inside the warehouse?”. The question makes perfect sense. Knowing whether you do ETL or ELT is the first decision, but what determines whether your pipeline is maintainable or a slow-motion disaster is how you structure what’s inside.
This article doesn’t rehash the ETL vs ELT comparison. It assumes you already have data arriving at a destination and need to decide what to store, where to transform, and what to expose. The three classic layers---raw, staging, and marts---are the answer that works best in practice. Not because they’re the only option, but because they solve real problems that always show up: data gets corrupted, transformations nobody understands, and metrics that don’t add up because someone accidentally modified the raw data.
Why you need layers and not one giant table
The natural impulse when you start working with data is to create a single table where you dump everything. A CSV that you read, clean, and load. A script that connects to an API, transforms the JSON, and inserts it directly into the final table. It works on day one. By the third month, you have a problem.
The main reason for separating into layers is not aesthetic or academic. It’s that without separation, any error in the transformation destroys the original data and you can’t recover it.
I’ve seen pipelines where someone decided to “clean” the data directly in the only table that existed. When we discovered that the cleaning had a bug that was deleting valid records, there was no way to reconstruct them. The raw data no longer existed. That doesn’t happen if you respect the layers.
Layers solve three concrete problems:
- Immutability of the raw data. You can always go back to the source.
- Separation of concerns. Each layer has a clear purpose.
- Debuggability. If a metric doesn’t add up, you can trace which layer broke it.
The raw layer: the data exactly as it arrives
The raw layer is the warehouse for what you receive. No filters, no transformations, no opinions. If an API returns a JSON with empty fields, duplicates, and timestamps in a weird format, that is exactly what you store.
What to store in raw
- The complete data exactly as it arrives from the source.
- An ingestion timestamp (
_ingested_at) to know when it was received. - Optionally, source metadata: source name, schema version, pipeline execution ID.
What NOT to do in raw
- Don’t clean nulls.
- Don’t change data types.
- Don’t deduplicate.
- Don’t filter records that “look like garbage.”
The temptation to “fix things up a little” in raw is the most common mistake I’ve seen. A colleague once told me: “I just removed the duplicates before storing them, to save space.” The problem came when it turned out those “duplicates” were actually updates of the same record with different timestamps, and by deduplicating we lost the change history.
Example: raw ingestion in SQL
CREATE TABLE raw.orders (
_ingested_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
_source VARCHAR(50) DEFAULT 'api_ecommerce_v2',
_batch_id VARCHAR(36),
payload JSONB -- complete raw data, untouched
);
-- Insertion: we dump the JSON as-is
INSERT INTO raw.orders (_batch_id, payload)
VALUES (
'batch-2026-05-18-001',
'{"order_id": "A1234", "customer": "jdoe", "total": "89.99", "status": "pending", "created_at": "2026-05-17T14:32:00"}'
);Example: raw ingestion in Python
import json
import psycopg2
from datetime import datetime
from uuid import uuid4
def ingest_raw_orders(orders: list[dict], conn):
"""Stores orders exactly as they arrive, without transforming anything."""
batch_id = str(uuid4())
cursor = conn.cursor()
for order in orders:
cursor.execute(
"""
INSERT INTO raw.orders (_batch_id, payload)
VALUES (%s, %s)
""",
(batch_id, json.dumps(order))
)
conn.commit()
print(f"[RAW] Ingested {len(orders)} records | batch: {batch_id}")The raw data is your life insurance. If tomorrow you change the transformation logic, if you discover a bug in staging, if the business redefines a metric, you can always go back to raw and reprocess.
The staging layer: clean, type, deduplicate
Staging is where the data starts taking shape. Here you apply technical transformations that don’t depend on business rules: correct typing, duplicate removal, null handling, format normalization.
What happens in staging
| Operation | Example |
|---|---|
| Typing | Convert "89.99" (string) to 89.99 (decimal) |
| Deduplication | Keep the most recent record per order_id |
| Nulls | Decide whether a null field is discarded, flagged, or filled |
| Normalization | Unify date formats, country codes, statuses |
| Basic validation | Reject records that don’t meet a minimum schema |
What does NOT happen in staging
- You don’t calculate business metrics.
- You don’t aggregate data (no SUM, AVG, COUNT for reports).
- You don’t apply rules that only the business can define.
The boundary between staging and marts is sometimes blurry. My rule is: if any data engineer could do the transformation without knowing the business domain, it goes in staging. If you need to talk to product or finance to know how to calculate it, it goes in marts.
Example: staging transformation in SQL
CREATE TABLE staging.orders AS
SELECT
payload->>'order_id' AS order_id,
payload->>'customer' AS customer_id,
CAST(payload->>'total' AS DECIMAL(10,2)) AS total_amount,
LOWER(payload->>'status') AS status,
CAST(payload->>'created_at' AS TIMESTAMP) AS order_created_at,
_ingested_at,
_batch_id
FROM raw.orders;
-- Deduplication: keep the most recent version of each order
DELETE FROM staging.orders
WHERE ctid NOT IN (
SELECT DISTINCT ON (order_id) ctid
FROM staging.orders
ORDER BY order_id, _ingested_at DESC
);Example: staging transformation in Python
import pandas as pd
def transform_staging_orders(raw_df: pd.DataFrame) -> pd.DataFrame:
"""Cleans and types the raw orders."""
df = raw_df.copy()
# Extract fields from JSON payload
df['order_id'] = df['payload'].apply(lambda x: x.get('order_id'))
df['customer_id'] = df['payload'].apply(lambda x: x.get('customer'))
df['total_amount'] = pd.to_numeric(
df['payload'].apply(lambda x: x.get('total')),
errors='coerce'
)
df['status'] = df['payload'].apply(
lambda x: x.get('status', '').lower()
)
df['order_created_at'] = pd.to_datetime(
df['payload'].apply(lambda x: x.get('created_at')),
errors='coerce'
)
# Remove records without order_id (corrupt data)
df = df.dropna(subset=['order_id'])
# Deduplicate: keep the most recent record per order_id
df = df.sort_values('_ingested_at', ascending=False)
df = df.drop_duplicates(subset=['order_id'], keep='first')
# Select final columns
staging_cols = [
'order_id', 'customer_id', 'total_amount',
'status', 'order_created_at', '_ingested_at', '_batch_id'
]
return df[staging_cols].reset_index(drop=True)In staging the goal is to have clean, typed, deduplicated data. Nothing more and nothing less. If you arrive here with reliable data, marts becomes a much simpler problem.
The marts layer: business models and metrics
Marts is where data becomes answers. Here you create the tables, views, and models the business consumes: dashboards, reports, KPIs, predictive models. The transformations in marts are domain-level, not technical.
What happens in marts
- Business calculations: revenue, churn, LTV, conversion rate.
- Aggregations: sales by month, orders by customer, tickets by category.
- Joins between staging entities to create business views.
- Dimensional models (fact tables, dimension tables) if your warehouse requires it.
Example: daily sales mart
CREATE TABLE marts.daily_sales AS
SELECT
DATE(order_created_at) AS sale_date,
COUNT(*) AS total_orders,
COUNT(DISTINCT customer_id) AS unique_customers,
SUM(total_amount) AS total_revenue,
AVG(total_amount) AS avg_order_value,
SUM(CASE WHEN status = 'completed'
THEN total_amount ELSE 0 END) AS confirmed_revenue
FROM staging.orders
WHERE status IN ('completed', 'pending', 'shipped')
GROUP BY DATE(order_created_at)
ORDER BY sale_date DESC;Example: customer mart for the product team
CREATE VIEW marts.customer_summary AS
SELECT
customer_id,
COUNT(*) AS total_orders,
SUM(total_amount) AS lifetime_value,
MIN(order_created_at) AS first_order_at,
MAX(order_created_at) AS last_order_at,
AVG(total_amount) AS avg_order_value,
CURRENT_DATE - MAX(order_created_at)::date AS days_since_last_order
FROM staging.orders
WHERE status = 'completed'
GROUP BY customer_id;The key difference from staging is that marts has business opinion. “Confirmed revenue” only counts orders with status completed. That rule isn’t technical; it’s a finance decision. If tomorrow they decide that shipped also counts as confirmed revenue, you change the mart, not staging or raw.
The full flow: from source to dashboard
To see it globally, this is the pipeline with the three layers:
Source (API, DB, CSV)
│
▼
┌────────┐
│ RAW │ Raw data + ingestion timestamp
│ │ Immutable, untransformed
└───┬────┘
│
▼
┌──────────┐
│ STAGING │ Typed, deduplicated, nulls handled
│ │ Technical transformations
└────┬─────┘
│
▼
┌────────┐
│ MARTS │ Metrics, KPIs, business views
│ │ Domain transformations
└───┬────┘
│
▼
Dashboard / API / ML ModelOrchestration with a simple Python script
def run_pipeline():
"""Full pipeline: raw -> staging -> marts."""
conn = get_connection()
# 1. Raw: ingest raw data
raw_orders = fetch_orders_from_api()
ingest_raw_orders(raw_orders, conn)
print("[PIPELINE] Raw completed")
# 2. Staging: clean and type
raw_df = pd.read_sql("SELECT * FROM raw.orders", conn)
staging_df = transform_staging_orders(raw_df)
staging_df.to_sql('orders', conn, schema='staging', if_exists='replace', index=False)
print("[PIPELINE] Staging completed")
# 3. Marts: create business views
with conn.cursor() as cur:
cur.execute(open('sql/marts/daily_sales.sql').read())
cur.execute(open('sql/marts/customer_summary.sql').read())
conn.commit()
print("[PIPELINE] Marts completed")
if __name__ == '__main__':
run_pipeline()Common mistakes that break pipelines
After building and maintaining several pipelines with this structure, there are error patterns that keep repeating. The most dangerous ones aren’t technical bugs---they’re design decisions that seem harmless at first.
1. Transforming in raw
The classic. “I just normalize the dates before storing them.” Seems reasonable until you discover that the source changed the date format without notice and your normalization has been discarding valid records for weeks. In raw you don’t touch anything. Period.
2. Mixing staging and marts
Putting business calculations in staging is tempting when you’re in a rush. “Since I’m already cleaning, I’ll also calculate the revenue.” The problem comes when the definition of revenue changes and you have to touch staging, which invalidates every table that depends on it.
3. Not versioning schemas
If your staging table has 12 columns today and tomorrow the source adds 3 new fields, your pipeline breaks. Or worse: it keeps working but silently ignores the new fields. Versioning schemas---even with a document or a YAML file---alerts you when something changes.
# schema_orders_v2.yaml
version: 2
source: api_ecommerce
fields:
- name: order_id
type: string
nullable: false
- name: customer
type: string
nullable: false
- name: total
type: string # arrives as string, typed in staging
nullable: false
- name: status
type: string
nullable: false
allowed_values: [pending, completed, shipped, cancelled]
- name: created_at
type: string # ISO 8601
nullable: false4. Not having _ingested_at in raw
Without an ingestion timestamp you can’t know when a piece of data arrived. If you need to reprocess “the data that arrived yesterday,” without that field you’re lost. It’s a field that costs nothing to add and saves you constantly.
5. Marts tables without business rule documentation
If someone looks at confirmed_revenue in your mart and there’s nowhere that explains it only includes orders with status completed, they’ll assume whatever suits them. Document the business rules next to the SQL or in the schema itself.
When this structure is overkill
Not everything needs three formal layers. If your pipeline is a script that reads a 200-row CSV and loads it into a Google Sheet for a monthly report, setting up raw/staging/marts would be over-engineering.
My criteria for deciding when it’s worth it:
| Situation | Raw/Staging/Marts | A direct script |
|---|---|---|
| More than one data source | Yes | Probably not |
| Data is reprocessed periodically | Yes | No |
| Multiple people consume the metrics | Yes | Depends |
| Raw data can change format | Yes | High risk |
| One-time pipeline, ad-hoc analysis | No | Yes |
| Production dashboard with SLA | Yes | Never |
If your pipeline has more than one source, runs on a recurring basis, and more than one person consumes the results, the three layers will save you trouble. It’s a small investment that scales well.
Relationship with ETL and ELT
If you’re coming from the previous article, here’s how it fits together:
- In a classic ETL flow, staging transformations happen before loading into the warehouse. Raw can be a temporary directory of files or a staging table.
- In a modern ELT flow, everything arrives first at the warehouse (raw), and the staging and marts transformations are done with SQL inside the warehouse itself. Tools like dbt fit perfectly into this model.
The raw/staging/marts structure doesn’t compete with ETL or ELT. It’s complementary. It defines the internal organization of the destination, regardless of how the data arrives.
What I wish I had known earlier
When I set up my first pipeline with these layers, I made almost every mistake I’ve described. What changed my perspective was to stop thinking of the layers as technical bureaucracy and start seeing them as a contract: raw promises immutability, staging promises cleanliness, marts promises business relevance.
Each layer has a single responsibility. And just like in code, when a piece has only one reason to change, the entire system is easier to maintain, debug, and evolve.
If you already have a working pipeline without layers, you don’t need to rewrite it all at once. Start by separating raw. Just that---storing the raw data before touching it---already changes the game.


