ETL vs ELT was just the beginning: how to organize raw, staging, and marts layers

How to organize raw, staging, and marts layers in a real data pipeline. A practical follow-up to the ETL vs ELT article.

Cover for ETL vs ELT was just the beginning: how to organize raw, staging, and marts layers

A few weeks ago I published an article about the differences between ETL and ELT. The most common response was something like: “Okay, I now know when to transform before or after loading… but how do I organize the layers inside the warehouse?”. The question makes perfect sense. Knowing whether you do ETL or ELT is the first decision, but what determines whether your pipeline is maintainable or a slow-motion disaster is how you structure what’s inside.

This article doesn’t rehash the ETL vs ELT comparison. It assumes you already have data arriving at a destination and need to decide what to store, where to transform, and what to expose. The three classic layers---raw, staging, and marts---are the answer that works best in practice. Not because they’re the only option, but because they solve real problems that always show up: data gets corrupted, transformations nobody understands, and metrics that don’t add up because someone accidentally modified the raw data.


Why you need layers and not one giant table

The natural impulse when you start working with data is to create a single table where you dump everything. A CSV that you read, clean, and load. A script that connects to an API, transforms the JSON, and inserts it directly into the final table. It works on day one. By the third month, you have a problem.

The main reason for separating into layers is not aesthetic or academic. It’s that without separation, any error in the transformation destroys the original data and you can’t recover it.

I’ve seen pipelines where someone decided to “clean” the data directly in the only table that existed. When we discovered that the cleaning had a bug that was deleting valid records, there was no way to reconstruct them. The raw data no longer existed. That doesn’t happen if you respect the layers.

Layers solve three concrete problems:

  1. Immutability of the raw data. You can always go back to the source.
  2. Separation of concerns. Each layer has a clear purpose.
  3. Debuggability. If a metric doesn’t add up, you can trace which layer broke it.

The raw layer: the data exactly as it arrives

The raw layer is the warehouse for what you receive. No filters, no transformations, no opinions. If an API returns a JSON with empty fields, duplicates, and timestamps in a weird format, that is exactly what you store.

What to store in raw

  • The complete data exactly as it arrives from the source.
  • An ingestion timestamp (_ingested_at) to know when it was received.
  • Optionally, source metadata: source name, schema version, pipeline execution ID.

What NOT to do in raw

  • Don’t clean nulls.
  • Don’t change data types.
  • Don’t deduplicate.
  • Don’t filter records that “look like garbage.”

The temptation to “fix things up a little” in raw is the most common mistake I’ve seen. A colleague once told me: “I just removed the duplicates before storing them, to save space.” The problem came when it turned out those “duplicates” were actually updates of the same record with different timestamps, and by deduplicating we lost the change history.

Example: raw ingestion in SQL

CREATE TABLE raw.orders (
    _ingested_at    TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    _source         VARCHAR(50) DEFAULT 'api_ecommerce_v2',
    _batch_id       VARCHAR(36),
    payload         JSONB  -- complete raw data, untouched
);

-- Insertion: we dump the JSON as-is
INSERT INTO raw.orders (_batch_id, payload)
VALUES (
    'batch-2026-05-18-001',
    '{"order_id": "A1234", "customer": "jdoe", "total": "89.99", "status": "pending", "created_at": "2026-05-17T14:32:00"}'
);

Example: raw ingestion in Python

import json
import psycopg2
from datetime import datetime
from uuid import uuid4

def ingest_raw_orders(orders: list[dict], conn):
    """Stores orders exactly as they arrive, without transforming anything."""
    batch_id = str(uuid4())
    cursor = conn.cursor()

    for order in orders:
        cursor.execute(
            """
            INSERT INTO raw.orders (_batch_id, payload)
            VALUES (%s, %s)
            """,
            (batch_id, json.dumps(order))
        )

    conn.commit()
    print(f"[RAW] Ingested {len(orders)} records | batch: {batch_id}")

The raw data is your life insurance. If tomorrow you change the transformation logic, if you discover a bug in staging, if the business redefines a metric, you can always go back to raw and reprocess.


The staging layer: clean, type, deduplicate

Staging is where the data starts taking shape. Here you apply technical transformations that don’t depend on business rules: correct typing, duplicate removal, null handling, format normalization.

What happens in staging

OperationExample
TypingConvert "89.99" (string) to 89.99 (decimal)
DeduplicationKeep the most recent record per order_id
NullsDecide whether a null field is discarded, flagged, or filled
NormalizationUnify date formats, country codes, statuses
Basic validationReject records that don’t meet a minimum schema

What does NOT happen in staging

  • You don’t calculate business metrics.
  • You don’t aggregate data (no SUM, AVG, COUNT for reports).
  • You don’t apply rules that only the business can define.

The boundary between staging and marts is sometimes blurry. My rule is: if any data engineer could do the transformation without knowing the business domain, it goes in staging. If you need to talk to product or finance to know how to calculate it, it goes in marts.

Example: staging transformation in SQL

CREATE TABLE staging.orders AS
SELECT
    payload->>'order_id'                          AS order_id,
    payload->>'customer'                          AS customer_id,
    CAST(payload->>'total' AS DECIMAL(10,2))      AS total_amount,
    LOWER(payload->>'status')                     AS status,
    CAST(payload->>'created_at' AS TIMESTAMP)     AS order_created_at,
    _ingested_at,
    _batch_id
FROM raw.orders;

-- Deduplication: keep the most recent version of each order
DELETE FROM staging.orders
WHERE ctid NOT IN (
    SELECT DISTINCT ON (order_id) ctid
    FROM staging.orders
    ORDER BY order_id, _ingested_at DESC
);

Example: staging transformation in Python

import pandas as pd

def transform_staging_orders(raw_df: pd.DataFrame) -> pd.DataFrame:
    """Cleans and types the raw orders."""
    df = raw_df.copy()

    # Extract fields from JSON payload
    df['order_id'] = df['payload'].apply(lambda x: x.get('order_id'))
    df['customer_id'] = df['payload'].apply(lambda x: x.get('customer'))
    df['total_amount'] = pd.to_numeric(
        df['payload'].apply(lambda x: x.get('total')),
        errors='coerce'
    )
    df['status'] = df['payload'].apply(
        lambda x: x.get('status', '').lower()
    )
    df['order_created_at'] = pd.to_datetime(
        df['payload'].apply(lambda x: x.get('created_at')),
        errors='coerce'
    )

    # Remove records without order_id (corrupt data)
    df = df.dropna(subset=['order_id'])

    # Deduplicate: keep the most recent record per order_id
    df = df.sort_values('_ingested_at', ascending=False)
    df = df.drop_duplicates(subset=['order_id'], keep='first')

    # Select final columns
    staging_cols = [
        'order_id', 'customer_id', 'total_amount',
        'status', 'order_created_at', '_ingested_at', '_batch_id'
    ]
    return df[staging_cols].reset_index(drop=True)

In staging the goal is to have clean, typed, deduplicated data. Nothing more and nothing less. If you arrive here with reliable data, marts becomes a much simpler problem.


The marts layer: business models and metrics

Marts is where data becomes answers. Here you create the tables, views, and models the business consumes: dashboards, reports, KPIs, predictive models. The transformations in marts are domain-level, not technical.

What happens in marts

  • Business calculations: revenue, churn, LTV, conversion rate.
  • Aggregations: sales by month, orders by customer, tickets by category.
  • Joins between staging entities to create business views.
  • Dimensional models (fact tables, dimension tables) if your warehouse requires it.

Example: daily sales mart

CREATE TABLE marts.daily_sales AS
SELECT
    DATE(order_created_at)              AS sale_date,
    COUNT(*)                            AS total_orders,
    COUNT(DISTINCT customer_id)         AS unique_customers,
    SUM(total_amount)                   AS total_revenue,
    AVG(total_amount)                   AS avg_order_value,
    SUM(CASE WHEN status = 'completed'
        THEN total_amount ELSE 0 END)  AS confirmed_revenue
FROM staging.orders
WHERE status IN ('completed', 'pending', 'shipped')
GROUP BY DATE(order_created_at)
ORDER BY sale_date DESC;

Example: customer mart for the product team

CREATE VIEW marts.customer_summary AS
SELECT
    customer_id,
    COUNT(*)                            AS total_orders,
    SUM(total_amount)                   AS lifetime_value,
    MIN(order_created_at)               AS first_order_at,
    MAX(order_created_at)               AS last_order_at,
    AVG(total_amount)                   AS avg_order_value,
    CURRENT_DATE - MAX(order_created_at)::date AS days_since_last_order
FROM staging.orders
WHERE status = 'completed'
GROUP BY customer_id;

The key difference from staging is that marts has business opinion. “Confirmed revenue” only counts orders with status completed. That rule isn’t technical; it’s a finance decision. If tomorrow they decide that shipped also counts as confirmed revenue, you change the mart, not staging or raw.


The full flow: from source to dashboard

To see it globally, this is the pipeline with the three layers:

Source (API, DB, CSV)


   ┌────────┐
   │  RAW   │  Raw data + ingestion timestamp
   │        │  Immutable, untransformed
   └───┬────┘


  ┌──────────┐
  │ STAGING  │  Typed, deduplicated, nulls handled
  │          │  Technical transformations
  └────┬─────┘


   ┌────────┐
   │ MARTS  │  Metrics, KPIs, business views
   │        │  Domain transformations
   └───┬────┘


  Dashboard / API / ML Model

Orchestration with a simple Python script

def run_pipeline():
    """Full pipeline: raw -> staging -> marts."""
    conn = get_connection()

    # 1. Raw: ingest raw data
    raw_orders = fetch_orders_from_api()
    ingest_raw_orders(raw_orders, conn)
    print("[PIPELINE] Raw completed")

    # 2. Staging: clean and type
    raw_df = pd.read_sql("SELECT * FROM raw.orders", conn)
    staging_df = transform_staging_orders(raw_df)
    staging_df.to_sql('orders', conn, schema='staging', if_exists='replace', index=False)
    print("[PIPELINE] Staging completed")

    # 3. Marts: create business views
    with conn.cursor() as cur:
        cur.execute(open('sql/marts/daily_sales.sql').read())
        cur.execute(open('sql/marts/customer_summary.sql').read())
    conn.commit()
    print("[PIPELINE] Marts completed")

if __name__ == '__main__':
    run_pipeline()

Common mistakes that break pipelines

After building and maintaining several pipelines with this structure, there are error patterns that keep repeating. The most dangerous ones aren’t technical bugs---they’re design decisions that seem harmless at first.

1. Transforming in raw

The classic. “I just normalize the dates before storing them.” Seems reasonable until you discover that the source changed the date format without notice and your normalization has been discarding valid records for weeks. In raw you don’t touch anything. Period.

2. Mixing staging and marts

Putting business calculations in staging is tempting when you’re in a rush. “Since I’m already cleaning, I’ll also calculate the revenue.” The problem comes when the definition of revenue changes and you have to touch staging, which invalidates every table that depends on it.

3. Not versioning schemas

If your staging table has 12 columns today and tomorrow the source adds 3 new fields, your pipeline breaks. Or worse: it keeps working but silently ignores the new fields. Versioning schemas---even with a document or a YAML file---alerts you when something changes.

# schema_orders_v2.yaml
version: 2
source: api_ecommerce
fields:
  - name: order_id
    type: string
    nullable: false
  - name: customer
    type: string
    nullable: false
  - name: total
    type: string  # arrives as string, typed in staging
    nullable: false
  - name: status
    type: string
    nullable: false
    allowed_values: [pending, completed, shipped, cancelled]
  - name: created_at
    type: string  # ISO 8601
    nullable: false

4. Not having _ingested_at in raw

Without an ingestion timestamp you can’t know when a piece of data arrived. If you need to reprocess “the data that arrived yesterday,” without that field you’re lost. It’s a field that costs nothing to add and saves you constantly.

5. Marts tables without business rule documentation

If someone looks at confirmed_revenue in your mart and there’s nowhere that explains it only includes orders with status completed, they’ll assume whatever suits them. Document the business rules next to the SQL or in the schema itself.


When this structure is overkill

Not everything needs three formal layers. If your pipeline is a script that reads a 200-row CSV and loads it into a Google Sheet for a monthly report, setting up raw/staging/marts would be over-engineering.

My criteria for deciding when it’s worth it:

SituationRaw/Staging/MartsA direct script
More than one data sourceYesProbably not
Data is reprocessed periodicallyYesNo
Multiple people consume the metricsYesDepends
Raw data can change formatYesHigh risk
One-time pipeline, ad-hoc analysisNoYes
Production dashboard with SLAYesNever

If your pipeline has more than one source, runs on a recurring basis, and more than one person consumes the results, the three layers will save you trouble. It’s a small investment that scales well.


Relationship with ETL and ELT

If you’re coming from the previous article, here’s how it fits together:

  • In a classic ETL flow, staging transformations happen before loading into the warehouse. Raw can be a temporary directory of files or a staging table.
  • In a modern ELT flow, everything arrives first at the warehouse (raw), and the staging and marts transformations are done with SQL inside the warehouse itself. Tools like dbt fit perfectly into this model.

The raw/staging/marts structure doesn’t compete with ETL or ELT. It’s complementary. It defines the internal organization of the destination, regardless of how the data arrives.


What I wish I had known earlier

When I set up my first pipeline with these layers, I made almost every mistake I’ve described. What changed my perspective was to stop thinking of the layers as technical bureaucracy and start seeing them as a contract: raw promises immutability, staging promises cleanliness, marts promises business relevance.

Each layer has a single responsibility. And just like in code, when a piece has only one reason to change, the entire system is easier to maintain, debug, and evolve.

If you already have a working pipeline without layers, you don’t need to rewrite it all at once. Start by separating raw. Just that---storing the raw data before touching it---already changes the game.

OshyTech

Backend and data engineering focused on scalable systems, automation, and AI.

Navigation

Copyright 2026 OshyTech. All Rights Reserved