What are Slowly Changing Dimensions?
Slowly Changing Dimensions (SCDs) are a set of techniques for tracking changes to dimension attributes over time in a data warehouse. When a customer changes their address or a product moves to a new category, how do you handle it?
SCDs provide different strategies (numbered Types 0 through 6; the rarely used Type 5 is omitted from the summary below) for preserving or updating historical data based on business requirements.
SCD Types Overview
┌─────────────────────────────────────────────────────────────────┐
│ SCD Types Summary │
├──────┬──────────────────────────────────────────────────────────┤
│ Type │ Description │
├──────┼──────────────────────────────────────────────────────────┤
│ 0 │ Retain Original - Never update, keep first value │
│ 1 │ Overwrite - Update in place, no history │
│ 2 │ Add Row - Insert new row for each change (most common) │
│ 3 │ Add Column - Track current and previous value │
│ 4 │ Mini-Dimension - Separate table for changing attributes │
│ 6 │ Hybrid (1+2+3) - Combines multiple approaches │
└──────┴──────────────────────────────────────────────────────────┘
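Types 0 and 1 never get dedicated code sections in most write-ups because they are so simple, but the contrast between them is the heart of the whole taxonomy. Here is a minimal pure-Python sketch of that contrast; the dictionary-based rows are purely illustrative, not a real dimension table:

```python
# Minimal illustration of SCD Type 0 (retain original) vs Type 1 (overwrite).
# Plain dicts stand in for dimension records.

def scd_type0_update(row, updates):
    """Type 0: ignore the incoming change; the first-loaded value is kept."""
    return dict(row)  # attributes are never modified

def scd_type1_update(row, updates):
    """Type 1: overwrite attributes in place; no history survives."""
    updated = dict(row)
    updated.update(updates)
    return updated

row = {"customer_id": "C001", "name": "Alice", "city": "NYC"}
change = {"city": "LA"}

print(scd_type0_update(row, change)["city"])  # NYC - original retained
print(scd_type1_update(row, change)["city"])  # LA  - overwritten, NYC lost
```

Both functions return a copy rather than mutating the input, which keeps the comparison side-effect free.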
Type 1: Overwrite
Simply update the existing record. No history is preserved.
-- Type 1: Overwrite (No History)
-- Use when: History doesn't matter, corrections, typos
-- Before: Customer moved from NYC to LA
-- | customer_key | customer_id | name | city |
-- | 1 | C001 | Alice | NYC |
-- After update:
-- | customer_key | customer_id | name | city |
-- | 1 | C001 | Alice | LA | ← City overwritten
-- SQL Implementation
UPDATE dim_customer
SET city = 'LA',
updated_at = CURRENT_TIMESTAMP
WHERE customer_id = 'C001';
-- Delta Lake / Spark Implementation
from delta.tables import DeltaTable
dim_customer = DeltaTable.forPath(spark, "/delta/dim_customer")
dim_customer.alias("target").merge(
    source_df.alias("source"),
    "target.customer_id = source.customer_id"
).whenMatchedUpdate(set={
    "city": "source.city",
    "updated_at": "current_timestamp()"
}).whenNotMatchedInsertAll().execute()
Type 2: Add New Row
Insert a new row for each change, preserving the full history. This is the most commonly used SCD type.
-- Type 2: Add New Row (Full History)
-- Use when: Full history required for analytics
-- Table structure with SCD Type 2 columns
CREATE TABLE dim_customer (
customer_key BIGINT PRIMARY KEY, -- Surrogate key
customer_id VARCHAR(20), -- Natural/business key
name VARCHAR(100),
city VARCHAR(50),
effective_date DATE, -- When this version started
end_date DATE, -- When this version ended (NULL = current)
is_current BOOLEAN -- Flag for current record
);
-- Before: Customer in NYC
-- | customer_key | customer_id | name | city | effective_date | end_date | is_current |
-- | 1 | C001 | Alice | NYC | 2023-01-01 | NULL | true |
-- After customer moves to LA on 2024-06-15:
-- | customer_key | customer_id | name | city | effective_date | end_date | is_current |
-- | 1 | C001 | Alice | NYC | 2023-01-01 | 2024-06-14 | false | ← Closed
-- | 2 | C001 | Alice | LA | 2024-06-15 | NULL | true | ← New
-- Implementation Steps:
-- 1. Expire the current record
UPDATE dim_customer
SET end_date = DATE('2024-06-14'),
is_current = false
WHERE customer_id = 'C001'
AND is_current = true;
-- 2. Insert new record
INSERT INTO dim_customer
(customer_key, customer_id, name, city, effective_date, end_date, is_current)
VALUES
(2, 'C001', 'Alice', 'LA', '2024-06-15', NULL, true);
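The two SQL steps above (expire, then insert) can be sketched as a single function over an in-memory table. This is a pure-Python illustration, with a list of dicts standing in for `dim_customer`; column names match the DDL above, and the closed-interval convention (old row ends the day before the new one starts) matches the example rows:

```python
from datetime import date, timedelta

def scd2_apply_change(table, customer_id, new_attrs, change_date):
    """Expire the current row for customer_id and insert a new version.

    `table` is a list of dicts with the same columns as dim_customer.
    The expired row's end_date is set to the day before the new row's
    effective_date, mirroring the SQL above.
    """
    next_key = max(r["customer_key"] for r in table) + 1
    for row in table:
        if row["customer_id"] == customer_id and row["is_current"]:
            # Step 1: expire the current record
            row["end_date"] = change_date - timedelta(days=1)
            row["is_current"] = False
            # Step 2: insert the new version with a fresh surrogate key
            new_row = {**row, **new_attrs,
                       "customer_key": next_key,
                       "effective_date": change_date,
                       "end_date": None,
                       "is_current": True}
            table.append(new_row)
            return new_row
    return None  # no current row found for this business key

dim_customer = [{"customer_key": 1, "customer_id": "C001", "name": "Alice",
                 "city": "NYC", "effective_date": date(2023, 1, 1),
                 "end_date": None, "is_current": True}]
scd2_apply_change(dim_customer, "C001", {"city": "LA"}, date(2024, 6, 15))
```

After the call, the table holds both versions: the NYC row closed on 2024-06-14, and a current LA row with surrogate key 2.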
Type 2: Delta Lake Implementation
from pyspark.sql.functions import *
from delta.tables import DeltaTable
def apply_scd_type2(spark, target_path, source_df, key_column, tracked_columns):
    """
    Apply SCD Type 2 logic using Delta Lake MERGE.
    """
    target = DeltaTable.forPath(spark, target_path)

    # Prepare source with change detection
    source_with_hash = source_df.withColumn(
        "source_hash",
        md5(concat_ws("||", *[col(c) for c in tracked_columns]))
    )

    # Get current records from target
    current_target = target.toDF().filter(col("is_current") == True)
    current_target = current_target.withColumn(
        "target_hash",
        md5(concat_ws("||", *[col(c) for c in tracked_columns]))
    )

    # Find changed records
    changes = source_with_hash.alias("s").join(
        current_target.alias("t"),
        col(f"s.{key_column}") == col(f"t.{key_column}")
    ).filter(col("source_hash") != col("target_hash")).select("s.*")

    # Find new records
    new_records = source_with_hash.alias("s").join(
        current_target.alias("t"),
        col(f"s.{key_column}") == col(f"t.{key_column}"),
        "left_anti"
    )

    # Expire changed records
    if changes.count() > 0:
        target.alias("t").merge(
            changes.alias("s"),
            f"t.{key_column} = s.{key_column} AND t.is_current = true"
        ).whenMatchedUpdate(set={
            "end_date": "date_sub(current_date(), 1)",
            "is_current": "false"
        }).execute()

    # Insert new versions and new records
    # Note: surrogate-key generation is omitted here; in practice, populate
    # customer_key via an identity column on the Delta table or a sequence.
    inserts = changes.union(new_records).select(
        *[col(c) for c in source_df.columns],
        current_date().alias("effective_date"),
        lit(None).cast("date").alias("end_date"),
        lit(True).alias("is_current")
    )
    inserts.write.format("delta").mode("append").save(target_path)
# Usage
apply_scd_type2(
spark,
"/delta/dim_customer",
new_customers_df,
key_column="customer_id",
tracked_columns=["name", "city", "email"]
)
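The hash-based change detection inside `apply_scd_type2` can be reproduced in plain Python with `hashlib`. This sketch mirrors the intent of `md5(concat_ws("||", ...))` so you can see why two rows compare equal only when every tracked column matches (note: Spark's `concat_ws` skips NULLs, while `str(None)` below does not, so the hashes are not byte-identical across the two systems; this only illustrates the comparison logic):

```python
import hashlib

def row_hash(row, tracked_columns, sep="||"):
    """Rough equivalent of md5(concat_ws('||', col1, col2, ...)) for a dict row."""
    joined = sep.join(str(row[c]) for c in tracked_columns)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()

old = {"name": "Alice", "city": "NYC", "email": "a@example.com"}
new = {"name": "Alice", "city": "LA",  "email": "a@example.com"}

tracked = ["name", "city", "email"]
changed = row_hash(old, tracked) != row_hash(new, tracked)
print(changed)  # True: the city differs, so this row needs a new version
```

One design caveat carries over to the Spark version: because the columns are joined with a delimiter, values that themselves contain `||` (or NULLs dropped by `concat_ws`) can collide; pick a delimiter that cannot appear in the data.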
Type 3: Add Column
Track current and previous values in separate columns. Limited history.
-- Type 3: Add Column (Limited History)
-- Use when: Only need previous value, not full history
CREATE TABLE dim_customer (
customer_key BIGINT PRIMARY KEY,
customer_id VARCHAR(20),
name VARCHAR(100),
current_city VARCHAR(50),
previous_city VARCHAR(50), -- Stores one previous value
city_change_date DATE
);
-- Before:
-- | customer_key | customer_id | name | current_city | previous_city | city_change_date |
-- | 1 | C001 | Alice | NYC | NULL | NULL |
-- After moving to LA:
-- | customer_key | customer_id | name | current_city | previous_city | city_change_date |
-- | 1 | C001 | Alice | LA | NYC | 2024-06-15 |
-- SQL Implementation
UPDATE dim_customer
SET previous_city = current_city,
current_city = 'LA',
city_change_date = '2024-06-15'
WHERE customer_id = 'C001';
-- Limitation: If Alice moves again (LA → Chicago), NYC is lost!
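The limitation flagged in the comment above is easy to demonstrate. Here is a pure-Python sketch of the Type 3 update (dict rows are illustrative, mirroring the columns in the DDL above); after a second move, the oldest value is gone:

```python
def scd3_move(row, new_city, change_date):
    """Type 3: shift current -> previous, then overwrite current.

    Only one level of history survives: a second move discards
    whatever was sitting in previous_city.
    """
    row["previous_city"] = row["current_city"]
    row["current_city"] = new_city
    row["city_change_date"] = change_date
    return row

row = {"customer_id": "C001", "current_city": "NYC",
       "previous_city": None, "city_change_date": None}

scd3_move(row, "LA", "2024-06-15")       # previous_city is now NYC
scd3_move(row, "Chicago", "2025-01-10")  # previous_city is now LA; NYC is lost
print(row["previous_city"])  # LA
```

The order of the two assignments matters: `previous_city` must capture the old value before `current_city` is overwritten, which is why the SQL above lists `previous_city = current_city` first.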
Type 6: Hybrid
Combines Types 1, 2, and 3 for maximum flexibility.
-- Type 6: Hybrid (1 + 2 + 3)
-- Best of all worlds: Full history + current value everywhere
CREATE TABLE dim_customer (
customer_key BIGINT PRIMARY KEY, -- Surrogate key
customer_id VARCHAR(20), -- Business key
name VARCHAR(100),
city VARCHAR(50), -- Historical value (Type 2)
current_city VARCHAR(50), -- Current value (Type 1)
effective_date DATE,
end_date DATE,
is_current BOOLEAN
);
-- Example: Customer moved NYC → LA → Chicago
-- | customer_key | customer_id | name | city | current_city | effective | end_date | is_current |
-- | 1 | C001 | Alice | NYC | Chicago | 2023-01-01 | 2024-03-14 | false |
-- | 2 | C001 | Alice | LA | Chicago | 2024-03-15 | 2024-09-14 | false |
-- | 3 | C001 | Alice | Chicago | Chicago | 2024-09-15 | NULL | true |
-- Benefits:
-- • city column: Historical value at that point in time (for point-in-time analysis)
-- • current_city: Always shows current value (easy current-state queries)
-- • All history preserved
-- Query: Sales by customer's city AT TIME OF SALE
SELECT
f.sale_date,
d.city AS city_at_sale, -- Historical (Type 2)
f.amount
FROM fact_sales f
JOIN dim_customer d ON f.customer_key = d.customer_key;
-- Query: Current customer locations (simple)
SELECT customer_id, current_city
FROM dim_customer
WHERE is_current = true;
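The point-in-time behaviour behind the first query can be sketched in plain Python. Given Type 6 rows (data copied from the example table above), this hypothetical `city_as_of` helper finds the version whose effective window contains a given date:

```python
from datetime import date

dim_customer = [
    {"customer_key": 1, "city": "NYC", "current_city": "Chicago",
     "effective_date": date(2023, 1, 1), "end_date": date(2024, 3, 14)},
    {"customer_key": 2, "city": "LA", "current_city": "Chicago",
     "effective_date": date(2024, 3, 15), "end_date": date(2024, 9, 14)},
    {"customer_key": 3, "city": "Chicago", "current_city": "Chicago",
     "effective_date": date(2024, 9, 15), "end_date": None},
]

def city_as_of(rows, as_of):
    """Return the historical city for the version effective on `as_of`."""
    for row in rows:
        ends = row["end_date"] or date.max  # open-ended current row
        if row["effective_date"] <= as_of <= ends:
            return row["city"]
    return None

print(city_as_of(dim_customer, date(2024, 6, 1)))  # LA
print(city_as_of(dim_customer, date(2025, 1, 1)))  # Chicago
```

In the warehouse this lookup happens implicitly: the fact row stores the surrogate key that was current at load time, so the join on `customer_key` lands on the right version without any date filtering.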
Choosing the Right Type
┌─────────────────────────────────────────────────────────────────┐
│ When to Use Each SCD Type │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Type 1 (Overwrite) │
│ ✓ Corrections/typos │
│ ✓ Attributes that don't need history │
│ ✓ Data that is always "current" (e.g., calculated fields) │
│ │
│ Type 2 (Add Row) │
│ ✓ Full audit trail required │
│ ✓ Point-in-time analysis needed │
│ ✓ Regulatory/compliance requirements │
│ ✓ Most dimension attributes │
│ │
│ Type 3 (Add Column) │
│ ✓ Only need one previous value │
│ ✓ Simple comparison current vs previous │
│ ✓ Changes are rare │
│ │
│ Type 6 (Hybrid) │
│ ✓ Need both history AND easy current value access │
│ ✓ Complex reporting requirements │
│ ✓ Performance-critical current-state queries │
│ │
└─────────────────────────────────────────────────────────────────┘
Best Practices
- Use surrogate keys: Always use surrogate keys for Type 2 dimensions
- Standard columns: Use consistent naming (effective_date, end_date, is_current)
- Index properly: Index business key + is_current for fast lookups
- Handle deletes: Consider "deleted" flag instead of physical deletion
- Combine types: Different attributes can use different types within one table
- Document decisions: Record why each attribute uses a specific SCD type