What are Slowly Changing Dimensions?
Slowly Changing Dimensions (SCDs) are a set of techniques for tracking changes to dimension attributes over time in a data warehouse. When a customer changes their address or a product moves to a new category, how do you handle it?
SCDs provide different strategies (numbered Types 0 through 6; the rarely used Type 5 is omitted from the summary below) for preserving or updating historical data based on business requirements.
SCD Types Overview
┌─────────────────────────────────────────────────────────────────┐
│ SCD Types Summary │
├──────┬──────────────────────────────────────────────────────────┤
│ Type │ Description │
├──────┼──────────────────────────────────────────────────────────┤
│ 0 │ Retain Original - Never update, keep first value │
│ 1 │ Overwrite - Update in place, no history │
│ 2 │ Add Row - Insert new row for each change (most common) │
│ 3 │ Add Column - Track current and previous value │
│ 4 │ Mini-Dimension - Separate table for changing attributes │
│ 6 │ Hybrid (1+2+3) - Combines multiple approaches │
└──────┴──────────────────────────────────────────────────────────┘
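Types 0 and 1 never get dedicated code sections in most write-ups because they are so simple, but the contrast between them is the heart of the whole taxonomy. Here is a minimal pure-Python sketch of that contrast; the dictionary-based rows are purely illustrative, not a real dimension table:

```python
# Minimal illustration of SCD Type 0 (retain original) vs Type 1 (overwrite).
# Plain dicts stand in for dimension records.

def scd_type0_update(row, updates):
    """Type 0: ignore the incoming change; the first-loaded value is kept."""
    return dict(row)  # attributes are never modified

def scd_type1_update(row, updates):
    """Type 1: overwrite attributes in place; no history survives."""
    updated = dict(row)
    updated.update(updates)
    return updated

row = {"customer_id": "C001", "name": "Alice", "city": "NYC"}
change = {"city": "LA"}

print(scd_type0_update(row, change)["city"])  # NYC - original retained
print(scd_type1_update(row, change)["city"])  # LA  - overwritten, NYC lost
```

Both functions return a copy rather than mutating the input, which keeps the comparison side-effect free.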
Type 1: Overwrite
Simply update the existing record. No history is preserved.
-- Type 1: Overwrite (No History)
-- Use when: History doesn't matter, corrections, typos
-- Before: Customer moved from NYC to LA
-- | customer_key | customer_id | name | city |
-- | 1 | C001 | Alice | NYC |
-- After update:
-- | customer_key | customer_id | name | city |
-- | 1 | C001 | Alice | LA | ← City overwritten
-- SQL Implementation
UPDATE dim_customer
SET city = 'LA',
updated_at = CURRENT_TIMESTAMP
WHERE customer_id = 'C001';
-- Delta Lake / Spark Implementation
from delta.tables import DeltaTable
dim_customer = DeltaTable.forPath(spark, "/delta/dim_customer")
dim_customer.alias("target").merge(
    source_df.alias("source"),
    "target.customer_id = source.customer_id"
).whenMatchedUpdate(set={
    "city": "source.city",
    "updated_at": "current_timestamp()"
}).whenNotMatchedInsertAll().execute()
Type 2: Add New Row
Insert a new row for each change, preserving the full history. This is the most commonly used SCD type.
-- Type 2: Add New Row (Full History)
-- Use when: Full history required for analytics
-- Table structure with SCD Type 2 columns
CREATE TABLE dim_customer (
customer_key BIGINT PRIMARY KEY, -- Surrogate key
customer_id VARCHAR(20), -- Natural/business key
name VARCHAR(100),
city VARCHAR(50),
effective_date DATE, -- When this version started
end_date DATE, -- When this version ended (NULL = current)
is_current BOOLEAN -- Flag for current record
);
-- Before: Customer in NYC
-- | customer_key | customer_id | name | city | effective_date | end_date | is_current |
-- | 1 | C001 | Alice | NYC | 2023-01-01 | NULL | true |
-- After customer moves to LA on 2024-06-15:
-- | customer_key | customer_id | name | city | effective_date | end_date | is_current |
-- | 1 | C001 | Alice | NYC | 2023-01-01 | 2024-06-14 | false | ← Closed
-- | 2 | C001 | Alice | LA | 2024-06-15 | NULL | true | ← New
-- Implementation Steps:
-- 1. Expire the current record
UPDATE dim_customer
SET end_date = DATE('2024-06-14'),
is_current = false
WHERE customer_id = 'C001'
AND is_current = true;
-- 2. Insert new record
INSERT INTO dim_customer
(customer_key, customer_id, name, city, effective_date, end_date, is_current)
VALUES
(2, 'C001', 'Alice', 'LA', '2024-06-15', NULL, true);
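The two SQL steps above (expire, then insert) can be sketched as a single function over an in-memory table. This is a pure-Python illustration, with a list of dicts standing in for `dim_customer`; column names match the DDL above, and the closed-interval convention (old row ends the day before the new one starts) matches the example rows:

```python
from datetime import date, timedelta

def scd2_apply_change(table, customer_id, new_attrs, change_date):
    """Expire the current row for customer_id and insert a new version.

    `table` is a list of dicts with the same columns as dim_customer.
    The expired row's end_date is set to the day before the new row's
    effective_date, mirroring the SQL above.
    """
    next_key = max(r["customer_key"] for r in table) + 1
    for row in table:
        if row["customer_id"] == customer_id and row["is_current"]:
            # Step 1: expire the current record
            row["end_date"] = change_date - timedelta(days=1)
            row["is_current"] = False
            # Step 2: insert the new version with a fresh surrogate key
            new_row = {**row, **new_attrs,
                       "customer_key": next_key,
                       "effective_date": change_date,
                       "end_date": None,
                       "is_current": True}
            table.append(new_row)
            return new_row
    return None  # no current row found for this business key

dim_customer = [{"customer_key": 1, "customer_id": "C001", "name": "Alice",
                 "city": "NYC", "effective_date": date(2023, 1, 1),
                 "end_date": None, "is_current": True}]
scd2_apply_change(dim_customer, "C001", {"city": "LA"}, date(2024, 6, 15))
```

After the call, the table holds both versions: the NYC row closed on 2024-06-14, and a current LA row with surrogate key 2.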
Type 2: Delta Lake Implementation
from pyspark.sql.functions import *
from delta.tables import DeltaTable
def apply_scd_type2(spark, target_path, source_df, key_column, tracked_columns):
    """
    Apply SCD Type 2 logic using Delta Lake MERGE.
    """
    target = DeltaTable.forPath(spark, target_path)

    # Prepare source with change detection
    source_with_hash = source_df.withColumn(
        "source_hash",
        md5(concat_ws("||", *[col(c) for c in tracked_columns]))
    )

    # Get current records from target
    current_target = target.toDF().filter(col("is_current") == True)
    current_target = current_target.withColumn(
        "target_hash",
        md5(concat_ws("||", *[col(c) for c in tracked_columns]))
    )

    # Find changed records
    changes = source_with_hash.alias("s").join(
        current_target.alias("t"),
        col(f"s.{key_column}") == col(f"t.{key_column}")
    ).filter(col("source_hash") != col("target_hash")).select("s.*")

    # Find new records
    new_records = source_with_hash.alias("s").join(
        current_target.alias("t"),
        col(f"s.{key_column}") == col(f"t.{key_column}"),
        "left_anti"
    )

    # Expire changed records
    if changes.count() > 0:
        target.alias("t").merge(
            changes.alias("s"),
            f"t.{key_column} = s.{key_column} AND t.is_current = true"
        ).whenMatchedUpdate(set={
            "end_date": "date_sub(current_date(), 1)",
            "is_current": "false"
        }).execute()

    # Insert new versions and new records
    # Note: surrogate-key generation is omitted here; in practice, populate
    # customer_key via an identity column on the Delta table or a sequence.
    inserts = changes.union(new_records).select(
        *[col(c) for c in source_df.columns],
        current_date().alias("effective_date"),
        lit(None).cast("date").alias("end_date"),
        lit(True).alias("is_current")
    )
    inserts.write.format("delta").mode("append").save(target_path)
# Usage
apply_scd_type2(
spark,
"/delta/dim_customer",
new_customers_df,
key_column="customer_id",
tracked_columns=["name", "city", "email"]
)
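The hash-based change detection inside `apply_scd_type2` can be reproduced in plain Python with `hashlib`. This sketch mirrors the intent of `md5(concat_ws("||", ...))` so you can see why two rows compare equal only when every tracked column matches (note: Spark's `concat_ws` skips NULLs, while `str(None)` below does not, so the hashes are not byte-identical across the two systems; this only illustrates the comparison logic):

```python
import hashlib

def row_hash(row, tracked_columns, sep="||"):
    """Rough equivalent of md5(concat_ws('||', col1, col2, ...)) for a dict row."""
    joined = sep.join(str(row[c]) for c in tracked_columns)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()

old = {"name": "Alice", "city": "NYC", "email": "a@example.com"}
new = {"name": "Alice", "city": "LA",  "email": "a@example.com"}

tracked = ["name", "city", "email"]
changed = row_hash(old, tracked) != row_hash(new, tracked)
print(changed)  # True: the city differs, so this row needs a new version
```

One design caveat carries over to the Spark version: because the columns are joined with a delimiter, values that themselves contain `||` (or NULLs dropped by `concat_ws`) can collide; pick a delimiter that cannot appear in the data.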
Type 3: Add Column
Track current and previous values in separate columns. Limited history.
-- Type 3: Add Column (Limited History)
-- Use when: Only need previous value, not full history
CREATE TABLE dim_customer (
customer_key BIGINT PRIMARY KEY,
customer_id VARCHAR(20),
name VARCHAR(100),
current_city VARCHAR(50),
previous_city VARCHAR(50), -- Stores one previous value
city_change_date DATE
);
-- Before:
-- | customer_key | customer_id | name | current_city | previous_city | city_change_date |
-- | 1 | C001 | Alice | NYC | NULL | NULL |
-- After moving to LA:
-- | customer_key | customer_id | name | current_city | previous_city | city_change_date |
-- | 1 | C001 | Alice | LA | NYC | 2024-06-15 |
-- SQL Implementation
UPDATE dim_customer
SET previous_city = current_city,
current_city = 'LA',
city_change_date = '2024-06-15'
WHERE customer_id = 'C001';
-- Limitation: If Alice moves again (LA → Chicago), NYC is lost!
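The limitation flagged in the comment above is easy to demonstrate. Here is a pure-Python sketch of the Type 3 update (dict rows are illustrative, mirroring the columns in the DDL above); after a second move, the oldest value is gone:

```python
def scd3_move(row, new_city, change_date):
    """Type 3: shift current -> previous, then overwrite current.

    Only one level of history survives: a second move discards
    whatever was sitting in previous_city.
    """
    row["previous_city"] = row["current_city"]
    row["current_city"] = new_city
    row["city_change_date"] = change_date
    return row

row = {"customer_id": "C001", "current_city": "NYC",
       "previous_city": None, "city_change_date": None}

scd3_move(row, "LA", "2024-06-15")       # previous_city is now NYC
scd3_move(row, "Chicago", "2025-01-10")  # previous_city is now LA; NYC is lost
print(row["previous_city"])  # LA
```

The order of the two assignments matters: `previous_city` must capture the old value before `current_city` is overwritten, which is why the SQL above lists `previous_city = current_city` first.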
Type 6: Hybrid
Combines Types 1, 2, and 3 for maximum flexibility.
-- Type 6: Hybrid (1 + 2 + 3)
-- Best of all worlds: Full history + current value everywhere
CREATE TABLE dim_customer (
customer_key BIGINT PRIMARY KEY, -- Surrogate key
customer_id VARCHAR(20), -- Business key
name VARCHAR(100),
city VARCHAR(50), -- Historical value (Type 2)
current_city VARCHAR(50), -- Current value (Type 1)
effective_date DATE,
end_date DATE,
is_current BOOLEAN
);
-- Example: Customer moved NYC → LA → Chicago
-- | customer_key | customer_id | name | city | current_city | effective | end_date | is_current |
-- | 1 | C001 | Alice | NYC | Chicago | 2023-01-01 | 2024-03-14 | false |
-- | 2 | C001 | Alice | LA | Chicago | 2024-03-15 | 2024-09-14 | false |
-- | 3 | C001 | Alice | Chicago | Chicago | 2024-09-15 | NULL | true |
-- Benefits:
-- • city column: Historical value at that point in time (for point-in-time analysis)
-- • current_city: Always shows current value (easy current-state queries)
-- • All history preserved
-- Query: Sales by customer's city AT TIME OF SALE
SELECT
f.sale_date,
d.city AS city_at_sale, -- Historical (Type 2)
f.amount
FROM fact_sales f
JOIN dim_customer d ON f.customer_key = d.customer_key;
-- Query: Current customer locations (simple)
SELECT customer_id, current_city
FROM dim_customer
WHERE is_current = true;
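The point-in-time behaviour behind the first query can be sketched in plain Python. Given Type 6 rows (data copied from the example table above), this hypothetical `city_as_of` helper finds the version whose effective window contains a given date:

```python
from datetime import date

dim_customer = [
    {"customer_key": 1, "city": "NYC", "current_city": "Chicago",
     "effective_date": date(2023, 1, 1), "end_date": date(2024, 3, 14)},
    {"customer_key": 2, "city": "LA", "current_city": "Chicago",
     "effective_date": date(2024, 3, 15), "end_date": date(2024, 9, 14)},
    {"customer_key": 3, "city": "Chicago", "current_city": "Chicago",
     "effective_date": date(2024, 9, 15), "end_date": None},
]

def city_as_of(rows, as_of):
    """Return the historical city for the version effective on `as_of`."""
    for row in rows:
        ends = row["end_date"] or date.max  # open-ended current row
        if row["effective_date"] <= as_of <= ends:
            return row["city"]
    return None

print(city_as_of(dim_customer, date(2024, 6, 1)))  # LA
print(city_as_of(dim_customer, date(2025, 1, 1)))  # Chicago
```

In the warehouse this lookup happens implicitly: the fact row stores the surrogate key that was current at load time, so the join on `customer_key` lands on the right version without any date filtering.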
Choosing the Right Type
┌─────────────────────────────────────────────────────────────────┐
│ When to Use Each SCD Type │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Type 1 (Overwrite) │
│ ✓ Corrections/typos │
│ ✓ Attributes that don't need history │
│ ✓ Data that is always "current" (e.g., calculated fields) │
│ │
│ Type 2 (Add Row) │
│ ✓ Full audit trail required │
│ ✓ Point-in-time analysis needed │
│ ✓ Regulatory/compliance requirements │
│ ✓ Most dimension attributes │
│ │
│ Type 3 (Add Column) │
│ ✓ Only need one previous value │
│ ✓ Simple comparison current vs previous │
│ ✓ Changes are rare │
│ │
│ Type 6 (Hybrid) │
│ ✓ Need both history AND easy current value access │
│ ✓ Complex reporting requirements │
│ ✓ Performance-critical current-state queries │
│ │
└─────────────────────────────────────────────────────────────────┘
Best Practices
- Use surrogate keys: Always use surrogate keys for Type 2 dimensions
- Standard columns: Use consistent naming (effective_date, end_date, is_current)
- Index properly: Index business key + is_current for fast lookups
- Handle deletes: Consider "deleted" flag instead of physical deletion
- Combine types: Different attributes can use different types within one table
- Document decisions: Record why each attribute uses a specific SCD type