Data Modeling: Star Schema, Snowflake & Data Vault

What is Data Modeling?

Data modeling is the process of creating a visual representation of data structures and their relationships. In data engineering, it focuses on organizing data for efficient storage, retrieval, and analysis in data warehouses and analytical systems.

Good data models make queries faster, data easier to understand, and analytics more reliable.

OLTP vs OLAP

Understanding the difference is crucial for choosing the right model:

┌─────────────────────────────────────────────────────────────┐
│                    OLTP vs OLAP                             │
├─────────────────────────────────────────────────────────────┤
│  OLTP (Transactional)        │  OLAP (Analytical)          │
├──────────────────────────────┼─────────────────────────────┤
│  • Many small transactions   │  • Complex queries          │
│  • Insert, Update, Delete    │  • Read-heavy workloads     │
│  • Normalized (3NF)          │  • Denormalized             │
│  • Current data              │  • Historical data          │
│  • Application databases     │  • Data warehouses          │
│  • MySQL, PostgreSQL         │  • Redshift, Snowflake      │
└──────────────────────────────┴─────────────────────────────┘

Dimensional Modeling Concepts

Dimensional modeling organizes data into facts and dimensions:

Facts: Measurable events (sales, clicks, transactions) with numeric values
Dimensions: Context for facts (who, what, when, where, why)
Measures: Numeric values in facts (amount, quantity, duration)
Attributes: Descriptive fields in dimensions (name, category, region)

-- Example: Sales Fact Table
CREATE TABLE fact_sales (
    sale_id         BIGINT PRIMARY KEY,
    date_key        INT REFERENCES dim_date(date_key),
    product_key     INT REFERENCES dim_product(product_key),
    customer_key    INT REFERENCES dim_customer(customer_key),
    store_key       INT REFERENCES dim_store(store_key),

    -- Measures (numeric, aggregatable)
    quantity        INT,
    unit_price      DECIMAL(10,2),
    total_amount    DECIMAL(12,2),
    discount_amount DECIMAL(10,2)
);

-- Example: Product Dimension Table
CREATE TABLE dim_product (
    product_key     INT PRIMARY KEY,
    product_id      VARCHAR(20),      -- Natural key
    product_name    VARCHAR(100),
    category        VARCHAR(50),
    subcategory     VARCHAR(50),
    brand           VARCHAR(50),
    unit_cost       DECIMAL(10,2)
);

Star Schema

The most common dimensional model - simple and fast for queries:

                    ┌─────────────┐
                    │  dim_date   │
                    └──────┬──────┘
                           │
┌─────────────┐    ┌───────┴───────┐    ┌─────────────┐
│ dim_product │────│  fact_sales   │────│ dim_customer│
└─────────────┘    └───────┬───────┘    └─────────────┘
                           │
                    ┌──────┴──────┐
                    │  dim_store  │
                    └─────────────┘

Characteristics:
• Fact table in center, dimensions around it
• Dimensions are denormalized (flat)
• Simple joins (one level)
• Best query performance
• Some data redundancy

-- Star Schema Query Example
SELECT
    d.year,
    d.month_name,
    p.category,
    c.customer_segment,
    SUM(f.total_amount) as total_sales,
    COUNT(DISTINCT f.sale_id) as num_transactions
FROM fact_sales f
JOIN dim_date d ON f.date_key = d.date_key
JOIN dim_product p ON f.product_key = p.product_key
JOIN dim_customer c ON f.customer_key = c.customer_key
WHERE d.year = 2024
GROUP BY d.year, d.month_name, p.category, c.customer_segment
ORDER BY total_sales DESC;

Snowflake Schema

Normalized version of star schema - dimensions are split into sub-dimensions:

         ┌────────────┐
         │dim_category│
         └─────┬──────┘
               │
         ┌─────┴──────┐         ┌─────────────┐
         │dim_product │─────────│ fact_sales  │
         └────────────┘         └──────┬──────┘
                                       │
                    ┌──────────────────┼──────────────────┐
                    │                  │                  │
              ┌─────┴─────┐     ┌──────┴──────┐    ┌──────┴──────┐
              │ dim_date  │     │dim_customer │    │  dim_store  │
              └───────────┘     └──────┬──────┘    └──────┬──────┘
                                       │                  │
                                ┌──────┴──────┐    ┌──────┴──────┐
                                │ dim_segment │    │  dim_region │
                                └─────────────┘    └─────────────┘

Characteristics:
• Dimensions normalized into sub-tables
• Less data redundancy
• More complex joins
• Slower query performance
• Better for dimension maintenance

Data Vault

A modeling approach designed for agility and auditability:

Data Vault Components:
├── Hubs      → Business keys (unique identifiers)
├── Links     → Relationships between hubs
└── Satellites → Descriptive attributes with history

┌─────────────┐         ┌─────────────┐         ┌─────────────┐
│  hub_customer│◄───────│link_cust_ord│───────►│  hub_order   │
└──────┬──────┘         └─────────────┘         └──────┬──────┘
       │                                                │
┌──────┴──────┐                                 ┌──────┴──────┐
│sat_customer │                                 │  sat_order  │
│  _details   │                                 │  _details   │
└─────────────┘                                 └─────────────┘

-- Hub: Business Keys
CREATE TABLE hub_customer (
    hub_customer_key    BIGINT PRIMARY KEY,
    customer_id         VARCHAR(50) NOT NULL,  -- Business key
    load_date           TIMESTAMP,
    record_source       VARCHAR(100)
);

-- Satellite: Descriptive Attributes (with history)
CREATE TABLE sat_customer_details (
    hub_customer_key    BIGINT REFERENCES hub_customer,
    load_date           TIMESTAMP,
    load_end_date       TIMESTAMP,
    customer_name       VARCHAR(100),
    email               VARCHAR(100),
    phone               VARCHAR(20),
    address             VARCHAR(200),
    record_source       VARCHAR(100),
    PRIMARY KEY (hub_customer_key, load_date)
);

-- Link: Relationships
CREATE TABLE link_customer_order (
    link_cust_ord_key   BIGINT PRIMARY KEY,
    hub_customer_key    BIGINT REFERENCES hub_customer,
    hub_order_key       BIGINT REFERENCES hub_order,
    load_date           TIMESTAMP,
    record_source       VARCHAR(100)
);

Choosing the Right Model

┌─────────────────────────────────────────────────────────────────┐
│                  When to Use Each Model                         │
├─────────────────────────────────────────────────────────────────┤
│  Star Schema                                                    │
│  ✓ Simple reporting and BI tools                                │
│  ✓ Known, stable requirements                                   │
│  ✓ Query performance is priority                                │
│  ✓ Self-service analytics                                       │
├─────────────────────────────────────────────────────────────────┤
│  Snowflake Schema                                               │
│  ✓ Dimension tables are very large                              │
│  ✓ Storage is a concern                                         │
│  ✓ Dimension data changes frequently                            │
│  ✓ Strict normalization required                                │
├─────────────────────────────────────────────────────────────────┤
│  Data Vault                                                     │
│  ✓ Multiple source systems                                      │
│  ✓ Requirements change frequently                               │
│  ✓ Full audit trail needed                                      │
│  ✓ Agile/iterative development                                  │
│  ✓ Enterprise data warehouse                                    │
└─────────────────────────────────────────────────────────────────┘

Date Dimension Best Practices

-- Comprehensive Date Dimension
CREATE TABLE dim_date (
    date_key            INT PRIMARY KEY,        -- YYYYMMDD format
    full_date           DATE NOT NULL,

    -- Day attributes
    day_of_week         INT,                    -- 1-7
    day_name            VARCHAR(10),            -- Monday, Tuesday...
    day_of_month        INT,                    -- 1-31
    day_of_year         INT,                    -- 1-366
    is_weekend          BOOLEAN,
    is_holiday          BOOLEAN,
    holiday_name        VARCHAR(50),

    -- Week attributes
    week_of_year        INT,
    week_of_month       INT,

    -- Month attributes
    month_number        INT,                    -- 1-12
    month_name          VARCHAR(10),            -- January, February...
    month_abbr          VARCHAR(3),             -- Jan, Feb...

    -- Quarter attributes
    quarter             INT,                    -- 1-4
    quarter_name        VARCHAR(10),            -- Q1, Q2...

    -- Year attributes
    year                INT,
    fiscal_year         INT,
    fiscal_quarter      INT,

    -- Relative flags
    is_current_day      BOOLEAN,
    is_current_week     BOOLEAN,
    is_current_month    BOOLEAN
);

-- Generate date dimension (Python/PySpark)
from datetime import date, timedelta
import pandas as pd

def generate_date_dim(start_year, end_year):
    dates = pd.date_range(
        start=f'{start_year}-01-01',
        end=f'{end_year}-12-31',
        freq='D'
    )

    df = pd.DataFrame({'full_date': dates})
    df['date_key'] = df['full_date'].dt.strftime('%Y%m%d').astype(int)
    df['day_of_week'] = df['full_date'].dt.dayofweek + 1
    df['day_name'] = df['full_date'].dt.day_name()
    df['month_number'] = df['full_date'].dt.month
    df['month_name'] = df['full_date'].dt.month_name()
    df['quarter'] = df['full_date'].dt.quarter
    df['year'] = df['full_date'].dt.year
    df['is_weekend'] = df['day_of_week'].isin([6, 7])

    return df

Surrogate Keys vs Natural Keys

┌─────────────────────────────────────────────────────────────┐
│            Surrogate Keys vs Natural Keys                   │
├─────────────────────────────────────────────────────────────┤
│  Natural Key                 │  Surrogate Key              │
│  (Business identifier)       │  (System-generated)         │
├──────────────────────────────┼─────────────────────────────┤
│  product_sku: 'SKU-12345'    │  product_key: 1001          │
│  customer_email: 'a@b.com'   │  customer_key: 5432         │
│  order_number: 'ORD-2024-01' │  order_key: 98765           │
├──────────────────────────────┼─────────────────────────────┤
│  Pros:                       │  Pros:                      │
│  • Meaningful to business    │  • Consistent, small size   │
│  • No lookup needed          │  • Handles key changes      │
│                              │  • Better join performance  │
│  Cons:                       │  • Supports SCD             │
│  • May change over time      │                             │
│  • Variable size/format      │  Cons:                      │
│  • From multiple sources     │  • Requires lookup          │
│                              │  • No business meaning      │
└──────────────────────────────┴─────────────────────────────┘

Best Practice: Use surrogate keys in dimension tables,
              keep natural keys for reference/lookup.

Grain Definition

The grain defines what one row in a fact table represents:

Grain Examples:
├── Sales Fact
│   └── One row per line item in an order
│
├── Daily Inventory Fact
│   └── One row per product per store per day
│
├── Website Clicks Fact
│   └── One row per click event
│
└── Monthly Summary Fact
    └── One row per customer per product per month

Questions to Define Grain:
1. What business process does this fact represent?
2. What is the most atomic level of detail?
3. What dimensions describe each fact row?

-- Example: Defining grain for order fact
-- Grain: One row per order line item

CREATE TABLE fact_order_line (
    order_line_key      BIGINT PRIMARY KEY,
    order_key           BIGINT,
    date_key            INT,
    product_key         INT,
    customer_key        INT,

    -- At this grain, we capture:
    quantity            INT,          -- Per line item
    unit_price          DECIMAL(10,2),
    line_total          DECIMAL(12,2),
    discount_percent    DECIMAL(5,2)
);

Best Practices

Define grain first: Always start with the grain before adding dimensions
Use surrogate keys: Integer keys for better join performance
Conformed dimensions: Share dimensions across fact tables for consistency
Avoid null foreign keys: Use "Unknown" or "N/A" dimension rows instead
Document everything: Maintain a data dictionary with business definitions
Design for queries: Model based on how data will be analyzed

Master Data Modeling

Our Data Engineering program covers dimensional modeling, data vault, and warehouse design with hands-on projects.

Explore Data Engineering Program

Data Modeling