What is Data Modeling?
Data modeling is the process of creating a visual representation of data structures and their relationships. In data engineering, it focuses on organizing data for efficient storage, retrieval, and analysis in data warehouses and analytical systems.
Good data models make queries faster, data easier to understand, and analytics more reliable.
OLTP vs OLAP
Understanding the difference is crucial for choosing the right model:
┌─────────────────────────────────────────────────────────────┐
│ OLTP vs OLAP │
├─────────────────────────────────────────────────────────────┤
│ OLTP (Transactional) │ OLAP (Analytical) │
├──────────────────────────────┼─────────────────────────────┤
│ • Many small transactions │ • Complex queries │
│ • Insert, Update, Delete │ • Read-heavy workloads │
│ • Normalized (3NF) │ • Denormalized │
│ • Current data │ • Historical data │
│ • Application databases │ • Data warehouses │
│ • MySQL, PostgreSQL │ • Redshift, Snowflake │
└──────────────────────────────┴─────────────────────────────┘
Dimensional Modeling Concepts
Dimensional modeling organizes data into facts and dimensions:
- Facts: Measurable events (sales, clicks, transactions) with numeric values
- Dimensions: Context for facts (who, what, when, where, why)
- Measures: Numeric values in facts (amount, quantity, duration)
- Attributes: Descriptive fields in dimensions (name, category, region)
-- Example: Sales Fact Table
CREATE TABLE fact_sales (
sale_id BIGINT PRIMARY KEY,
date_key INT REFERENCES dim_date(date_key),
product_key INT REFERENCES dim_product(product_key),
customer_key INT REFERENCES dim_customer(customer_key),
store_key INT REFERENCES dim_store(store_key),
-- Measures (numeric, aggregatable)
quantity INT,
unit_price DECIMAL(10,2),
total_amount DECIMAL(12,2),
discount_amount DECIMAL(10,2)
);
-- Example: Product Dimension Table
CREATE TABLE dim_product (
product_key INT PRIMARY KEY,
product_id VARCHAR(20), -- Natural key
product_name VARCHAR(100),
category VARCHAR(50),
subcategory VARCHAR(50),
brand VARCHAR(50),
unit_cost DECIMAL(10,2)
);
Star Schema
The most common dimensional model - simple and fast for queries:
┌─────────────┐
│ dim_date │
└──────┬──────┘
│
┌─────────────┐ ┌───────┴───────┐ ┌─────────────┐
│ dim_product │────│ fact_sales │────│ dim_customer│
└─────────────┘ └───────┬───────┘ └─────────────┘
│
┌──────┴──────┐
│ dim_store │
└─────────────┘
Characteristics:
• Fact table in center, dimensions around it
• Dimensions are denormalized (flat)
• Simple joins (one level)
• Best query performance
• Some data redundancy
-- Star Schema Query Example
SELECT
d.year,
d.month_name,
p.category,
c.customer_segment,
SUM(f.total_amount) as total_sales,
COUNT(DISTINCT f.sale_id) as num_transactions
FROM fact_sales f
JOIN dim_date d ON f.date_key = d.date_key
JOIN dim_product p ON f.product_key = p.product_key
JOIN dim_customer c ON f.customer_key = c.customer_key
WHERE d.year = 2024
GROUP BY d.year, d.month_name, p.category, c.customer_segment
ORDER BY total_sales DESC;
Snowflake Schema
Normalized version of star schema - dimensions are split into sub-dimensions:
┌────────────┐
│dim_category│
└─────┬──────┘
│
┌─────┴──────┐ ┌─────────────┐
│dim_product │─────────│ fact_sales │
└────────────┘ └──────┬──────┘
│
┌──────────────────┼──────────────────┐
│ │ │
┌─────┴─────┐ ┌──────┴──────┐ ┌──────┴──────┐
│ dim_date │ │dim_customer │ │ dim_store │
└───────────┘ └──────┬──────┘ └──────┬──────┘
│ │
┌──────┴──────┐ ┌──────┴──────┐
│ dim_segment │ │ dim_region │
└─────────────┘ └─────────────┘
Characteristics:
• Dimensions normalized into sub-tables
• Less data redundancy
• More complex joins
• Slower query performance
• Better for dimension maintenance
Data Vault
A modeling approach designed for agility and auditability:
Data Vault Components:
├── Hubs → Business keys (unique identifiers)
├── Links → Relationships between hubs
└── Satellites → Descriptive attributes with history
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ hub_customer│◄───────│link_cust_ord│───────►│ hub_order │
└──────┬──────┘ └─────────────┘ └──────┬──────┘
│ │
┌──────┴──────┐ ┌──────┴──────┐
│sat_customer │ │ sat_order │
│ _details │ │ _details │
└─────────────┘ └─────────────┘
-- Hub: Business Keys
CREATE TABLE hub_customer (
hub_customer_key BIGINT PRIMARY KEY,
customer_id VARCHAR(50) NOT NULL, -- Business key
load_date TIMESTAMP,
record_source VARCHAR(100)
);
-- Satellite: Descriptive Attributes (with history)
CREATE TABLE sat_customer_details (
hub_customer_key BIGINT REFERENCES hub_customer,
load_date TIMESTAMP,
load_end_date TIMESTAMP,
customer_name VARCHAR(100),
email VARCHAR(100),
phone VARCHAR(20),
address VARCHAR(200),
record_source VARCHAR(100),
PRIMARY KEY (hub_customer_key, load_date)
);
-- Link: Relationships
CREATE TABLE link_customer_order (
link_cust_ord_key BIGINT PRIMARY KEY,
hub_customer_key BIGINT REFERENCES hub_customer,
hub_order_key BIGINT REFERENCES hub_order,
load_date TIMESTAMP,
record_source VARCHAR(100)
);
Choosing the Right Model
┌─────────────────────────────────────────────────────────────────┐
│ When to Use Each Model │
├─────────────────────────────────────────────────────────────────┤
│ Star Schema │
│ ✓ Simple reporting and BI tools │
│ ✓ Known, stable requirements │
│ ✓ Query performance is priority │
│ ✓ Self-service analytics │
├─────────────────────────────────────────────────────────────────┤
│ Snowflake Schema │
│ ✓ Dimension tables are very large │
│ ✓ Storage is a concern │
│ ✓ Dimension data changes frequently │
│ ✓ Strict normalization required │
├─────────────────────────────────────────────────────────────────┤
│ Data Vault │
│ ✓ Multiple source systems │
│ ✓ Requirements change frequently │
│ ✓ Full audit trail needed │
│ ✓ Agile/iterative development │
│ ✓ Enterprise data warehouse │
└─────────────────────────────────────────────────────────────────┘
Date Dimension Best Practices
-- Comprehensive Date Dimension
CREATE TABLE dim_date (
date_key INT PRIMARY KEY, -- YYYYMMDD format
full_date DATE NOT NULL,
-- Day attributes
day_of_week INT, -- 1-7
day_name VARCHAR(10), -- Monday, Tuesday...
day_of_month INT, -- 1-31
day_of_year INT, -- 1-366
is_weekend BOOLEAN,
is_holiday BOOLEAN,
holiday_name VARCHAR(50),
-- Week attributes
week_of_year INT,
week_of_month INT,
-- Month attributes
month_number INT, -- 1-12
month_name VARCHAR(10), -- January, February...
month_abbr VARCHAR(3), -- Jan, Feb...
-- Quarter attributes
quarter INT, -- 1-4
quarter_name VARCHAR(10), -- Q1, Q2...
-- Year attributes
year INT,
fiscal_year INT,
fiscal_quarter INT,
-- Relative flags
is_current_day BOOLEAN,
is_current_week BOOLEAN,
is_current_month BOOLEAN
);
-- Generate date dimension (Python/PySpark)
from datetime import date, timedelta
import pandas as pd
def generate_date_dim(start_year, end_year):
dates = pd.date_range(
start=f'{start_year}-01-01',
end=f'{end_year}-12-31',
freq='D'
)
df = pd.DataFrame({'full_date': dates})
df['date_key'] = df['full_date'].dt.strftime('%Y%m%d').astype(int)
df['day_of_week'] = df['full_date'].dt.dayofweek + 1
df['day_name'] = df['full_date'].dt.day_name()
df['month_number'] = df['full_date'].dt.month
df['month_name'] = df['full_date'].dt.month_name()
df['quarter'] = df['full_date'].dt.quarter
df['year'] = df['full_date'].dt.year
df['is_weekend'] = df['day_of_week'].isin([6, 7])
return df
Surrogate Keys vs Natural Keys
┌─────────────────────────────────────────────────────────────┐
│ Surrogate Keys vs Natural Keys │
├─────────────────────────────────────────────────────────────┤
│ Natural Key │ Surrogate Key │
│ (Business identifier) │ (System-generated) │
├──────────────────────────────┼─────────────────────────────┤
│ product_sku: 'SKU-12345' │ product_key: 1001 │
│ customer_email: 'a@b.com' │ customer_key: 5432 │
│ order_number: 'ORD-2024-01' │ order_key: 98765 │
├──────────────────────────────┼─────────────────────────────┤
│ Pros: │ Pros: │
│ • Meaningful to business │ • Consistent, small size │
│ • No lookup needed │ • Handles key changes │
│ │ • Better join performance │
│ Cons: │ • Supports SCD │
│ • May change over time │ │
│ • Variable size/format │ Cons: │
│ • From multiple sources │ • Requires lookup │
│ │ • No business meaning │
└──────────────────────────────┴─────────────────────────────┘
Best Practice: Use surrogate keys in dimension tables,
keep natural keys for reference/lookup.
Grain Definition
The grain defines what one row in a fact table represents:
Grain Examples:
├── Sales Fact
│ └── One row per line item in an order
│
├── Daily Inventory Fact
│ └── One row per product per store per day
│
├── Website Clicks Fact
│ └── One row per click event
│
└── Monthly Summary Fact
└── One row per customer per product per month
Questions to Define Grain:
1. What business process does this fact represent?
2. What is the most atomic level of detail?
3. What dimensions describe each fact row?
-- Example: Defining grain for order fact
-- Grain: One row per order line item
CREATE TABLE fact_order_line (
order_line_key BIGINT PRIMARY KEY,
order_key BIGINT,
date_key INT,
product_key INT,
customer_key INT,
-- At this grain, we capture:
quantity INT, -- Per line item
unit_price DECIMAL(10,2),
line_total DECIMAL(12,2),
discount_percent DECIMAL(5,2)
);
Best Practices
- Define grain first: Always start with the grain before adding dimensions
- Use surrogate keys: Integer keys for better join performance
- Conformed dimensions: Share dimensions across fact tables for consistency
- Avoid null foreign keys: Use "Unknown" or "N/A" dimension rows instead
- Document everything: Maintain a data dictionary with business definitions
- Design for queries: Model based on how data will be analyzed
Master Data Modeling
Our Data Engineering program covers dimensional modeling, data vault, and warehouse design with hands-on projects.
Explore Data Engineering Program