What is a Data Lake?

A data lake is a centralized repository that stores all your structured and unstructured data at any scale. Unlike traditional data warehouses, data lakes can store raw data without first defining the schema, following a "schema-on-read" approach.

Modern data lakes have evolved into "lakehouses," which combine the flexibility and low-cost storage of data lakes with the data management features of data warehouses, such as ACID transactions and schema enforcement.

Data Lake vs Data Warehouse

  • Schema: Lake uses schema-on-read; Warehouse uses schema-on-write (see the sketch after this list)
  • Data types: Lake stores any format, from raw JSON to images; Warehouse stores structured, modeled data
  • Cost: Lakes use inexpensive object storage; Warehouses typically bundle storage with more expensive, tightly coupled compute
  • Users: Lakes primarily serve data scientists and data engineers; Warehouses primarily serve business analysts
  • Processing: Lakes support batch and streaming workloads; Warehouses focus on SQL analytics
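
To make the schema distinction concrete, here is a minimal PySpark sketch. The paths and column names are hypothetical; the point is that the lake infers structure at read time, while a warehouse-style load declares and applies the schema up front.

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, LongType, StringType

spark = SparkSession.builder.appName("SchemaOnRead").getOrCreate()

# Schema-on-read: land raw JSON as-is; Spark infers the structure when queried
raw = spark.read.json("/data/lake/bronze/events/")
raw.printSchema()

# Schema-on-write (warehouse style): declare the schema and apply it at load time
schema = StructType([
    StructField("id", LongType(), False),
    StructField("event_type", StringType(), True),
])
typed = spark.read.schema(schema).json("/data/lake/bronze/events/")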

Data Lake Architecture

# Medallion Architecture (Bronze, Silver, Gold)

data_lake/
├── bronze/              # Raw data (landing zone)
│   ├── sales/
│   │   └── 2024/01/01/raw_sales.parquet
│   └── customers/
│       └── 2024/01/01/raw_customers.json
├── silver/              # Cleaned, validated data
│   ├── sales/
│   │   └── cleaned_sales.parquet
│   └── customers/
│       └── validated_customers.parquet
└── gold/                # Business-ready aggregates
    ├── daily_revenue.parquet
    └── customer_360.parquet
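
The layers are typically connected by scheduled or incremental jobs. Below is a minimal bronze-to-silver-to-gold sketch in PySpark; the sales dataset and column names (sale_id, amount, sale_date) are hypothetical, and a real pipeline would add auditing and incremental loads.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("Medallion").getOrCreate()

# Bronze: raw, append-only landing data
bronze_sales = spark.read.parquet("data_lake/bronze/sales/")

# Silver: deduplicated, validated records
silver_sales = (
    bronze_sales
    .dropDuplicates(["sale_id"])
    .filter(F.col("amount") > 0)
    .withColumn("sale_date", F.to_date("sale_date"))
)
silver_sales.write.mode("overwrite").parquet("data_lake/silver/sales/")

# Gold: business-ready aggregate
(silver_sales
    .groupBy("sale_date")
    .agg(F.sum("amount").alias("daily_revenue"))
    .write.mode("overwrite")
    .parquet("data_lake/gold/daily_revenue.parquet"))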

Delta Lake

Delta Lake is an open-source storage layer that brings ACID transactions to data lakes:

from pyspark.sql import SparkSession
from delta.tables import DeltaTable

# Requires the delta-spark package on the classpath
spark = SparkSession.builder \
    .appName("DeltaLake") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
    .getOrCreate()

# Write an existing DataFrame (df) as a Delta table
df.write.format("delta").save("/data/delta/events")

# Read Delta table
events = spark.read.format("delta").load("/data/delta/events")

# Update data (ACID transaction)
delta_table = DeltaTable.forPath(spark, "/data/delta/events")

delta_table.update(
    condition="event_type = 'purchase'",
    set={"status": "'completed'"}
)

# Delete data
delta_table.delete("event_date < '2023-01-01'")

# Merge (upsert): new_data is an incoming DataFrame keyed by id
delta_table.alias("target").merge(
    new_data.alias("source"),
    "target.id = source.id"
).whenMatchedUpdateAll() \
 .whenNotMatchedInsertAll() \
 .execute()

Time Travel with Delta Lake

# Read data at a specific version
df_v0 = spark.read.format("delta") \
    .option("versionAsOf", 0) \
    .load("/data/delta/events")

# Read data at a specific timestamp
df_historical = spark.read.format("delta") \
    .option("timestampAsOf", "2024-01-01 00:00:00") \
    .load("/data/delta/events")

# View table history
history = delta_table.history()
history.show()

# Restore to previous version
delta_table.restoreToVersion(5)

# Vacuum: permanently delete files no longer referenced within the
# retention window (time travel past this window is lost)
delta_table.vacuum(168)  # Retention in hours: 168 = 7 days

Apache Iceberg

Apache Iceberg is an open table format that, like Delta Lake, brings ACID transactions, schema evolution, and time travel to data lakes:

from pyspark.sql import SparkSession

# Requires the iceberg-spark-runtime package on the classpath
spark = SparkSession.builder \
    .appName("Iceberg") \
    .config("spark.sql.catalog.iceberg", "org.apache.iceberg.spark.SparkCatalog") \
    .config("spark.sql.catalog.iceberg.type", "hadoop") \
    .config("spark.sql.catalog.iceberg.warehouse", "s3://bucket/warehouse") \
    .getOrCreate()

# Create Iceberg table
spark.sql("""
    CREATE TABLE iceberg.db.events (
        id BIGINT,
        event_type STRING,
        timestamp TIMESTAMP,
        data STRING
    ) USING iceberg
    PARTITIONED BY (days(timestamp))
""")

# Insert data
spark.sql("""
    INSERT INTO iceberg.db.events
    SELECT * FROM staging_events
""")

# Schema evolution
spark.sql("""
    ALTER TABLE iceberg.db.events
    ADD COLUMN user_id BIGINT
""")

# Time travel
spark.sql("""
    SELECT * FROM iceberg.db.events
    FOR SYSTEM_TIME AS OF '2024-01-01 00:00:00'
""")

Data Lakes in the Cloud

# AWS S3 Data Lake with Glue Catalog
import boto3

# Create S3 bucket for data lake
s3 = boto3.client('s3')
s3.create_bucket(
    Bucket='my-data-lake',
    CreateBucketConfiguration={'LocationConstraint': 'us-west-2'}
)
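
# Register a database in the Glue Data Catalog so lake tables are
# discoverable (a minimal sketch; the database name is illustrative)
glue = boto3.client('glue')
glue.create_database(DatabaseInput={'Name': 'my_data_lake'})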

# Configure Spark to read from S3 (requires the hadoop-aws package).
# ACCESS_KEY and SECRET_KEY are placeholders; prefer IAM roles or
# instance profiles over hardcoded credentials in production
spark = SparkSession.builder \
    .appName("S3DataLake") \
    .config("spark.hadoop.fs.s3a.access.key", ACCESS_KEY) \
    .config("spark.hadoop.fs.s3a.secret.key", SECRET_KEY) \
    .getOrCreate()

# Read Parquet from S3
df = spark.read.parquet("s3a://my-data-lake/bronze/events/")

# Write partitioned data to S3
df.write \
    .partitionBy("year", "month", "day") \
    .format("parquet") \
    .save("s3a://my-data-lake/silver/events/")

Data Lake Governance

  • Data Catalog: Track metadata, lineage, and schemas (AWS Glue, Apache Atlas)
  • Access Control: Fine-grained permissions (Lake Formation, Ranger)
  • Data Quality: Validate data at ingestion (Great Expectations, dbt tests); see the sketch after this list
  • Lineage: Track data transformations end-to-end
  • Encryption: Encrypt data at rest and in transit
  • Retention: Implement lifecycle policies for cost optimization
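
As a minimal illustration of ingestion-time validation (plain PySpark rather than a dedicated framework; the sale_id and amount columns are hypothetical), incoming records can be split into valid and quarantined sets:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("DataQuality").getOrCreate()
incoming = spark.read.parquet("data_lake/bronze/sales/")

# Rule: sale_id must be present and amount must be a positive number
is_valid = (
    F.col("sale_id").isNotNull()
    & F.col("amount").isNotNull()
    & (F.col("amount") > 0)
)

valid = incoming.filter(is_valid)
quarantined = incoming.filter(~is_valid)

valid.write.mode("append").parquet("data_lake/silver/sales/")
quarantined.write.mode("append").parquet("data_lake/quarantine/sales/")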

Best Practices

  • Use columnar formats: Parquet or ORC for analytics
  • Partition wisely: Choose partition keys based on query patterns
  • Implement medallion architecture: Bronze, Silver, Gold layers
  • Enable versioning: Use Delta Lake or Iceberg for ACID support
  • Catalog everything: Metadata is crucial for discoverability
  • Compact small files: Schedule regular compaction jobs for performance (see the sketch below)
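
For Delta tables, compaction can run as a scheduled maintenance job. A minimal sketch using the OPTIMIZE API available in Delta Lake 2.0 and later (assuming the Delta-enabled SparkSession from earlier):

from delta.tables import DeltaTable

delta_table = DeltaTable.forPath(spark, "/data/delta/events")

# Coalesce many small files into fewer, larger ones
delta_table.optimize().executeCompaction()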
