What is Apache Spark?
Apache Spark is a unified analytics engine for large-scale data processing. Originally developed at UC Berkeley's AMPLab, Spark has become the de facto standard for big data processing, running some in-memory workloads up to 100x faster than Hadoop MapReduce.
Spark provides APIs in Python (PySpark), Scala, Java, and R, with built-in modules for SQL, streaming, machine learning, and graph processing.
Why Spark?
- Speed: In-memory computing makes it up to 100x faster than MapReduce for some workloads
- Unified: One engine for batch, streaming, ML, and graph processing
- Easy to Use: High-level APIs that simplify complex operations
- Scalable: Scales from single machine to thousands of nodes
- Versatile: Runs on Hadoop, Kubernetes, standalone, or cloud
Spark Architecture
Spark uses a master-worker architecture:
- Driver: The main program that creates SparkContext and coordinates execution
- Cluster Manager: Allocates resources (YARN, Kubernetes, Mesos, Standalone)
- Executors: Worker processes that run tasks and store data
from pyspark.sql import SparkSession
# Create Spark session
spark = SparkSession.builder \
    .appName("MyApp") \
    .config("spark.executor.memory", "4g") \
    .config("spark.executor.cores", "2") \
    .getOrCreate()
# Access SparkContext
sc = spark.sparkContext
Working with DataFrames
DataFrames are the primary abstraction in Spark: distributed collections of data organized into named columns, similar to tables in a relational database.
# Read data
df = spark.read.csv("data.csv", header=True, inferSchema=True)
df = spark.read.parquet("data.parquet")
df = spark.read.json("data.json")
# Basic operations
df.show(5)
df.printSchema()
df.describe().show()
# Select and filter
df.select("name", "age").show()
df.filter(df.age > 25).show()
df.filter((df.age > 25) & (df.city == "NYC")).show()
# Add/rename columns
from pyspark.sql.functions import col, lit, when
df = df.withColumn("age_plus_10", col("age") + 10)
df = df.withColumnRenamed("name", "full_name")
df = df.withColumn("category",
when(col("age") < 30, "young")
.when(col("age") < 50, "middle")
.otherwise("senior")
)
Aggregations and Grouping
# Note: these imports shadow Python's built-in sum, max, and min in this scope
from pyspark.sql.functions import sum, avg, count, max, min
# Group by with aggregations
df.groupBy("department") \
.agg(
count("*").alias("employee_count"),
avg("salary").alias("avg_salary"),
max("salary").alias("max_salary")
).show()
# Multiple grouping columns
df.groupBy("department", "city") \
.agg(sum("sales").alias("total_sales")) \
.orderBy("total_sales", ascending=False) \
.show()
# Window functions
from pyspark.sql.window import Window
from pyspark.sql.functions import rank, sum

window = Window.partitionBy("department").orderBy("salary")
df = df.withColumn("rank", rank().over(window))
df = df.withColumn("running_total", sum("salary").over(window))
Spark SQL
# Register DataFrame as temp view
df.createOrReplaceTempView("employees")
# Run SQL queries
result = spark.sql("""
SELECT department,
COUNT(*) as emp_count,
AVG(salary) as avg_salary
FROM employees
WHERE age > 25
GROUP BY department
HAVING COUNT(*) > 10
ORDER BY avg_salary DESC
""")
result.show()
# Complex queries with CTEs
spark.sql("""
WITH dept_stats AS (
SELECT department, AVG(salary) as avg_sal
FROM employees
GROUP BY department
)
SELECT e.*, d.avg_sal
FROM employees e
JOIN dept_stats d ON e.department = d.department
WHERE e.salary > d.avg_sal
""").show()
Joins
# Different join types
joined = df1.join(df2, df1.id == df2.id, "inner")  # Output keeps both id columns
joined = df1.join(df2, "id", "left")  # Same column name: output has a single id column
joined = df1.join(df2, ["id", "date"], "outer")
# Broadcast join for small tables
from pyspark.sql.functions import broadcast
# Small table is broadcast to all workers
result = large_df.join(broadcast(small_df), "key")
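Spark also broadcasts joins automatically when one side is smaller than spark.sql.autoBroadcastJoinThreshold (10MB by default). A minimal sketch of tuning it; the 50MB value is illustrative:
# Raise the automatic broadcast threshold to 50MB
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", str(50 * 1024 * 1024))
# Or disable automatic broadcast joins entirely
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")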
Performance Optimization
- Partitioning: Control data distribution across nodes
- Caching: Persist frequently used DataFrames in memory
- Broadcast: Broadcast small tables to avoid shuffles
- Predicate Pushdown: Push filters down to data source
# Repartition data
df = df.repartition(100)  # Full shuffle into 100 partitions
df = df.repartition("date")  # Redistribute rows by column value
df = df.coalesce(10)  # Reduce partition count (avoids a full shuffle)
# Cache DataFrames
df.cache() # or df.persist()
df.count() # Trigger caching
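# persist() accepts an explicit storage level; a minimal sketch, assuming
# the DataFrame may not fit entirely in memory and should spill to disk
from pyspark import StorageLevel
df.persist(StorageLevel.MEMORY_AND_DISK)
df.unpersist()  # Release the cache when no longer needed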
# Check execution plan
df.explain(True)
# Write partitioned data
df.write.partitionBy("year", "month") \
    .parquet("output/data")
Spark Streaming
# Structured Streaming
from pyspark.sql.functions import window
# Read streaming data
stream_df = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("subscribe", "events") \
    .load()
# Process stream
processed = stream_df \
    .selectExpr("CAST(value AS STRING)", "timestamp") \
    .groupBy(window("timestamp", "5 minutes")) \
    .count()
# Write stream
query = processed.writeStream \
    .outputMode("complete") \
    .format("console") \
    .start()
query.awaitTermination()
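The console sink is handy for debugging, but a durable sink needs a checkpoint location so the query can recover after failure. A minimal sketch writing to Parquet; paths are illustrative, and note that file sinks only support append mode, which for windowed aggregations requires a watermark to bound state:
# Append mode on an aggregation requires a watermark
windowed = stream_df \
    .selectExpr("CAST(value AS STRING)", "timestamp") \
    .withWatermark("timestamp", "10 minutes") \
    .groupBy(window("timestamp", "5 minutes")) \
    .count()

query = windowed.writeStream \
    .format("parquet") \
    .option("path", "output/stream") \
    .option("checkpointLocation", "checkpoints/stream") \
    .outputMode("append") \
    .start()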
Best Practices
- Use DataFrames over RDDs: Better optimization and cleaner code
- Avoid UDFs when possible: Built-in functions are optimized by the Catalyst engine (see the sketch after this list)
- Right-size partitions: Roughly 128MB per partition is a common rule of thumb
- Use appropriate file formats: Parquet for analytics workloads
- Monitor and tune: Use Spark UI to identify bottlenecks
- Handle skew: Salt keys or enable adaptive query execution (see the sketch after this list)
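Two of these practices in code. A minimal sketch: enabling adaptive query execution, which includes skew-join splitting in Spark 3.x (both settings are on by default in recent releases), and replacing a Python UDF with the equivalent built-in; the name_upper column is illustrative:
# Adaptive query execution coalesces small partitions and splits skewed joins
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")

from pyspark.sql.functions import udf, upper, col
from pyspark.sql.types import StringType

# Avoid: a Python UDF is opaque to the optimizer and pays serialization cost
upper_udf = udf(lambda s: s.upper() if s else None, StringType())
df = df.withColumn("name_upper", upper_udf(col("full_name")))

# Prefer: the built-in upper() runs inside the JVM and can be optimized
df = df.withColumn("name_upper", upper(col("full_name")))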
Master Apache Spark with Expert Mentorship
Our Data Engineering program covers Spark from basics to production optimization. Build real big data pipelines with guidance from industry experts.
Explore Data Engineering Program