What is Apache Spark?
Apache Spark is a unified analytics engine for large-scale data processing. Originally developed at UC Berkeley's AMPLab, Spark has become the de facto standard for big data processing, running some in-memory workloads up to 100x faster than Hadoop MapReduce.
Spark provides APIs in Python (PySpark), Scala, Java, and R, with built-in modules for SQL, streaming, machine learning, and graph processing.
Why Spark?
- Speed: In-memory computing makes it up to 100x faster than MapReduce for some workloads
- Unified: One engine for batch, streaming, ML, and graph processing
- Easy to Use: High-level APIs that simplify complex operations
- Scalable: Scales from single machine to thousands of nodes
- Versatile: Runs on Hadoop, Kubernetes, standalone, or cloud
Spark Architecture
Spark uses a master-worker architecture:
- Driver: The main program that creates SparkContext and coordinates execution
- Cluster Manager: Allocates resources (YARN, Kubernetes, Mesos, Standalone)
- Executors: Worker processes that run tasks and store data
from pyspark.sql import SparkSession
# Create Spark session
spark = SparkSession.builder \
    .appName("MyApp") \
    .config("spark.executor.memory", "4g") \
    .config("spark.executor.cores", "2") \
    .getOrCreate()
# Access SparkContext
sc = spark.sparkContext
Working with DataFrames
DataFrames are the primary abstraction in Spark: distributed collections of data organized into named columns, similar to tables in a relational database.
# Read data
df = spark.read.csv("data.csv", header=True, inferSchema=True)
df = spark.read.parquet("data.parquet")
df = spark.read.json("data.json")
# Basic operations
df.show(5)
df.printSchema()
df.describe().show()
# Select and filter
df.select("name", "age").show()
df.filter(df.age > 25).show()
df.filter((df.age > 25) & (df.city == "NYC")).show()
# Add/rename columns
from pyspark.sql.functions import col, lit, when
df = df.withColumn("age_plus_10", col("age") + 10)
df = df.withColumnRenamed("name", "full_name")
df = df.withColumn("category",
when(col("age") < 30, "young")
.when(col("age") < 50, "middle")
.otherwise("senior")
)
Aggregations and Grouping
# Note: these imports shadow Python's built-in sum, max, and min in this scope
from pyspark.sql.functions import sum, avg, count, max, min
# Group by with aggregations
df.groupBy("department") \
.agg(
count("*").alias("employee_count"),
avg("salary").alias("avg_salary"),
max("salary").alias("max_salary")
).show()
# Multiple grouping columns
df.groupBy("department", "city") \
.agg(sum("sales").alias("total_sales")) \
.orderBy("total_sales", ascending=False) \
.show()
# Window functions
from pyspark.sql.window import Window
from pyspark.sql.functions import rank, sum

window = Window.partitionBy("department").orderBy("salary")
df = df.withColumn("rank", rank().over(window))
df = df.withColumn("running_total", sum("salary").over(window))
Spark SQL
# Register DataFrame as temp view
df.createOrReplaceTempView("employees")
# Run SQL queries
result = spark.sql("""
SELECT department,
COUNT(*) as emp_count,
AVG(salary) as avg_salary
FROM employees
WHERE age > 25
GROUP BY department
HAVING COUNT(*) > 10
ORDER BY avg_salary DESC
""")
result.show()
# Complex queries with CTEs
spark.sql("""
WITH dept_stats AS (
SELECT department, AVG(salary) as avg_sal
FROM employees
GROUP BY department
)
SELECT e.*, d.avg_sal
FROM employees e
JOIN dept_stats d ON e.department = d.department
WHERE e.salary > d.avg_sal
""").show()
Joins
# Different join types
joined = df1.join(df2, df1.id == df2.id, "inner")  # Output keeps both id columns
joined = df1.join(df2, "id", "left")  # Same column name: output has a single id column
joined = df1.join(df2, ["id", "date"], "outer")
# Broadcast join for small tables
from pyspark.sql.functions import broadcast
# Small table is broadcast to all workers
result = large_df.join(broadcast(small_df), "key")
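Spark also broadcasts joins automatically when one side is smaller than spark.sql.autoBroadcastJoinThreshold (10MB by default). A minimal sketch of tuning it; the 50MB value is illustrative:
# Raise the automatic broadcast threshold to 50MB
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", str(50 * 1024 * 1024))
# Or disable automatic broadcast joins entirely
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")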
Performance Optimization
- Partitioning: Control data distribution across nodes
- Caching: Persist frequently used DataFrames in memory
- Broadcast: Broadcast small tables to avoid shuffles
- Predicate Pushdown: Push filters down to data source
# Repartition data
df = df.repartition(100)  # Full shuffle into 100 partitions
df = df.repartition("date")  # Redistribute rows by column value
df = df.coalesce(10)  # Reduce partition count (avoids a full shuffle)
# Cache DataFrames
df.cache() # or df.persist()
df.count() # Trigger caching
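# persist() accepts an explicit storage level; a minimal sketch, assuming
# the DataFrame may not fit entirely in memory and should spill to disk
from pyspark import StorageLevel
df.persist(StorageLevel.MEMORY_AND_DISK)
df.unpersist()  # Release the cache when no longer needed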
# Check execution plan
df.explain(True)
# Write partitioned data
df.write.partitionBy("year", "month") \
    .parquet("output/data")
Spark Streaming
# Structured Streaming
from pyspark.sql.functions import window
# Read streaming data
stream_df = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("subscribe", "events") \
    .load()
# Process stream
processed = stream_df \
    .selectExpr("CAST(value AS STRING)", "timestamp") \
    .groupBy(window("timestamp", "5 minutes")) \
    .count()
# Write stream
query = processed.writeStream \
    .outputMode("complete") \
    .format("console") \
    .start()
query.awaitTermination()
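The console sink is handy for debugging, but a durable sink needs a checkpoint location so the query can recover after failure. A minimal sketch writing to Parquet; paths are illustrative, and note that file sinks only support append mode, which for windowed aggregations requires a watermark to bound state:
# Append mode on an aggregation requires a watermark
windowed = stream_df \
    .selectExpr("CAST(value AS STRING)", "timestamp") \
    .withWatermark("timestamp", "10 minutes") \
    .groupBy(window("timestamp", "5 minutes")) \
    .count()

query = windowed.writeStream \
    .format("parquet") \
    .option("path", "output/stream") \
    .option("checkpointLocation", "checkpoints/stream") \
    .outputMode("append") \
    .start()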
Best Practices
- Use DataFrames over RDDs: Better optimization and cleaner code
- Avoid UDFs when possible: Built-in functions are optimized by the Catalyst engine (see the sketch after this list)
- Right-size partitions: Roughly 128MB per partition is a common rule of thumb
- Use appropriate file formats: Parquet for analytics workloads
- Monitor and tune: Use Spark UI to identify bottlenecks
- Handle skew: Salt keys or enable adaptive query execution (see the sketch after this list)
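Two of these practices in code. A minimal sketch: enabling adaptive query execution, which includes skew-join splitting in Spark 3.x (both settings are on by default in recent releases), and replacing a Python UDF with the equivalent built-in; the name_upper column is illustrative:
# Adaptive query execution coalesces small partitions and splits skewed joins
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")

from pyspark.sql.functions import udf, upper, col
from pyspark.sql.types import StringType

# Avoid: a Python UDF is opaque to the optimizer and pays serialization cost
upper_udf = udf(lambda s: s.upper() if s else None, StringType())
df = df.withColumn("name_upper", upper_udf(col("full_name")))

# Prefer: the built-in upper() runs inside the JVM and can be optimized
df = df.withColumn("name_upper", upper(col("full_name")))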
Master Apache Spark with Expert Mentorship
Our Data Engineering program covers Spark from basics to production optimization. Build real big data pipelines with guidance from industry experts.
Explore Data Engineering Program