Why Linear Algebra Matters in ML
Linear algebra is the language of machine learning. Every dataset is a matrix, every data point is a vector, and most ML algorithms are matrix operations under the hood.
Understanding linear algebra helps you:
- Understand how algorithms work: Not just use them as black boxes
- Debug models: Recognize dimension mismatches and numerical issues
- Optimize performance: Vectorized code often runs orders of magnitude faster than Python loops (see the timing sketch after this list)
- Read research papers: Most ML papers use matrix notation
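A quick, unscientific way to see that speedup for yourself (exact timings vary by machine):
import time
import numpy as np
a = np.random.rand(1_000_000)
b = np.random.rand(1_000_000)
# Dot product with a pure-Python loop
start = time.perf_counter()
total = sum(x * y for x, y in zip(a, b))
loop_time = time.perf_counter() - start
# The same computation, vectorized
start = time.perf_counter()
total_vec = np.dot(a, b)
vec_time = time.perf_counter() - start
print(f"Loop: {loop_time:.3f}s Vectorized: {vec_time:.5f}s")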
Vectors: The Building Blocks
A vector is simply an ordered list of numbers. In ML, each data point is a vector where each element represents a feature.
import numpy as np
# A data point: [age, income, credit_score]
customer = np.array([35, 75000, 720])
# This is a 3-dimensional vector
print(f"Shape: {customer.shape}") # (3,)
print(f"Dimensions: {customer.ndim}") # 1
# Real-world example: Image as a vector
# A 28x28 grayscale image = 784-dimensional vector
image = np.random.rand(784) # Flattened image
# Vector operations
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
# Element-wise addition
print(a + b) # [5 7 9]
# Scalar multiplication
print(2 * a) # [2 4 6]
# Dot product (most important operation in ML!)
dot_product = np.dot(a, b) # 1*4 + 2*5 + 3*6 = 32
print(f"Dot product: {dot_product}")
The Dot Product: Heart of ML
The dot product measures similarity between vectors and is the core operation in neural networks, SVMs, and recommendation systems.
# Why dot product matters in ML:
# 1. Linear Regression is just a dot product
weights = np.array([0.5, 0.3, 0.2]) # learned weights
features = np.array([100, 50, 30]) # input features
prediction = np.dot(weights, features) # weighted sum
# 2. Cosine similarity (used in NLP, recommendations)
def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
doc1 = np.array([1, 0, 1, 0]) # word frequencies
doc2 = np.array([1, 0, 1, 1])
similarity = cosine_similarity(doc1, doc2)
print(f"Document similarity: {similarity:.2f}") # 0.87
# 3. Neural network layer is matrix of dot products
# output = activation(weights @ inputs + bias)
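To make that last comment concrete, here is a minimal sketch of a single dense layer (the shapes and the ReLU activation are illustrative choices, not tied to any particular framework):
def relu(z):
    return np.maximum(0, z)  # ReLU: zero out negative values
x = np.array([1.0, 2.0, 3.0])   # 3 input features
W = np.random.randn(4, 3)       # 4 neurons, 3 weights each
b = np.random.randn(4)          # one bias per neuron
layer_output = relu(W @ x + b)  # one dot product per neuron
print(layer_output.shape)       # (4,)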
Matrices: Data in Tabular Form
A matrix is a 2D array of numbers. Your entire dataset is a matrix where rows are samples and columns are features.
# Dataset as a matrix
# 100 customers, 5 features each
X = np.random.rand(100, 5)
print(f"Shape: {X.shape}") # (100, 5) = (samples, features)
# Matrix operations
A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])
# Matrix multiplication (rows of A dot columns of B)
C = np.dot(A, B) # or A @ B
print(C)
# [[19 22]
#  [43 50]]
# Transpose: swap rows and columns
print(A.T)
# [[1 3]
#  [2 4]]
# Why transpose matters:
# - Converting row vectors to column vectors
# - Computing covariance matrices
# - Backpropagation in neural networks
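# Quick sketch: building a covariance matrix uses the transpose
# (X_raw is hypothetical data, just for illustration)
X_raw = np.random.rand(10, 2)
X_c = X_raw - X_raw.mean(axis=0)        # center each feature
cov = X_c.T @ X_c / (X_c.shape[0] - 1)  # (2, 2) covariance matrix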
# Identity matrix: A @ I = I @ A = A
I = np.eye(3)
print(I)
# [[1. 0. 0.]
#  [0. 1. 0.]
#  [0. 0. 1.]]
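The inverse plays the same role in reverse: multiplying a matrix by its inverse returns the identity. A quick sketch (the 2x2 matrix here is just an arbitrary invertible example):
M = np.array([[4.0, 7.0], [2.0, 6.0]])
M_inv = np.linalg.inv(M)
print(np.allclose(M @ M_inv, np.eye(2)))  # True (up to floating-point error)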
Matrix Multiplication in ML
# Neural Network: Just matrix multiplications!
# Input layer: 100 samples, 784 features (images)
X = np.random.rand(100, 784)
# Hidden layer weights: 784 inputs -> 128 neurons
W1 = np.random.rand(784, 128)
# Output layer weights: 128 -> 10 classes
W2 = np.random.rand(128, 10)
# Forward pass (simplified)
hidden = X @ W1 # (100, 784) @ (784, 128) = (100, 128)
output = hidden @ W2 # (100, 128) @ (128, 10) = (100, 10)
print(f"Hidden shape: {hidden.shape}")
print(f"Output shape: {output.shape}")
# The magic: 100 predictions in one operation!
# Without vectorization: 100 * 784 * 128 multiply-adds in slow Python loops
# With matrices: the same arithmetic in one optimized (BLAS) call
# Matrix multiplication rule:
# (m, n) @ (n, p) = (m, p)
# Inner dimensions must match!
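This rule is also your first debugging tool, because shape mismatches fail loudly. A sketch of the classic mistake, a transposed weight matrix (the exact error message varies by NumPy version):
X = np.random.rand(100, 784)
W_bad = np.random.rand(128, 784)  # should be (784, 128)
try:
    X @ W_bad                     # (100, 784) @ (128, 784) -> mismatch
except ValueError as err:
    print(f"Shape error: {err}")
hidden = X @ W_bad.T              # transpose fixes it: (100, 128)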
Eigenvalues & Eigenvectors
Eigenvectors are special directions that don't change when a matrix transformation is applied - they only get scaled. This concept powers PCA, PageRank, and many ML algorithms.
# For a matrix A, eigenvector v satisfies: Av = λv
# λ (lambda) is the eigenvalue (scaling factor)
A = np.array([[4, 2], [1, 3]])
# Compute eigenvalues and eigenvectors
eigenvalues, eigenvectors = np.linalg.eig(A)
print(f"Eigenvalues: {eigenvalues}") # [5, 2]
print(f"Eigenvectors:\n{eigenvectors}")
# Verify: A @ v = λ * v
v = eigenvectors[:, 0] # First eigenvector
lambda1 = eigenvalues[0] # First eigenvalue
print(f"A @ v = {A @ v}")
print(f"λ * v = {lambda1 * v}")
# They're equal!
# Why eigenvectors matter:
# 1. PCA: Find directions of maximum variance
# 2. PageRank: Find a stable ranking of web pages (toy sketch below)
# 3. Spectral clustering: Find graph structure
# 4. Covariance analysis: Understand data spread
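To give a taste of the PageRank idea, here is a toy power iteration on a hypothetical 3-page web (no damping factor, so this sketches only the core math): repeatedly multiplying by the link matrix converges to its dominant eigenvector, which is the ranking.
# Column-stochastic link matrix: entry (i, j) = probability of
# following a link from page j to page i
L = np.array([[0.0, 0.5, 1.0],
              [0.5, 0.0, 0.0],
              [0.5, 0.5, 0.0]])
rank = np.ones(3) / 3  # start with uniform ranks
for _ in range(50):
    rank = L @ rank    # power iteration
print(rank)            # dominant eigenvector = page ranks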
PCA: Eigenvalues in Action
from sklearn.decomposition import PCA
from sklearn.datasets import load_iris
# Load data
iris = load_iris()
X = iris.data # 150 samples, 4 features
# Manual PCA using eigenvalues
# Step 1: Center the data
X_centered = X - X.mean(axis=0)
# Step 2: Compute covariance matrix
cov_matrix = np.cov(X_centered.T)
# Step 3: Get eigenvalues and eigenvectors
# (np.linalg.eigh is the right call for symmetric matrices like a
# covariance matrix: it's faster and guarantees real-valued results)
eigenvalues, eigenvectors = np.linalg.eigh(cov_matrix)
# Step 4: Sort by eigenvalue (variance explained)
sorted_idx = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[sorted_idx]
eigenvectors = eigenvectors[:, sorted_idx]
# Variance explained by each component
total_variance = eigenvalues.sum()
variance_ratio = eigenvalues / total_variance
print(f"Variance explained: {variance_ratio}")
# [0.92 0.05 0.02 0.01] (rounded) - the first 2 components capture ~98%!
# Using sklearn (does the same thing)
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)
print(f"Reduced shape: {X_reduced.shape}") # (150, 2)
Norms: Measuring Vector Size
# Norms measure the "size" or "length" of a vector
v = np.array([3, 4])
# L2 Norm (Euclidean distance) - most common
l2_norm = np.linalg.norm(v) # sqrt(3² + 4²) = 5
print(f"L2 norm: {l2_norm}")
# L1 Norm (Manhattan distance)
l1_norm = np.linalg.norm(v, ord=1) # |3| + |4| = 7
print(f"L1 norm: {l1_norm}")
# Why norms matter in ML:
# 1. Regularization: L2 (Ridge) and L1 (Lasso)
# 2. Gradient clipping: Prevent exploding gradients (sketched below)
# 3. Normalization: Scale features to unit norm
# L2 Regularization: Penalize large weights
weights = np.array([0.5, -0.3, 0.8, -0.1])
l2_penalty = np.linalg.norm(weights) ** 2
print(f"L2 penalty: {l2_penalty}")
# L1 Regularization: Encourages sparsity (zeros)
l1_penalty = np.linalg.norm(weights, ord=1)
print(f"L1 penalty: {l1_penalty}")
Broadcasting: NumPy Magic
# Broadcasting allows operations on arrays of different shapes
# Subtract mean from each column (centering)
X = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])
mean = X.mean(axis=0) # [4, 5, 6]
X_centered = X - mean # Broadcasts mean to each row
print(X_centered)
# [[-3 -3 -3]
#  [ 0  0  0]
#  [ 3  3  3]]
# Normalize each row to unit length
norms = np.linalg.norm(X, axis=1, keepdims=True)
X_normalized = X / norms # Each row divided by its norm
# Add bias to each sample
bias = np.array([0.1, 0.2, 0.3])
X_biased = X + bias # Broadcasts to all rows
# Broadcasting rules:
# 1. Dimensions are compared from right to left
# 2. Dimensions match if equal or one of them is 1
# (3, 4) + (4,) = (3, 4) ✓
# (3, 4) + (3, 1) = (3, 4) ✓
# (3, 4) + (3,) = Error ✗
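When the error case above bites, give the 1-D array an explicit axis so the shapes line up:
M = np.random.rand(3, 4)
col = np.random.rand(3)
# M + col would raise a ValueError: (3, 4) vs (3,)
M + col[:, np.newaxis]  # (3, 4) + (3, 1) = (3, 4) ✓
M + col.reshape(-1, 1)  # equivalent fix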
Linear Algebra Cheat Sheet
# Essential operations for ML
import numpy as np
# Vector operations
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
np.dot(a, b) # Dot product
np.linalg.norm(a) # L2 norm (length)
a / np.linalg.norm(a) # Unit vector
# Matrix operations
A = np.random.rand(3, 4)
B = np.random.rand(4, 5)
A @ B # Matrix multiplication
A.T # Transpose
np.linalg.inv(A @ A.T) # Inverse (square matrices only)
np.linalg.det(A @ A.T) # Determinant
# Decompositions
np.linalg.eig(A @ A.T) # Eigenvalues/vectors
np.linalg.svd(A) # Singular Value Decomposition
# Solving linear systems: Ax = b
A = np.array([[3, 1], [1, 2]])
b = np.array([9, 8])
x = np.linalg.solve(A, b) # More stable than inv(A) @ b
# Random matrices (for initialization)
np.random.rand(3, 4) # Uniform [0, 1)
np.random.randn(3, 4) # Standard normal
np.eye(3) # Identity matrix
np.zeros((3, 4)) # Zeros
np.ones((3, 4)) # Ones
When You'll Use This
- Neural Networks: Forward/backward pass are matrix multiplications
- PCA: Eigenvalue decomposition for dimensionality reduction
- Recommendation Systems: Matrix factorization (SVD)
- Word Embeddings: Vectors represent word meanings
- Image Processing: Images as matrices, convolutions as operations
- Regularization: L1/L2 norms prevent overfitting
Master the Math of ML
Our Data Science program builds your mathematical intuition alongside practical skills.
Explore Data Science Program