What is Anomaly Detection?
Anomaly detection (also called outlier detection) is the process of identifying data points, events, or observations that deviate significantly from the expected pattern in a dataset. These unusual instances often indicate critical information such as fraud, equipment failure, or security breaches.
Unlike supervised learning, where you have labeled examples of both normal and anomalous data, anomaly detection often works with mostly normal data and must identify rare, unusual occurrences without explicit examples.
Key characteristics: anomalies are rare (typically less than 5% of the data), differ significantly from normal patterns, and often carry the most valuable insights in your dataset.
Why Anomaly Detection Matters
Anomaly detection has critical applications across industries:
- Fraud Detection: Identify fraudulent credit card transactions, insurance claims, or online activities
- Network Security: Detect intrusions, DDoS attacks, and unusual network traffic patterns
- Healthcare: Monitor patient vitals, detect disease outbreaks, identify medical imaging abnormalities
- Manufacturing: Predict equipment failures, detect defective products on assembly lines
- Finance: Detect market manipulation, unusual trading patterns, money laundering
- IoT & Sensors: Monitor sensor data for equipment health, environmental anomalies
When to Use Anomaly Detection
Choose anomaly detection when:
- You have mostly normal data with rare abnormal cases
- Labeling anomalies is expensive or impossible
- The nature of anomalies changes over time (fraud patterns evolve)
- You need to monitor systems in real-time for unusual behavior
- Some false positives are acceptable (flagged cases can be reviewed by humans)
- The cost of missing an anomaly is high (safety, security, financial loss)
Types of Anomalies
1. Point Anomalies
Individual data points that are anomalous relative to the rest of the data. Example: A single fraudulent transaction in a stream of normal purchases.
2. Contextual Anomalies
Data points that are anomalous only in a specific context. Example: A temperature of 70°F is normal in summer but anomalous in winter (a minimal sketch of this idea follows these definitions).
3. Collective Anomalies
A collection of data points that together represent an anomaly. Example: Multiple small withdrawals that together indicate suspicious activity.
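As a minimal sketch of the contextual case, the snippet below (using a synthetic temperature dataset invented purely for illustration) scores each reading against the statistics of its own calendar month, so a value that is globally unremarkable can still be flagged for its context:
import numpy as np
import pandas as pd
# Synthetic daily readings: month number plus a temperature value
rng = np.random.default_rng(0)
df = pd.DataFrame({
    'month': rng.integers(1, 13, 1000),
    'temp_f': rng.normal(55, 10, 1000)
})
# Context = calendar month: per-month mean and standard deviation
stats_by_month = df.groupby('month')['temp_f'].agg(['mean', 'std'])
# Z-score of each reading relative to its own month
df = df.join(stats_by_month, on='month')
df['contextual_z'] = (df['temp_f'] - df['mean']) / df['std']
# A reading is contextually anomalous if it is extreme for its month
df['contextual_anomaly'] = df['contextual_z'].abs() > 3
print(f"{df['contextual_anomaly'].sum()} contextual anomalies")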
Statistical Methods
Statistical approaches assume data follows a known distribution and flag points that fall outside expected ranges.
Z-Score Method
Measures how many standard deviations a point is from the mean. Points beyond 3 standard deviations are typically considered anomalies.
import numpy as np
from scipy import stats
# Calculate z-scores (data is assumed to be a 1-D NumPy array or pandas Series)
z_scores = np.abs(stats.zscore(data))
# Flag anomalies (threshold = 3)
anomalies = z_scores > 3
print(f"Found {anomalies.sum()} anomalies")
Interquartile Range (IQR)
Uses quartiles to define the normal range. Points below Q1 - 1.5*IQR or above Q3 + 1.5*IQR are flagged as anomalies.
import pandas as pd
# Calculate IQR (data is assumed to be a pandas Series)
Q1 = data.quantile(0.25)
Q3 = data.quantile(0.75)
IQR = Q3 - Q1
# Define bounds
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
# Identify anomalies
anomalies = (data < lower_bound) | (data > upper_bound)
Isolation Forest
Isolation Forest is a powerful tree-based algorithm that isolates anomalies instead of profiling normal points. It works on the principle that anomalies are few and different, making them easier to isolate.
How it Works
- Randomly select a feature and split value
- Recursively partition the data
- Anomalies require fewer splits to isolate (shorter path length)
- Average path length across multiple trees determines the anomaly score (see the toy sketch below)
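To make the isolation idea concrete, here is a toy sketch (not how scikit-learn implements it) that counts how many random axis-aligned splits it takes to isolate a point; the obvious outlier in the synthetic data is typically separated in far fewer splits than a point inside the cluster:
import numpy as np
def splits_to_isolate(data, point, rng, max_depth=50):
    # Count random axis-aligned splits until `point` is alone in its partition
    current = data
    depth = 0
    while len(current) > 1 and depth < max_depth:
        feature = rng.integers(current.shape[1])
        lo, hi = current[:, feature].min(), current[:, feature].max()
        if lo == hi:
            break
        split = rng.uniform(lo, hi)
        below = current[:, feature] < split
        current = current[below] if point[feature] < split else current[~below]
        depth += 1
    return depth
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (200, 2)), [[8.0, 8.0]]])  # 200 normal points + 1 outlier
# Shorter average path length = easier to isolate = more anomalous
normal_depth = np.mean([splits_to_isolate(X, X[0], rng) for _ in range(50)])
outlier_depth = np.mean([splits_to_isolate(X, X[-1], rng) for _ in range(50)])
print(f"Average splits to isolate a normal point: {normal_depth:.1f}, the outlier: {outlier_depth:.1f}")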
Implementation
from sklearn.ensemble import IsolationForest
import pandas as pd
# Load your data
df = pd.read_csv('transactions.csv')
X = df[['amount', 'time', 'merchant_type']]  # columns assumed to be numerically encoded
# Train Isolation Forest
iso_forest = IsolationForest(
    contamination=0.05,  # Expected proportion of anomalies
    random_state=42,
    n_estimators=100
)
# Fit and predict (-1 for anomalies, 1 for normal)
predictions = iso_forest.fit_predict(X)
# Get anomaly scores (lower = more anomalous)
scores = iso_forest.score_samples(X)
# Add to dataframe
df['anomaly'] = predictions
df['anomaly_score'] = scores
# View anomalies
anomalies_df = df[df['anomaly'] == -1]
print(f"Found {len(anomalies_df)} anomalies")
print(anomalies_df.head())
Advantages
- Fast and scalable to large datasets
- No assumptions about data distribution
- Works well in high-dimensional spaces
- Low memory requirements
One-Class SVM
One-Class Support Vector Machine learns a decision boundary around normal data. Points that fall outside this boundary are considered anomalies.
Implementation
from sklearn.svm import OneClassSVM
from sklearn.preprocessing import StandardScaler
# Scale features (important for SVM)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Train One-Class SVM
oc_svm = OneClassSVM(
    kernel='rbf',  # Radial basis function kernel
    gamma='auto',  # Kernel coefficient
    nu=0.05        # Upper bound on the fraction of training errors (roughly the expected anomaly fraction)
)
# Fit and predict
predictions = oc_svm.fit_predict(X_scaled)
# -1 for anomalies, 1 for normal
df['is_anomaly'] = predictions == -1
# Decision function gives distance from boundary
df['decision_score'] = oc_svm.decision_function(X_scaled)
When to Use One-Class SVM
- Small to medium-sized datasets (computationally expensive on large data)
- Need a well-defined decision boundary
- Data has complex, non-linear patterns
- You can tune hyperparameters properly (a tuning sketch follows this list)
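One-Class SVM has no labels to tune against by default, so a common workaround, sketched below, is to pick nu and gamma with a small labeled validation set if you have even a handful of confirmed anomalies. X_train, X_val, and y_val (1 = anomaly) are assumptions here and are not defined elsewhere in this article:
from itertools import product
from sklearn.svm import OneClassSVM
from sklearn.metrics import f1_score
from sklearn.preprocessing import StandardScaler
# Scale with statistics from the (mostly normal) training set
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_val_s = scaler.transform(X_val)
best = (None, -1.0)
for nu, gamma in product([0.01, 0.05, 0.1], [0.01, 0.1, 1.0]):
    model = OneClassSVM(kernel='rbf', nu=nu, gamma=gamma).fit(X_train_s)
    pred_anomaly = model.predict(X_val_s) == -1  # -1 means anomaly
    score = f1_score(y_val, pred_anomaly)        # y_val: 1 = anomaly
    if score > best[1]:
        best = ((nu, gamma), score)
print(f"Best (nu, gamma): {best[0]}, validation F1: {best[1]:.3f}")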
Autoencoders for Anomaly Detection
Autoencoders are neural networks trained to reconstruct input data. They learn to compress normal patterns into a lower-dimensional representation. Anomalies, being different, have high reconstruction error.
How Autoencoders Work
- Encoder: Compresses input into a latent representation
- Decoder: Reconstructs the original input from the latent space
- Training: Minimize reconstruction error on normal data
- Detection: High reconstruction error indicates anomaly
Implementation with Keras
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers
from sklearn.preprocessing import StandardScaler
# Prepare data (X_train is assumed to contain only normal examples)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
# Build autoencoder
input_dim = X_train_scaled.shape[1]
encoding_dim = 8 # Size of the bottleneck (latent) layer
# Encoder
encoder_input = layers.Input(shape=(input_dim,))
encoded = layers.Dense(32, activation='relu')(encoder_input)
encoded = layers.Dense(16, activation='relu')(encoded)
encoded = layers.Dense(encoding_dim, activation='relu')(encoded)
# Decoder
decoded = layers.Dense(16, activation='relu')(encoded)
decoded = layers.Dense(32, activation='relu')(decoded)
decoded = layers.Dense(input_dim, activation='linear')(decoded)
# Complete autoencoder
autoencoder = keras.Model(encoder_input, decoded)
autoencoder.compile(optimizer='adam', loss='mse')
# Train on normal data only
history = autoencoder.fit(
    X_train_scaled, X_train_scaled,
    epochs=50,
    batch_size=32,
    validation_split=0.1,
    verbose=1
)
# Detect anomalies on test data
X_test_scaled = scaler.transform(X_test)
reconstructions = autoencoder.predict(X_test_scaled)
# Calculate reconstruction error
mse = np.mean(np.power(X_test_scaled - reconstructions, 2), axis=1)
# Set threshold (e.g., 95th percentile of training errors)
train_reconstructions = autoencoder.predict(X_train_scaled)
train_mse = np.mean(np.power(X_train_scaled - train_reconstructions, 2), axis=1)
threshold = np.percentile(train_mse, 95)
# Flag anomalies
anomalies = mse > threshold
print(f"Detected {anomalies.sum()} anomalies")
Advantages of Autoencoders
- Excellent for high-dimensional data (images, sensor data)
- Capture complex, non-linear patterns
- No assumptions about data distribution
- Can be adapted with different architectures (CNN or LSTM layers for images and sequences; see the sketch below)
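As an example of the last point, the same reconstruction-error idea carries over to time series by swapping the Dense layers for LSTM layers. A minimal sketch, with window size, feature count, and layer widths chosen purely for illustration:
from tensorflow import keras
from tensorflow.keras import layers
timesteps, n_features = 30, 4  # e.g. 30-step windows of 4 sensor channels
seq_input = layers.Input(shape=(timesteps, n_features))
# Encoder: compress each window into a single latent vector
encoded = layers.LSTM(16)(seq_input)
# Decoder: repeat the latent vector and reconstruct the sequence
decoded = layers.RepeatVector(timesteps)(encoded)
decoded = layers.LSTM(16, return_sequences=True)(decoded)
decoded = layers.TimeDistributed(layers.Dense(n_features))(decoded)
lstm_autoencoder = keras.Model(seq_input, decoded)
lstm_autoencoder.compile(optimizer='adam', loss='mse')
# Train on windows of normal behaviour, then flag windows whose
# reconstruction error exceeds a percentile threshold, as above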
Comparing Methods
| Method | Best For | Pros | Cons |
|---|---|---|---|
| Statistical | Simple, 1D data | Fast, interpretable | Assumes distribution |
| Isolation Forest | Large, mixed data | Scalable, robust | Less interpretable |
| One-Class SVM | Small, structured | Strong boundaries | Slow, needs tuning |
| Autoencoders | Images, sequences | Handles complexity | Needs more data |
Best Practices
- Understand Your Data: Know what "normal" looks like before detecting anomalies
- Feature Engineering: Create domain-specific features that highlight anomalies
- Set Appropriate Thresholds: Balance false positives vs false negatives based on business cost
- Validate with Domain Experts: Have experts review detected anomalies to refine your model
- Monitor Over Time: Normal patterns change; retrain models regularly
- Combine Methods: Ensemble different techniques for better results
- Handle Imbalance: Use contamination parameter carefully; anomalies are rare by definition
- Visualize: Use dimensionality reduction (PCA, t-SNE) to visualize anomalies, as in the sketch below
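As a quick illustration of the visualization tip, the sketch below projects the data onto two principal components and highlights the flagged points; it assumes X_scaled and the Isolation Forest predictions (-1 = anomaly) from the earlier example are in scope:
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
# Project the scaled features onto two principal components
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)
is_anomaly = predictions == -1
plt.scatter(X_2d[~is_anomaly, 0], X_2d[~is_anomaly, 1], s=10, alpha=0.5, label='normal')
plt.scatter(X_2d[is_anomaly, 0], X_2d[is_anomaly, 1], s=20, c='red', label='anomaly')
plt.xlabel('PC 1')
plt.ylabel('PC 2')
plt.legend()
plt.title('Anomalies in PCA space')
plt.show()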
Complete Example: Credit Card Fraud Detection
import pandas as pd
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
# Load transaction data
df = pd.read_csv('credit_card_transactions.csv')
# Feature engineering
df['hour'] = pd.to_datetime(df['timestamp']).dt.hour
df['amount_log'] = np.log1p(df['amount']) # Log transform for skewed amounts
# Select features (merchant_category is assumed to be numerically encoded)
features = ['amount', 'amount_log', 'hour', 'merchant_category', 'distance_from_home']
X = df[features]
# Scale features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Train Isolation Forest
iso_forest = IsolationForest(
    contamination=0.01,  # Expect 1% fraud
    n_estimators=100,
    max_samples='auto',
    random_state=42
)
# Predict anomalies
df['is_fraud_predicted'] = iso_forest.fit_predict(X_scaled)
df['anomaly_score'] = iso_forest.score_samples(X_scaled)
# Sort by most anomalous
df_sorted = df.sort_values('anomaly_score')
high_risk = df_sorted.head(100) # Top 100 most suspicious
print(f"Flagged {(df['is_fraud_predicted'] == -1).sum()} transactions as potential fraud")
print("\nMost suspicious transactions:")
print(high_risk[['amount', 'merchant_category', 'anomaly_score']].head(10))
# If you have labels, evaluate performance
if 'is_fraud_actual' in df.columns:
from sklearn.metrics import classification_report
print("\nPerformance:")
print(classification_report(
df['is_fraud_actual'],
df['is_fraud_predicted'] == -1
))
Master Anomaly Detection with Expert Guidance
Our Data Science program covers anomaly detection in depth, from statistical methods to deep learning. Build fraud detection systems and learn from real-world case studies.