What is Data Preprocessing?

Data preprocessing is the critical process of transforming raw data into a clean, structured format suitable for machine learning models. It's often said that data scientists spend 80% of their time on data preparation - and for good reason.

Real-world data is messy: it contains missing values, outliers, inconsistent formats, and irrelevant features. Models trained on poorly preprocessed data will produce unreliable results, regardless of algorithm sophistication. Quality data preprocessing is the foundation of successful machine learning projects.

Why Data Preprocessing Matters

  • Improved Model Performance: Clean data leads to better predictions and higher accuracy
  • Faster Training: Proper scaling and encoding reduce computational overhead
  • Reduced Bias: Handling outliers and imbalanced data prevents skewed models
  • Better Interpretability: Standardized features make models easier to understand
  • Robust Predictions: Consistent data formats ensure reliable production performance

1. Handling Missing Values

Missing data is one of the most common problems. Your strategy depends on the amount and pattern of missingness.

Understanding Missing Data

import pandas as pd
import numpy as np

# Check for missing values
print(df.isnull().sum())

# Visualize missing data percentage
missing_percent = (df.isnull().sum() / len(df)) * 100
print(missing_percent[missing_percent > 0].sort_values(ascending=False))

# Visualize missing data patterns
import missingno as msno
msno.matrix(df)  # Shows patterns of missingness

Strategies for Handling Missing Values

Option 1: Remove Missing Data

Use when missing data is minimal (<5%) and randomly distributed.

# Remove rows with any missing values
df_clean = df.dropna()

# Remove rows where specific column is missing
df_clean = df.dropna(subset=['important_column'])

# Remove columns with too many missing values (>50%)
threshold = int(len(df) * 0.5)  # minimum non-missing values required to keep a column
df_clean = df.dropna(axis=1, thresh=threshold)

Option 2: Simple Imputation

from sklearn.impute import SimpleImputer

# Numerical features: fill with mean/median
imputer_num = SimpleImputer(strategy='median')
df['age'] = imputer_num.fit_transform(df[['age']])

# Categorical features: fill with most frequent value
imputer_cat = SimpleImputer(strategy='most_frequent')
df['category'] = imputer_cat.fit_transform(df[['category']])

# Or use pandas directly (assignment form; chained inplace=True is deprecated in recent pandas)
df['salary'] = df['salary'].fillna(df['salary'].median())
df['city'] = df['city'].fillna(df['city'].mode()[0])

Option 3: Advanced Imputation

from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer, KNNImputer

# Iterative imputation (models each feature with missing values)
iterative_imputer = IterativeImputer(max_iter=10, random_state=42)
df_imputed = pd.DataFrame(
    iterative_imputer.fit_transform(df),
    columns=df.columns
)

# KNN Imputation (uses similar samples)
knn_imputer = KNNImputer(n_neighbors=5)
df_imputed = pd.DataFrame(
    knn_imputer.fit_transform(df),
    columns=df.columns
)

2. Handling Outliers

Outliers are extreme values that can distort model training. They can be errors or genuine rare events.

Detecting Outliers

import matplotlib.pyplot as plt
import seaborn as sns

# Visualize outliers with box plots
plt.figure(figsize=(12, 6))
df.boxplot(column=['age', 'salary', 'score'])
plt.show()

# Statistical detection: Z-score method
from scipy import stats
z_scores = np.abs(stats.zscore(df['salary'], nan_policy='omit'))  # skip NaNs if any remain
outliers = df[z_scores > 3]  # Values beyond 3 standard deviations

# IQR (Interquartile Range) method
Q1 = df['salary'].quantile(0.25)
Q3 = df['salary'].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
outliers = df[(df['salary'] < lower_bound) | (df['salary'] > upper_bound)]

Handling Outliers

# Option 1: Remove outliers
df_clean = df[(df['salary'] >= lower_bound) & (df['salary'] <= upper_bound)]

# Option 2: Cap outliers (winsorization)
df['salary'] = df['salary'].clip(lower=lower_bound, upper=upper_bound)

# Option 3: Transform data (log transformation)
df['salary_log'] = np.log1p(df['salary'])  # log(1 + x) to handle zeros

# Option 4: Robust scaling (less sensitive to outliers)
from sklearn.preprocessing import RobustScaler
scaler = RobustScaler()
df['salary_scaled'] = scaler.fit_transform(df[['salary']])

3. Feature Scaling and Normalization

Machine learning algorithms perform better when features are on similar scales. This is crucial for distance-based algorithms and gradient descent.

Standardization (Z-score Normalization)

Transforms features to have mean=0 and std=1. Use when data follows normal distribution.

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)  # Use same scaling parameters

# Creates features with mean=0, std=1
# Formula: (x - mean) / std

Min-Max Normalization

Scales features to a fixed range [0, 1]. Use when you need bounded values.

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)

# Formula: (x - min) / (max - min)
# Result: all values between 0 and 1

Robust Scaling

Uses median and IQR, less sensitive to outliers.

from sklearn.preprocessing import RobustScaler

scaler = RobustScaler()
X_scaled = scaler.fit_transform(X)

# Formula: (x - median) / IQR
# Better for data with outliers

When to Use Each Scaler

  • StandardScaler: Most common. Use for normally distributed data, algorithms like SVM, KNN, PCA
  • MinMaxScaler: Use when you need bounded ranges (e.g., neural networks with sigmoid/tanh)
  • RobustScaler: Use when data has outliers that you want to keep
  • No Scaling: Tree-based models (Random Forest, XGBoost) don't need scaling

4. Encoding Categorical Variables

Machine learning models need numerical input. Convert categorical variables appropriately based on their nature.

One-Hot Encoding

Use for nominal categories (no inherent order). Creates binary columns for each category.

import pandas as pd

# Pandas get_dummies
df_encoded = pd.get_dummies(df, columns=['city', 'department'])

# Scikit-learn OneHotEncoder (better for pipelines)
from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')  # sparse_output replaces sparse in scikit-learn >= 1.2
encoded = encoder.fit_transform(df[['city', 'department']])

# Get feature names
feature_names = encoder.get_feature_names_out(['city', 'department'])
df_encoded = pd.DataFrame(encoded, columns=feature_names)

Label Encoding

Assigns an integer to each category. Note that LabelEncoder sorts categories alphabetically, so the integers carry no meaningful order; it's intended for target labels. For ordinal features with a real ordering, use Ordinal Encoding (below) instead.

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
df['education_encoded'] = le.fit_transform(df['education'])
# Alphabetical order: 'Bachelor'=0, 'High School'=1, 'Master'=2, 'PhD'=3

# For target variable in classification
y_encoded = le.fit_transform(y)

# Decode back to original labels
y_original = le.inverse_transform(y_encoded)

Ordinal Encoding

Use when you want to specify custom ordering.

from sklearn.preprocessing import OrdinalEncoder

# Define custom order
categories = [['Low', 'Medium', 'High']]
encoder = OrdinalEncoder(categories=categories)
df['priority_encoded'] = encoder.fit_transform(df[['priority']])

# Or use pandas map for simple cases
priority_map = {'Low': 0, 'Medium': 1, 'High': 2}
df['priority_encoded'] = df['priority'].map(priority_map)

Target Encoding

Replace categories with target mean. Use carefully to avoid data leakage.

from category_encoders import TargetEncoder

# Use only on training data, then transform test data
encoder = TargetEncoder()
X_train_encoded = encoder.fit_transform(X_train['category'], y_train)
X_test_encoded = encoder.transform(X_test['category'])

# Built-in smoothing blends each category's mean with the global mean,
# preventing overfitting on rare categories

5. Feature Transformation

Transform features to better meet model assumptions and improve performance.

Log Transformation

Reduces right skewness, handles wide value ranges.

import numpy as np

# Log transformation (for positive values)
df['income_log'] = np.log1p(df['income'])  # log(1 + x) handles zeros

# Square root transformation (milder than log)
df['price_sqrt'] = np.sqrt(df['price'])

# Box-Cox transformation (finds the best power transformation;
# requires strictly positive values)
from scipy.stats import boxcox
df['transformed'], lambda_param = boxcox(df['skewed_feature'])

Binning/Discretization

Convert continuous variables into categorical bins.

# Equal-width binning
df['age_group'] = pd.cut(df['age'], bins=5, labels=['Very Young', 'Young', 'Middle', 'Senior', 'Elderly'])

# Custom bins
bins = [0, 18, 35, 50, 65, 100]
labels = ['Child', 'Young Adult', 'Adult', 'Middle Age', 'Senior']
df['age_category'] = pd.cut(df['age'], bins=bins, labels=labels)

# Quantile-based binning (equal frequency)
df['income_quartile'] = pd.qcut(df['income'], q=4, labels=['Q1', 'Q2', 'Q3', 'Q4'])

6. Handling Duplicates

# Check for duplicates
duplicates = df.duplicated()
print(f"Number of duplicates: {duplicates.sum()}")

# View duplicate rows
print(df[df.duplicated(keep=False)])

# Remove duplicates (keep first occurrence)
df_clean = df.drop_duplicates()

# Remove duplicates based on specific columns
df_clean = df.drop_duplicates(subset=['customer_id', 'date'])

# Remove duplicates keeping last occurrence
df_clean = df.drop_duplicates(keep='last')

Complete Preprocessing Pipeline

Put it all together with scikit-learn's Pipeline for reproducible preprocessing.

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer

# Define feature types
numeric_features = ['age', 'salary', 'experience']
categorical_features = ['city', 'department', 'education']

# Numeric pipeline
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

# Categorical pipeline
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Combine transformers
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ])

# Use in a full pipeline with a model
from sklearn.ensemble import RandomForestClassifier

full_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier())
])

# Fit and predict (preprocessing happens automatically)
full_pipeline.fit(X_train, y_train)
predictions = full_pipeline.predict(X_test)

Best Practices

  • Understand Your Data First: Always perform EDA before preprocessing
  • Split Before Preprocessing: Split train/test sets first to prevent data leakage
  • Fit on Training Only: Calculate statistics (mean, std) on training data only
  • Document Decisions: Keep track of why you chose specific preprocessing steps
  • Use Pipelines: Ensure consistent preprocessing on new data
  • Validate Results: Check that preprocessing improved model performance
  • Handle Time Series Carefully: Don't shuffle or use future data for past predictions
  • Consider Domain Knowledge: Some outliers might be genuine important events
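The two leakage-related practices, split before preprocessing and fit on training data only, can be sketched together. The data here is synthetic, purely for illustration:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical feature matrix and binary labels
rng = np.random.default_rng(42)
X = rng.normal(loc=50, scale=10, size=(100, 3))
y = rng.integers(0, 2, size=100)

# 1. Split FIRST, before computing any statistics
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 2. Fit the scaler on training data only...
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)

# ...then apply the training-set mean/std to the test set.
# The test set's own statistics are never used, so no information leaks.
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.mean(axis=0).round(6))  # exactly standardized
print(X_test_scaled.mean(axis=0).round(3))   # only approximately zero
```

The training features come out with mean exactly 0 and std exactly 1; the test features are only approximately standardized, which is expected and correct.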

Common Pitfalls to Avoid

  • Data Leakage: Fitting scalers on entire dataset including test data
  • Over-preprocessing: Removing too much data or creating too many features
  • Ignoring Test Set: Preprocessing that works on training might fail on new data
  • Wrong Encoding: One-hot encoding ordinal features, or label encoding nominal ones
  • Scaling Tree Models: Unnecessary for Random Forest, XGBoost, etc.
  • Forgetting to Save Preprocessing Objects: Can't preprocess new data in production
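The last pitfall is easy to avoid: persist every fitted preprocessing object alongside the model. A minimal sketch with joblib (the file name is illustrative):

```python
import joblib
import numpy as np
from sklearn.preprocessing import StandardScaler

# Fit a scaler during training (toy data for illustration)
scaler = StandardScaler()
scaler.fit(np.array([[1.0], [2.0], [3.0]]))

# Save the fitted scaler so production code can reuse the same parameters
joblib.dump(scaler, 'scaler.joblib')

# Later, in production: load and apply the identical transformation
loaded = joblib.load('scaler.joblib')
print(loaded.transform(np.array([[2.0]])))  # the fit-time mean maps to 0
```

The same pattern works for encoders, imputers, or an entire fitted Pipeline, which is another argument for bundling all preprocessing into one pipeline object.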

Master Data Preprocessing with Expert Guidance

Our Data Science program covers comprehensive data preprocessing techniques with hands-on projects. Learn to clean and transform real-world messy data.
