What is Exploratory Data Analysis?
Exploratory Data Analysis (EDA) is the critical first step in any data science project. It's the process of analyzing and visualizing data to summarize its main characteristics, discover patterns, spot anomalies, test hypotheses, and check assumptions.
John Tukey, who pioneered EDA in the 1970s, emphasized that "The greatest value of a picture is when it forces us to notice what we never expected to see." EDA is about letting the data tell its story before you impose your models on it.
Why EDA is Essential
- Understand Data Structure: Learn about features, types, and relationships
- Detect Data Quality Issues: Find missing values, outliers, and inconsistencies
- Discover Patterns: Identify trends, correlations, and distributions
- Generate Hypotheses: Form questions and theories to test
- Guide Feature Engineering: Decide which features to create or transform
- Select Models: Choose appropriate algorithms based on data characteristics
- Communicate Insights: Create visualizations that tell data stories
1. Initial Data Inspection
Start by understanding the basic structure and content of your data.
Load and Examine Data
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Set display options
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)
# Load data
df = pd.read_csv('data.csv')
# First look at the data
print(df.head()) # First 5 rows
print(df.tail()) # Last 5 rows
print(df.sample(10)) # Random 10 rows
# Dataset shape
print(f"Rows: {df.shape[0]}, Columns: {df.shape[1]}")
# Column information
print(df.info())
print(df.dtypes)
# Column names
print(df.columns.tolist())
Basic Statistics
# Summary statistics for numerical features
print(df.describe())
# Include categorical features
print(df.describe(include='all'))
# Custom statistics
print(df.describe(percentiles=[.1, .25, .5, .75, .9, .95, .99]))
# Statistics for specific columns
print(df[['age', 'salary', 'score']].describe())
2. Data Quality Assessment
Missing Values Analysis
# Count missing values
missing = df.isnull().sum()
print(missing[missing > 0])
# Percentage of missing values
missing_percent = (df.isnull().sum() / len(df)) * 100
print(missing_percent[missing_percent > 0].sort_values(ascending=False))
# Visualize missing data
import missingno as msno
# Bar chart of missing values
msno.bar(df)
plt.show()
# Matrix showing patterns of missingness
msno.matrix(df)
plt.show()
# Heatmap of missing value correlations
msno.heatmap(df)
plt.show()
Duplicate Detection
# Check for duplicates
duplicates = df.duplicated().sum()
print(f"Number of duplicate rows: {duplicates}")
# View duplicate rows
print(df[df.duplicated(keep=False)])
# Check duplicates on specific columns
duplicate_ids = df.duplicated(subset=['customer_id']).sum()
print(f"Duplicate customer IDs: {duplicate_ids}")
Data Type Issues
# Check for mixed data types in columns
for col in df.columns:
    unique_types = df[col].apply(type).unique()
    if len(unique_types) > 1:
        print(f"{col} has mixed types: {unique_types}")
# Identify numeric columns stored as strings
for col in df.select_dtypes(include='object'):
    try:
        pd.to_numeric(df[col])
        print(f"{col} can be converted to numeric")
    except (ValueError, TypeError):  # avoid a bare except; catch only parse failures
        pass
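Once a column has been flagged as numeric-in-disguise, `pd.to_numeric` with `errors='coerce'` converts what it can and turns unparseable entries into NaN, which can then be inspected or imputed. A minimal sketch on a toy column (the values are illustrative):

```python
import pandas as pd

# A numeric column stored as strings, with one unparseable entry
s = pd.Series(['1', '2', 'three', '4'])

# errors='coerce' converts valid entries and turns the rest into NaN
converted = pd.to_numeric(s, errors='coerce')
print(converted.tolist())
print(f"Unparseable values coerced to NaN: {converted.isna().sum()}")
```

Counting the NaNs introduced by the coercion tells you how many entries were genuinely non-numeric before you commit to the conversion.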
3. Univariate Analysis
Analyze each variable individually to understand its distribution and characteristics.
Numerical Features
# Distribution plots
fig, axes = plt.subplots(2, 2, figsize=(15, 10))
# Histogram
df['age'].hist(bins=30, ax=axes[0, 0])
axes[0, 0].set_title('Age Distribution')
# Density plot (KDE)
df['salary'].plot(kind='density', ax=axes[0, 1])
axes[0, 1].set_title('Salary Density')
# Box plot (shows outliers)
df.boxplot(column='score', ax=axes[1, 0])
axes[1, 0].set_title('Score Box Plot')
# Violin plot (combines box plot and KDE)
sns.violinplot(y=df['experience'], ax=axes[1, 1])
axes[1, 1].set_title('Experience Violin Plot')
plt.tight_layout()
plt.show()
# Multiple distributions at once
df[['age', 'salary', 'score', 'experience']].hist(
    bins=30, figsize=(15, 10), edgecolor='black'
)
plt.tight_layout()
plt.show()
Categorical Features
# Value counts
print(df['category'].value_counts())
print(df['category'].value_counts(normalize=True)) # Percentages
# Bar plots
fig, axes = plt.subplots(1, 2, figsize=(15, 5))
# Count plot
sns.countplot(data=df, x='category', ax=axes[0])
axes[0].set_title('Category Distribution')
axes[0].tick_params(axis='x', rotation=45)
# Pie chart
df['category'].value_counts().plot(kind='pie', autopct='%1.1f%%', ax=axes[1])
axes[1].set_title('Category Proportions')
plt.tight_layout()
plt.show()
# For many categories, show top N
top_categories = df['city'].value_counts().head(10)
top_categories.plot(kind='barh', figsize=(10, 6))
plt.title('Top 10 Cities')
plt.xlabel('Count')
plt.show()
Statistical Tests
from scipy import stats
# Test for normality
statistic, p_value = stats.normaltest(df['salary'].dropna())
print(f"Normal test p-value: {p_value}")
if p_value < 0.05:
    print("Data is NOT normally distributed")
else:
    print("Data appears normally distributed")
# Skewness and Kurtosis
print(f"Skewness: {df['salary'].skew()}")
print(f"Kurtosis: {df['salary'].kurt()}")
# Visual normality check: Q-Q plot
from scipy.stats import probplot
probplot(df['salary'].dropna(), dist="norm", plot=plt)
plt.title('Q-Q Plot')
plt.show()
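When the skewness check above reveals a strongly right-skewed feature (as salary columns often are), a log transform usually pulls the distribution closer to symmetric. A sketch on synthetic lognormal data (the values are illustrative, not from the dataset):

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed values, a stand-in for a salary-like column
rng = np.random.default_rng(42)
salary = pd.Series(rng.lognormal(mean=10, sigma=0.8, size=5000))
print(f"Skewness before: {salary.skew():.2f}")

# log1p = log(1 + x), safe even if zeros are present
log_salary = np.log1p(salary)
print(f"Skewness after log transform: {log_salary.skew():.2f}")
```

Re-running the normality test and Q-Q plot on the transformed column shows whether the transform is worth keeping for downstream modeling.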
4. Bivariate Analysis
Explore relationships between two variables.
Numerical vs Numerical
# Scatter plot
plt.figure(figsize=(10, 6))
plt.scatter(df['experience'], df['salary'], alpha=0.5)
plt.xlabel('Experience (years)')
plt.ylabel('Salary')
plt.title('Experience vs Salary')
plt.show()
# Scatter plot with regression line
sns.regplot(data=df, x='experience', y='salary')
plt.show()
# Joint plot (scatter + distributions)
sns.jointplot(data=df, x='experience', y='salary', kind='scatter')
plt.show()
# Hexbin plot (for large datasets)
df.plot(kind='hexbin', x='experience', y='salary', gridsize=20, figsize=(10, 6))
plt.show()
# Correlation coefficient
correlation = df['experience'].corr(df['salary'])
print(f"Correlation: {correlation:.3f}")
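The `.corr()` default is Pearson, which only measures linear association; for relationships that are monotonic but nonlinear, Spearman rank correlation is a useful complement. A sketch on synthetic data (the cubic relationship is illustrative):

```python
import numpy as np
import pandas as pd

# Synthetic data: salary grows nonlinearly (but monotonically) with experience
rng = np.random.default_rng(0)
experience = pd.Series(rng.uniform(0, 30, 1000))
salary = experience ** 3 + rng.normal(0, 50, 1000)

pearson = experience.corr(salary)                      # linear association
spearman = experience.corr(salary, method='spearman')  # monotonic association
print(f"Pearson:  {pearson:.3f}")
print(f"Spearman: {spearman:.3f}")
```

A large gap between the two coefficients is itself an EDA finding: it suggests a nonlinear relationship that a scatter plot should confirm.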
Categorical vs Numerical
# Box plots by category
plt.figure(figsize=(12, 6))
sns.boxplot(data=df, x='department', y='salary')
plt.xticks(rotation=45)
plt.title('Salary by Department')
plt.show()
# Violin plots (shows distribution shape)
sns.violinplot(data=df, x='education', y='salary')
plt.show()
# Strip plot (shows all points)
sns.stripplot(data=df, x='department', y='salary', alpha=0.5)
plt.show()
# Swarm plot (non-overlapping points, good for small datasets)
sns.swarmplot(data=df, x='department', y='salary', size=3)
plt.show()
# Statistical summary by category
print(df.groupby('department')['salary'].describe())
Categorical vs Categorical
# Crosstab
crosstab = pd.crosstab(df['department'], df['education'])
print(crosstab)
# Normalized crosstab (proportions)
crosstab_norm = pd.crosstab(df['department'], df['education'], normalize='index')
print(crosstab_norm)
# Heatmap of crosstab
plt.figure(figsize=(10, 6))
sns.heatmap(crosstab, annot=True, fmt='d', cmap='YlOrRd')
plt.title('Department vs Education')
plt.show()
# Stacked bar chart
crosstab.plot(kind='bar', stacked=True, figsize=(10, 6))
plt.title('Department vs Education Distribution')
plt.legend(title='Education')
plt.xticks(rotation=45)
plt.show()
# Grouped bar chart
crosstab.plot(kind='bar', figsize=(12, 6))
plt.title('Department vs Education')
plt.legend(title='Education')
plt.xticks(rotation=45)
plt.show()
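Beyond visual inspection, a chi-square test of independence quantifies whether two categorical variables are associated. A sketch on a toy contingency table (the department and education labels are illustrative):

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Toy contingency table: department (rows) vs education (columns)
crosstab = pd.DataFrame(
    [[30, 10], [10, 30]],
    index=['Engineering', 'Sales'],
    columns=['Bachelors', 'Masters'],
)

chi2, p_value, dof, expected = chi2_contingency(crosstab)
print(f"chi2 = {chi2:.2f}, p = {p_value:.4f}, dof = {dof}")
if p_value < 0.05:
    print("Evidence of an association between the two variables")
```

The same call works directly on a `pd.crosstab` result, so it slots in right after the crosstabs built above.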
5. Multivariate Analysis
Analyze relationships among multiple variables simultaneously.
Correlation Analysis
# Correlation matrix (numeric columns only; modern pandas raises on mixed dtypes)
correlation_matrix = df.corr(numeric_only=True)
print(correlation_matrix)
# Heatmap
plt.figure(figsize=(12, 10))
sns.heatmap(correlation_matrix, annot=True, fmt='.2f', cmap='coolwarm',
            center=0, square=True, linewidths=1)
plt.title('Correlation Matrix')
plt.show()
# Find highly correlated feature pairs (|r| > 0.8), using the upper triangle
# to drop self-correlations and mirrored duplicates
upper = correlation_matrix.where(
    np.triu(np.ones(correlation_matrix.shape, dtype=bool), k=1)
)
high_corr = upper.stack()
high_corr = high_corr[high_corr.abs() > 0.8]
print("Highly correlated feature pairs:")
print(high_corr)
# Correlation with target variable
target_corr = df.corr(numeric_only=True)['target'].sort_values(ascending=False)
print(target_corr)
Pair Plots
# Pair plot for selected features
selected_features = ['age', 'experience', 'salary', 'score', 'target']
sns.pairplot(df[selected_features])
plt.show()
# Pair plot with categorical hue
sns.pairplot(df[selected_features], hue='target')
plt.show()
# Customized pair plot
sns.pairplot(df[selected_features],
             diag_kind='kde',  # KDE on diagonal instead of histograms
             plot_kws={'alpha': 0.6})
plt.show()
Multi-dimensional Visualizations
# Scatter plot with size and color
plt.figure(figsize=(12, 8))
scatter = plt.scatter(df['experience'], df['salary'],
                      s=df['score'] * 10,  # size by score
                      c=df['age'],         # color by age
                      alpha=0.6, cmap='viridis')
plt.colorbar(scatter, label='Age')
plt.xlabel('Experience')
plt.ylabel('Salary')
plt.title('Multi-dimensional Scatter Plot')
plt.show()
# Bubble chart with seaborn
sns.scatterplot(data=df, x='experience', y='salary',
                size='score', hue='department',
                sizes=(50, 500), alpha=0.6)
plt.show()
# 3D scatter plot
from mpl_toolkits.mplot3d import Axes3D
fig = plt.figure(figsize=(12, 8))
ax = fig.add_subplot(111, projection='3d')
ax.scatter(df['age'], df['experience'], df['salary'],
           c=df['score'], cmap='viridis')
ax.set_xlabel('Age')
ax.set_ylabel('Experience')
ax.set_zlabel('Salary')
plt.show()
6. Outlier Detection
# Box plots for outlier visualization
df[['age', 'salary', 'score']].boxplot(figsize=(12, 6))
plt.show()
# Z-score method
from scipy import stats
z_scores = np.abs(stats.zscore(df[['age', 'salary', 'score']].dropna()))
outliers_z = (z_scores > 3).any(axis=1)
print(f"Outliers by Z-score: {outliers_z.sum()}")
# IQR method
Q1 = df['salary'].quantile(0.25)
Q3 = df['salary'].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
outliers_iqr = (df['salary'] < lower_bound) | (df['salary'] > upper_bound)
print(f"Outliers by IQR: {outliers_iqr.sum()}")
# Isolation Forest (ML-based)
from sklearn.ensemble import IsolationForest
iso_forest = IsolationForest(contamination=0.1, random_state=42)
outliers_ml = iso_forest.fit_predict(df[['age', 'salary', 'score']].dropna())
print(f"Outliers by Isolation Forest: {(outliers_ml == -1).sum()}")
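The IQR logic above generalizes into a small reusable helper that flags outliers in any numeric column (a sketch; the 1.5×IQR fences follow Tukey's convention, and the toy series is illustrative):

```python
import pandas as pd

def iqr_outliers(s: pd.Series, k: float = 1.5) -> pd.Series:
    """Boolean mask marking values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = s.quantile([0.25, 0.75])
    iqr = q3 - q1
    return (s < q1 - k * iqr) | (s > q3 + k * iqr)

# Usage on a toy series with one obvious outlier
s = pd.Series([10, 12, 11, 13, 12, 11, 100])
print(s[iqr_outliers(s)])
```

Raising `k` (e.g. to 3.0) makes the fences more conservative, flagging only extreme values.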
7. Advanced EDA Techniques
Feature Interactions
# Create interaction features
df['salary_per_year_experience'] = df['salary'] / (df['experience'] + 1)
# Visualize interactions
sns.scatterplot(data=df, x='age', y='salary', hue='education', style='department')
plt.show()
# Facet grid for multiple subplots
g = sns.FacetGrid(df, col='department', row='education', height=4)
g.map(plt.scatter, 'experience', 'salary', alpha=0.5)
g.add_legend()
plt.show()
Time Series Analysis (if applicable)
# Convert to datetime
df['date'] = pd.to_datetime(df['date'])
# Set as index
df.set_index('date', inplace=True)
# Time-based aggregations
monthly_avg = df.resample('M')['value'].mean()
monthly_avg.plot(figsize=(12, 6))
plt.title('Monthly Average Values')
plt.show()
# Trend and seasonality
from statsmodels.tsa.seasonal import seasonal_decompose
decomposition = seasonal_decompose(df['value'], model='additive', period=12)
decomposition.plot()
plt.show()
8. Automated EDA Tools
Use libraries that automate comprehensive EDA reports.
Pandas Profiling (now ydata-profiling)
from ydata_profiling import ProfileReport
# Generate comprehensive report
profile = ProfileReport(df, title="EDA Report", explorative=True)
# Save to HTML
profile.to_file("eda_report.html")
# View in notebook
profile.to_widgets()
# Access specific sections
print(profile.get_description())
Sweetviz
import sweetviz as sv
# Generate report
report = sv.analyze(df)
report.show_html("sweetviz_report.html")
# Compare datasets (e.g., train vs test)
report = sv.compare([df_train, "Training"], [df_test, "Test"])
report.show_html("comparison_report.html")
AutoViz
from autoviz.AutoViz_Class import AutoViz_Class
AV = AutoViz_Class()
dft = AV.AutoViz('data.csv', sep=',', depVar='target',
                 dfte=None, header=0, verbose=1,
                 lowess=False, chart_format='png',
                 max_rows_analyzed=150000, max_cols_analyzed=30)
Best Practices for EDA
- Start Simple: Begin with basic statistics before complex visualizations
- Ask Questions: Let curiosity guide your analysis
- Document Findings: Keep notes on insights and decisions
- Use Multiple Visualizations: Different plots reveal different patterns
- Check Assumptions: Verify what you think you know about the data
- Look for Stories: Find narratives in the data to communicate
- Iterate: EDA is not linear - revisit earlier steps with new insights
- Consider Context: Domain knowledge is crucial for interpretation
- Be Skeptical: Question apparent patterns - they might be artifacts
Common EDA Workflow
- Load and Inspect: Import data, check shape and types
- Clean Data: Handle missing values and duplicates
- Univariate Analysis: Examine each feature individually
- Bivariate Analysis: Explore relationships between pairs
- Multivariate Analysis: Understand complex interactions
- Feature Engineering Ideas: Note transformations and new features
- Document Insights: Summarize findings and decisions
- Create Report: Compile visualizations and conclusions
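The first steps of the workflow above can be sketched as a single summary function that collects the basic facts in one pass (a hypothetical helper, not part of any library; the toy frame is illustrative):

```python
import pandas as pd

def quick_eda_summary(df: pd.DataFrame) -> dict:
    """Collect first-pass EDA facts: shape, dtypes, missingness, duplicates."""
    return {
        'shape': df.shape,
        'dtypes': df.dtypes.astype(str).to_dict(),
        'missing_pct': (df.isnull().mean() * 100).round(2).to_dict(),
        'n_duplicates': int(df.duplicated().sum()),
        'numeric_summary': df.describe().to_dict(),
    }

# Usage with a toy frame
toy = pd.DataFrame({'age': [25, 30, None, 30], 'city': ['NY', 'LA', 'NY', 'LA']})
summary = quick_eda_summary(toy)
print(summary['shape'])
print(summary['missing_pct'])
```

Running a helper like this at the top of every notebook keeps the "Load and Inspect" and "Clean Data" steps consistent across projects.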
Master EDA with Expert Mentorship
Our Data Science program teaches comprehensive EDA techniques with real-world datasets. Learn to extract insights that drive business decisions.
Explore Data Science Program