What is Exploratory Data Analysis?
Exploratory Data Analysis (EDA) is the critical first step in any data science project. It's the process of analyzing and visualizing data to summarize its main characteristics, discover patterns, spot anomalies, test hypotheses, and check assumptions.
John Tukey, who pioneered EDA in the 1970s, emphasized that "The greatest value of a picture is when it forces us to notice what we never expected to see." EDA is about letting the data tell its story before you impose your models on it.
Why EDA is Essential
- Understand Data Structure: Learn about features, types, and relationships
- Detect Data Quality Issues: Find missing values, outliers, and inconsistencies
- Discover Patterns: Identify trends, correlations, and distributions
- Generate Hypotheses: Form questions and theories to test
- Guide Feature Engineering: Decide which features to create or transform
- Select Models: Choose appropriate algorithms based on data characteristics
- Communicate Insights: Create visualizations that tell data stories
1. Initial Data Inspection
Start by understanding the basic structure and content of your data.
Load and Examine Data
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Set display options
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)
# Load data
df = pd.read_csv('data.csv')
# First look at the data
print(df.head()) # First 5 rows
print(df.tail()) # Last 5 rows
print(df.sample(10)) # Random 10 rows
# Dataset shape
print(f"Rows: {df.shape[0]}, Columns: {df.shape[1]}")
# Column information
print(df.info())
print(df.dtypes)
# Column names
print(df.columns.tolist())
Basic Statistics
# Summary statistics for numerical features
print(df.describe())
# Include categorical features
print(df.describe(include='all'))
# Custom statistics
print(df.describe(percentiles=[.1, .25, .5, .75, .9, .95, .99]))
# Statistics for specific columns
print(df[['age', 'salary', 'score']].describe())
2. Data Quality Assessment
Missing Values Analysis
# Count missing values
missing = df.isnull().sum()
print(missing[missing > 0])
# Percentage of missing values
missing_percent = (df.isnull().sum() / len(df)) * 100
print(missing_percent[missing_percent > 0].sort_values(ascending=False))
# Visualize missing data
import missingno as msno
# Bar chart of missing values
msno.bar(df)
plt.show()
# Matrix showing patterns of missingness
msno.matrix(df)
plt.show()
# Heatmap of missing value correlations
msno.heatmap(df)
plt.show()
Duplicate Detection
# Check for duplicates
duplicates = df.duplicated().sum()
print(f"Number of duplicate rows: {duplicates}")
# View duplicate rows
print(df[df.duplicated(keep=False)])
# Check duplicates on specific columns
duplicate_ids = df.duplicated(subset=['customer_id']).sum()
print(f"Duplicate customer IDs: {duplicate_ids}")
Data Type Issues
# Check for mixed data types in columns
for col in df.columns:
    unique_types = df[col].apply(type).unique()
    if len(unique_types) > 1:
        print(f"{col} has mixed types: {unique_types}")
# Identify numeric columns stored as strings
for col in df.select_dtypes(include='object'):
    try:
        pd.to_numeric(df[col])
        print(f"{col} can be converted to numeric")
    except (ValueError, TypeError):  # avoid a bare except; catch only parse failures
        pass
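Once a column has been flagged as numeric-in-disguise, `pd.to_numeric` with `errors='coerce'` converts what it can and turns unparseable entries into NaN, which can then be inspected or imputed. A minimal sketch on a toy column (the values are illustrative):

```python
import pandas as pd

# A numeric column stored as strings, with one unparseable entry
s = pd.Series(['1', '2', 'three', '4'])

# errors='coerce' converts valid entries and turns the rest into NaN
converted = pd.to_numeric(s, errors='coerce')
print(converted.tolist())
print(f"Unparseable values coerced to NaN: {converted.isna().sum()}")
```

Counting the NaNs introduced by the coercion tells you how many entries were genuinely non-numeric before you commit to the conversion.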
3. Univariate Analysis
Analyze each variable individually to understand its distribution and characteristics.
Numerical Features
# Distribution plots
fig, axes = plt.subplots(2, 2, figsize=(15, 10))
# Histogram
df['age'].hist(bins=30, ax=axes[0, 0])
axes[0, 0].set_title('Age Distribution')
# Density plot (KDE)
df['salary'].plot(kind='density', ax=axes[0, 1])
axes[0, 1].set_title('Salary Density')
# Box plot (shows outliers)
df.boxplot(column='score', ax=axes[1, 0])
axes[1, 0].set_title('Score Box Plot')
# Violin plot (combines box plot and KDE)
sns.violinplot(y=df['experience'], ax=axes[1, 1])
axes[1, 1].set_title('Experience Violin Plot')
plt.tight_layout()
plt.show()
# Multiple distributions at once
df[['age', 'salary', 'score', 'experience']].hist(
    bins=30, figsize=(15, 10), edgecolor='black'
)
plt.tight_layout()
plt.show()
Categorical Features
# Value counts
print(df['category'].value_counts())
print(df['category'].value_counts(normalize=True)) # Percentages
# Bar plots
fig, axes = plt.subplots(1, 2, figsize=(15, 5))
# Count plot
sns.countplot(data=df, x='category', ax=axes[0])
axes[0].set_title('Category Distribution')
axes[0].tick_params(axis='x', rotation=45)
# Pie chart
df['category'].value_counts().plot(kind='pie', autopct='%1.1f%%', ax=axes[1])
axes[1].set_title('Category Proportions')
plt.tight_layout()
plt.show()
# For many categories, show top N
top_categories = df['city'].value_counts().head(10)
top_categories.plot(kind='barh', figsize=(10, 6))
plt.title('Top 10 Cities')
plt.xlabel('Count')
plt.show()
Statistical Tests
from scipy import stats
# Test for normality
statistic, p_value = stats.normaltest(df['salary'].dropna())
print(f"Normal test p-value: {p_value}")
if p_value < 0.05:
    print("Data is NOT normally distributed")
else:
    print("Data appears normally distributed")
# Skewness and Kurtosis
print(f"Skewness: {df['salary'].skew()}")
print(f"Kurtosis: {df['salary'].kurt()}")
# Visual normality check: Q-Q plot
from scipy.stats import probplot
probplot(df['salary'].dropna(), dist="norm", plot=plt)
plt.title('Q-Q Plot')
plt.show()
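When the skewness check above reveals a strongly right-skewed feature (as salary columns often are), a log transform usually pulls the distribution closer to symmetric. A sketch on synthetic lognormal data (the values are illustrative, not from the dataset):

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed values, a stand-in for a salary-like column
rng = np.random.default_rng(42)
salary = pd.Series(rng.lognormal(mean=10, sigma=0.8, size=5000))
print(f"Skewness before: {salary.skew():.2f}")

# log1p = log(1 + x), safe even if zeros are present
log_salary = np.log1p(salary)
print(f"Skewness after log transform: {log_salary.skew():.2f}")
```

Re-running the normality test and Q-Q plot on the transformed column shows whether the transform is worth keeping for downstream modeling.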
4. Bivariate Analysis
Explore relationships between two variables.
Numerical vs Numerical
# Scatter plot
plt.figure(figsize=(10, 6))
plt.scatter(df['experience'], df['salary'], alpha=0.5)
plt.xlabel('Experience (years)')
plt.ylabel('Salary')
plt.title('Experience vs Salary')
plt.show()
# Scatter plot with regression line
sns.regplot(data=df, x='experience', y='salary')
plt.show()
# Joint plot (scatter + distributions)
sns.jointplot(data=df, x='experience', y='salary', kind='scatter')
plt.show()
# Hexbin plot (for large datasets)
df.plot(kind='hexbin', x='experience', y='salary', gridsize=20, figsize=(10, 6))
plt.show()
# Correlation coefficient
correlation = df['experience'].corr(df['salary'])
print(f"Correlation: {correlation:.3f}")
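The `.corr()` default is Pearson, which only measures linear association; for relationships that are monotonic but nonlinear, Spearman rank correlation is a useful complement. A sketch on synthetic data (the cubic relationship is illustrative):

```python
import numpy as np
import pandas as pd

# Synthetic data: salary grows nonlinearly (but monotonically) with experience
rng = np.random.default_rng(0)
experience = pd.Series(rng.uniform(0, 30, 1000))
salary = experience ** 3 + rng.normal(0, 50, 1000)

pearson = experience.corr(salary)                      # linear association
spearman = experience.corr(salary, method='spearman')  # monotonic association
print(f"Pearson:  {pearson:.3f}")
print(f"Spearman: {spearman:.3f}")
```

A large gap between the two coefficients is itself an EDA finding: it suggests a nonlinear relationship that a scatter plot should confirm.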
Categorical vs Numerical
# Box plots by category
plt.figure(figsize=(12, 6))
sns.boxplot(data=df, x='department', y='salary')
plt.xticks(rotation=45)
plt.title('Salary by Department')
plt.show()
# Violin plots (shows distribution shape)
sns.violinplot(data=df, x='education', y='salary')
plt.show()
# Strip plot (shows all points)
sns.stripplot(data=df, x='department', y='salary', alpha=0.5)
plt.show()
# Swarm plot (non-overlapping points, good for small datasets)
sns.swarmplot(data=df, x='department', y='salary', size=3)
plt.show()
# Statistical summary by category
print(df.groupby('department')['salary'].describe())
Categorical vs Categorical
# Crosstab
crosstab = pd.crosstab(df['department'], df['education'])
print(crosstab)
# Normalized crosstab (proportions)
crosstab_norm = pd.crosstab(df['department'], df['education'], normalize='index')
print(crosstab_norm)
# Heatmap of crosstab
plt.figure(figsize=(10, 6))
sns.heatmap(crosstab, annot=True, fmt='d', cmap='YlOrRd')
plt.title('Department vs Education')
plt.show()
# Stacked bar chart
crosstab.plot(kind='bar', stacked=True, figsize=(10, 6))
plt.title('Department vs Education Distribution')
plt.legend(title='Education')
plt.xticks(rotation=45)
plt.show()
# Grouped bar chart
crosstab.plot(kind='bar', figsize=(12, 6))
plt.title('Department vs Education')
plt.legend(title='Education')
plt.xticks(rotation=45)
plt.show()
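Beyond visual inspection, a chi-square test of independence quantifies whether two categorical variables are associated. A sketch on a toy contingency table (the department and education labels are illustrative):

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Toy contingency table: department (rows) vs education (columns)
crosstab = pd.DataFrame(
    [[30, 10], [10, 30]],
    index=['Engineering', 'Sales'],
    columns=['Bachelors', 'Masters'],
)

chi2, p_value, dof, expected = chi2_contingency(crosstab)
print(f"chi2 = {chi2:.2f}, p = {p_value:.4f}, dof = {dof}")
if p_value < 0.05:
    print("Evidence of an association between the two variables")
```

The same call works directly on a `pd.crosstab` result, so it slots in right after the crosstabs built above.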
5. Multivariate Analysis
Analyze relationships among multiple variables simultaneously.
Correlation Analysis
# Correlation matrix (numeric columns only; modern pandas raises on mixed dtypes)
correlation_matrix = df.corr(numeric_only=True)
print(correlation_matrix)
# Heatmap
plt.figure(figsize=(12, 10))
sns.heatmap(correlation_matrix, annot=True, fmt='.2f', cmap='coolwarm',
            center=0, square=True, linewidths=1)
plt.title('Correlation Matrix')
plt.show()
# Find highly correlated feature pairs (|r| > 0.8), using the upper triangle
# to drop self-correlations and mirrored duplicates
upper = correlation_matrix.where(
    np.triu(np.ones(correlation_matrix.shape, dtype=bool), k=1)
)
high_corr = upper.stack()
high_corr = high_corr[high_corr.abs() > 0.8]
print("Highly correlated feature pairs:")
print(high_corr)
# Correlation with target variable
target_corr = df.corr(numeric_only=True)['target'].sort_values(ascending=False)
print(target_corr)
Pair Plots
# Pair plot for selected features
selected_features = ['age', 'experience', 'salary', 'score', 'target']
sns.pairplot(df[selected_features])
plt.show()
# Pair plot with categorical hue
sns.pairplot(df[selected_features], hue='target')
plt.show()
# Customized pair plot
sns.pairplot(df[selected_features],
             diag_kind='kde',  # KDE on diagonal instead of histograms
             plot_kws={'alpha': 0.6})
plt.show()
Multi-dimensional Visualizations
# Scatter plot with size and color
plt.figure(figsize=(12, 8))
scatter = plt.scatter(df['experience'], df['salary'],
                      s=df['score'] * 10,  # size by score
                      c=df['age'],         # color by age
                      alpha=0.6, cmap='viridis')
plt.colorbar(scatter, label='Age')
plt.xlabel('Experience')
plt.ylabel('Salary')
plt.title('Multi-dimensional Scatter Plot')
plt.show()
# Bubble chart with seaborn
sns.scatterplot(data=df, x='experience', y='salary',
                size='score', hue='department',
                sizes=(50, 500), alpha=0.6)
plt.show()
# 3D scatter plot
from mpl_toolkits.mplot3d import Axes3D
fig = plt.figure(figsize=(12, 8))
ax = fig.add_subplot(111, projection='3d')
ax.scatter(df['age'], df['experience'], df['salary'],
           c=df['score'], cmap='viridis')
ax.set_xlabel('Age')
ax.set_ylabel('Experience')
ax.set_zlabel('Salary')
plt.show()
6. Outlier Detection
# Box plots for outlier visualization
df[['age', 'salary', 'score']].boxplot(figsize=(12, 6))
plt.show()
# Z-score method
from scipy import stats
z_scores = np.abs(stats.zscore(df[['age', 'salary', 'score']].dropna()))
outliers_z = (z_scores > 3).any(axis=1)
print(f"Outliers by Z-score: {outliers_z.sum()}")
# IQR method
Q1 = df['salary'].quantile(0.25)
Q3 = df['salary'].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
outliers_iqr = (df['salary'] < lower_bound) | (df['salary'] > upper_bound)
print(f"Outliers by IQR: {outliers_iqr.sum()}")
# Isolation Forest (ML-based)
from sklearn.ensemble import IsolationForest
iso_forest = IsolationForest(contamination=0.1, random_state=42)
outliers_ml = iso_forest.fit_predict(df[['age', 'salary', 'score']].dropna())
print(f"Outliers by Isolation Forest: {(outliers_ml == -1).sum()}")
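The IQR logic above generalizes into a small reusable helper that flags outliers in any numeric column (a sketch; the 1.5×IQR fences follow Tukey's convention, and the toy series is illustrative):

```python
import pandas as pd

def iqr_outliers(s: pd.Series, k: float = 1.5) -> pd.Series:
    """Boolean mask marking values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = s.quantile([0.25, 0.75])
    iqr = q3 - q1
    return (s < q1 - k * iqr) | (s > q3 + k * iqr)

# Usage on a toy series with one obvious outlier
s = pd.Series([10, 12, 11, 13, 12, 11, 100])
print(s[iqr_outliers(s)])
```

Raising `k` (e.g. to 3.0) makes the fences more conservative, flagging only extreme values.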
7. Advanced EDA Techniques
Feature Interactions
# Create interaction features
df['salary_per_year_experience'] = df['salary'] / (df['experience'] + 1)
# Visualize interactions
sns.scatterplot(data=df, x='age', y='salary', hue='education', style='department')
plt.show()
# Facet grid for multiple subplots
g = sns.FacetGrid(df, col='department', row='education', height=4)
g.map(plt.scatter, 'experience', 'salary', alpha=0.5)
g.add_legend()
plt.show()
Time Series Analysis (if applicable)
# Convert to datetime
df['date'] = pd.to_datetime(df['date'])
# Set as index
df.set_index('date', inplace=True)
# Time-based aggregations
monthly_avg = df.resample('M')['value'].mean()
monthly_avg.plot(figsize=(12, 6))
plt.title('Monthly Average Values')
plt.show()
# Trend and seasonality
from statsmodels.tsa.seasonal import seasonal_decompose
decomposition = seasonal_decompose(df['value'], model='additive', period=12)
decomposition.plot()
plt.show()
8. Automated EDA Tools
Use libraries that automate comprehensive EDA reports.
Pandas Profiling (now ydata-profiling)
from ydata_profiling import ProfileReport
# Generate comprehensive report
profile = ProfileReport(df, title="EDA Report", explorative=True)
# Save to HTML
profile.to_file("eda_report.html")
# View in notebook
profile.to_widgets()
# Access specific sections
print(profile.get_description())
Sweetviz
import sweetviz as sv
# Generate report
report = sv.analyze(df)
report.show_html("sweetviz_report.html")
# Compare datasets (e.g., train vs test)
report = sv.compare([df_train, "Training"], [df_test, "Test"])
report.show_html("comparison_report.html")
AutoViz
from autoviz.AutoViz_Class import AutoViz_Class
AV = AutoViz_Class()
dft = AV.AutoViz('data.csv', sep=',', depVar='target',
                 dfte=None, header=0, verbose=1,
                 lowess=False, chart_format='png',
                 max_rows_analyzed=150000, max_cols_analyzed=30)
Best Practices for EDA
- Start Simple: Begin with basic statistics before complex visualizations
- Ask Questions: Let curiosity guide your analysis
- Document Findings: Keep notes on insights and decisions
- Use Multiple Visualizations: Different plots reveal different patterns
- Check Assumptions: Verify what you think you know about the data
- Look for Stories: Find narratives in the data to communicate
- Iterate: EDA is not linear - revisit earlier steps with new insights
- Consider Context: Domain knowledge is crucial for interpretation
- Be Skeptical: Question apparent patterns - they might be artifacts
Common EDA Workflow
- Load and Inspect: Import data, check shape and types
- Clean Data: Handle missing values and duplicates
- Univariate Analysis: Examine each feature individually
- Bivariate Analysis: Explore relationships between pairs
- Multivariate Analysis: Understand complex interactions
- Feature Engineering Ideas: Note transformations and new features
- Document Insights: Summarize findings and decisions
- Create Report: Compile visualizations and conclusions
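The first steps of the workflow above can be sketched as a single summary function that collects the basic facts in one pass (a hypothetical helper, not part of any library; the toy frame is illustrative):

```python
import pandas as pd

def quick_eda_summary(df: pd.DataFrame) -> dict:
    """Collect first-pass EDA facts: shape, dtypes, missingness, duplicates."""
    return {
        'shape': df.shape,
        'dtypes': df.dtypes.astype(str).to_dict(),
        'missing_pct': (df.isnull().mean() * 100).round(2).to_dict(),
        'n_duplicates': int(df.duplicated().sum()),
        'numeric_summary': df.describe().to_dict(),
    }

# Usage with a toy frame
toy = pd.DataFrame({'age': [25, 30, None, 30], 'city': ['NY', 'LA', 'NY', 'LA']})
summary = quick_eda_summary(toy)
print(summary['shape'])
print(summary['missing_pct'])
```

Running a helper like this at the top of every notebook keeps the "Load and Inspect" and "Clean Data" steps consistent across projects.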
Master EDA with Expert Mentorship
Our Data Science program teaches comprehensive EDA techniques with real-world datasets. Learn to extract insights that drive business decisions.
Explore Data Science Program