What is XGBoost?

XGBoost (eXtreme Gradient Boosting) is an optimized gradient boosting library designed for speed and performance. It has been used to win numerous Kaggle competitions and is a go-to algorithm for tabular data in both competitions and production systems.

XGBoost builds an ensemble of decision trees sequentially, where each new tree corrects the errors of the ones before it. It handles missing values natively and adds built-in regularization and parallel tree construction on top of standard gradient boosting.

How Gradient Boosting Works

  1. Start with an initial prediction (often the mean)
  2. Calculate residuals (errors)
  3. Train a tree to predict the residuals
  4. Add the tree's predictions to improve the model
  5. Repeat with new residuals

Each tree is small and weak, but together they form a powerful ensemble.
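
To make the loop concrete, here is a minimal from-scratch sketch for regression with squared error, using shallow sklearn trees as the weak learners. It is illustrative only; XGBoost's internals (second-order gradients, regularized split scoring, histogram building) are considerably more sophisticated.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost_fit(X, y, n_trees=100, learning_rate=0.1):
    base = float(np.mean(y))                 # 1. initial prediction: the mean
    pred = np.full(len(y), base)
    trees = []
    for _ in range(n_trees):
        residuals = y - pred                  # 2. errors of the current ensemble
        tree = DecisionTreeRegressor(max_depth=3)
        tree.fit(X, residuals)                # 3. fit a small tree to the residuals
        pred += learning_rate * tree.predict(X)  # 4. shrink and add its predictions
        trees.append(tree)                    # 5. repeat with the new residuals
    return base, trees

def gradient_boost_predict(X, base, trees, learning_rate=0.1):
    pred = np.full(X.shape[0], base)
    for tree in trees:
        pred += learning_rate * tree.predict(X)
    return pred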

Basic Usage

import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report

# Prepare data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Create DMatrix (XGBoost's optimized data structure)
dtrain = xgb.DMatrix(X_train, label=y_train)
dtest = xgb.DMatrix(X_test, label=y_test)

# Parameters
params = {
    'objective': 'binary:logistic',  # or 'multi:softmax' for multiclass
    'eval_metric': 'auc',
    'max_depth': 6,
    'learning_rate': 0.1
    # note: with the native API the number of trees is set via num_boost_round
    # in xgb.train; n_estimators belongs to the sklearn wrapper and is ignored here
}

# Train
model = xgb.train(params, dtrain, num_boost_round=100)

# Predict
predictions = model.predict(dtest)
pred_labels = (predictions > 0.5).astype(int)

print(classification_report(y_test, pred_labels))
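
The native API also ships a built-in cross-validation helper, xgb.cv, which reuses the params and dtrain objects defined above. A quick sketch:

# 5-fold cross-validation; returns a DataFrame with one row per boosting round
cv_results = xgb.cv(
    params,
    dtrain,
    num_boost_round=200,
    nfold=5,
    metrics='auc',
    early_stopping_rounds=20,
    seed=42
)

# With early stopping the table is truncated at the best round
print(cv_results[['train-auc-mean', 'test-auc-mean']].tail(1))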

Sklearn API

from xgboost import XGBClassifier, XGBRegressor

# Classification
clf = XGBClassifier(
    n_estimators=100,
    max_depth=6,
    learning_rate=0.1,
    subsample=0.8,
    colsample_bytree=0.8,
    random_state=42,
    eval_metric='logloss'
)

clf.fit(X_train, y_train, eval_set=[(X_test, y_test)], verbose=10)

# Predictions
predictions = clf.predict(X_test)
probabilities = clf.predict_proba(X_test)

# Regression
reg = XGBRegressor(
    n_estimators=100,
    max_depth=6,
    learning_rate=0.1
)
reg.fit(X_train, y_train)

Key Hyperparameters

  • n_estimators: Number of trees (100-1000)
  • max_depth: Tree depth (3-10, deeper = more complex)
  • learning_rate: Step size (0.01-0.3, lower = more trees needed)
  • subsample: Row sampling (0.5-1.0)
  • colsample_bytree: Column sampling (0.5-1.0)
  • min_child_weight: Minimum sum of weights in a leaf
  • gamma: Minimum loss reduction for split
  • reg_alpha: L1 regularization
  • reg_lambda: L2 regularization
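
The regularization-related parameters at the end of this list rarely show up in quick-start examples. A sketch of a more conservative configuration might look like this (the values are illustrative, not recommendations):

clf = XGBClassifier(
    n_estimators=500,
    learning_rate=0.05,     # smaller steps: expect to need more trees
    max_depth=4,            # shallower trees are less prone to overfitting
    subsample=0.8,          # sample 80% of rows per tree
    colsample_bytree=0.8,   # sample 80% of columns per tree
    min_child_weight=5,     # require more weight per leaf before splitting
    gamma=1.0,              # minimum loss reduction to accept a split
    reg_alpha=0.1,          # L1 penalty on leaf weights
    reg_lambda=1.0,         # L2 penalty on leaf weights
    random_state=42
)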

Hyperparameter Tuning

from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
import optuna

# Grid Search
param_grid = {
    'max_depth': [3, 5, 7],
    'learning_rate': [0.01, 0.1, 0.2],
    'n_estimators': [100, 200, 300],
    'subsample': [0.8, 0.9, 1.0]
}

grid_search = GridSearchCV(
    XGBClassifier(eval_metric='logloss'),
    param_grid, cv=5, scoring='roc_auc', n_jobs=-1
)
grid_search.fit(X_train, y_train)
print(f"Best params: {grid_search.best_params_}")

# Optuna (Bayesian optimization)
def objective(trial):
    params = {
        'max_depth': trial.suggest_int('max_depth', 3, 10),
        'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.3),
        'n_estimators': trial.suggest_int('n_estimators', 100, 500),
        'subsample': trial.suggest_float('subsample', 0.6, 1.0),
        'colsample_bytree': trial.suggest_float('colsample_bytree', 0.6, 1.0),
    }

    model = XGBClassifier(**params, eval_metric='logloss')
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    return accuracy_score(y_test, predictions)

study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=50)
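
Once the study finishes, the winning configuration can be read back from the study object and used to refit a final model; a minimal sketch:

print(f"Best accuracy: {study.best_value:.4f}")
print(f"Best params: {study.best_params}")

# Refit on the training data with the best parameters found
best_model = XGBClassifier(**study.best_params, eval_metric='logloss')
best_model.fit(X_train, y_train)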

Feature Importance

import matplotlib.pyplot as plt

# Built-in feature importance
xgb.plot_importance(model, max_num_features=20)
plt.tight_layout()
plt.show()

# Get importance scores
importance = model.get_score(importance_type='gain')  # or 'weight', 'cover'

# SHAP values for better interpretability
import shap

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)

# Summary plot
shap.summary_plot(shap_values, X_test)

# Single prediction explanation
shap.force_plot(explainer.expected_value, shap_values[0], X_test.iloc[0])

LightGBM & CatBoost Alternatives

# LightGBM - faster, handles large datasets
import lightgbm as lgb

lgb_model = lgb.LGBMClassifier(
    n_estimators=100,
    learning_rate=0.1,
    num_leaves=31,
    max_depth=-1
)
lgb_model.fit(X_train, y_train)

# CatBoost - handles categorical features natively
from catboost import CatBoostClassifier

cat_model = CatBoostClassifier(
    iterations=100,
    learning_rate=0.1,
    depth=6,
    cat_features=categorical_columns  # Pass categorical column indices/names
)
cat_model.fit(X_train, y_train, verbose=10)

Best Practices

  • Start with defaults: XGBoost defaults are reasonable
  • Use early stopping: Prevent overfitting (see the sketch after this list)
  • Lower learning rate, more trees: Often improves performance
  • Feature engineering matters: Good features beat hyperparameter tuning
  • Cross-validation: Always validate properly
  • Handle imbalanced data: Use the scale_pos_weight parameter
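
As referenced above, here is a minimal sketch of early stopping and scale_pos_weight with the native API, assuming the dtrain/dtest DMatrix objects from the Basic Usage section and integer 0/1 labels:

import numpy as np

# scale_pos_weight ~ (number of negative samples) / (number of positive samples)
neg, pos = np.bincount(np.asarray(y_train).astype(int))

params = {
    'objective': 'binary:logistic',
    'eval_metric': 'auc',
    'max_depth': 4,
    'learning_rate': 0.05,
    'scale_pos_weight': neg / pos
}

# Stop when the validation AUC has not improved for 50 rounds
model = xgb.train(
    params,
    dtrain,
    num_boost_round=1000,
    evals=[(dtest, 'validation')],
    early_stopping_rounds=50
)
print(f"Best iteration: {model.best_iteration}, best AUC: {model.best_score}")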

Master XGBoost with Expert Mentorship

Our Data Science program covers XGBoost and gradient boosting in depth. Win competitions and build production models with expert guidance.
