Why Docker for ML?

Docker solves the "it works on my machine" problem: it packages your model, code, and all dependencies into an image that runs the same way on any host with Docker installed.

  • Reproducibility: Same environment in dev, test, and production
  • Portability: Run anywhere Docker runs
  • Isolation: No dependency conflicts
  • Scalability: Easy to scale with Kubernetes

Docker Basics

# Check Docker installation
docker --version

# Pull an image
docker pull python:3.11-slim

# Run a container
docker run -it python:3.11-slim python --version

# List running containers
docker ps

# List all containers
docker ps -a

# Stop a container
docker stop container_id

# Remove a container
docker rm container_id

# List images
docker images

# Remove an image
docker rmi image_id

Basic ML Dockerfile

# Dockerfile
FROM python:3.11-slim

# Set working directory
WORKDIR /app

# Set environment variables
ENV PYTHONDONTWRITEBYTECODE=1
ENV PYTHONUNBUFFERED=1

# Install system dependencies
RUN apt-get update && apt-get install -y \
    build-essential \
    && rm -rf /var/lib/apt/lists/*

# Copy requirements first (for caching)
COPY requirements.txt .

# Install Python dependencies
RUN pip install --no-cache-dir -r requirements.txt

# Copy application code
COPY . .

# Expose port
EXPOSE 8000

# Run the application
CMD ["python", "app.py"]
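
The Dockerfile's CMD assumes an app.py entrypoint. A minimal stdlib stand-in (hypothetical; a real service would use Flask or FastAPI from requirements.txt) that serves the /health endpoint later sections probe:

```python
# app.py — minimal placeholder for CMD ["python", "app.py"]
from http.server import BaseHTTPRequestHandler, HTTPServer

class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/health":
            body = b'{"status": "ok"}'
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)

    def log_message(self, fmt, *args):
        pass  # keep this sketch quiet; real apps should log requests

def make_server(port: int = 8000) -> HTTPServer:
    # Bind 0.0.0.0 so the published port (-p 8000:8000) reaches the server
    return HTTPServer(("0.0.0.0", port), HealthHandler)

# In the container: make_server().serve_forever()
```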

Multi-Stage Builds for Smaller Images

# Multi-stage Dockerfile for production
FROM python:3.11-slim AS builder

WORKDIR /app

# Install build dependencies
RUN apt-get update && apt-get install -y build-essential

# Create virtual environment
RUN python -m venv /opt/venv
ENV PATH="/opt/venv/bin:$PATH"

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Production stage
FROM python:3.11-slim AS production

WORKDIR /app

# Copy virtual environment from builder
COPY --from=builder /opt/venv /opt/venv
ENV PATH="/opt/venv/bin:$PATH"

# Copy application
COPY . .

# Create non-root user
RUN useradd -m appuser && chown -R appuser:appuser /app
USER appuser

EXPOSE 8000
CMD ["gunicorn", "-w", "4", "-b", "0.0.0.0:8000", "app:app"]

GPU Support with NVIDIA Docker

# GPU-enabled Dockerfile
FROM nvidia/cuda:11.8.0-cudnn8-runtime-ubuntu22.04

# Install Python (Ubuntu 22.04 ships Python 3.10; add the deadsnakes PPA if you need 3.11)
RUN apt-get update && apt-get install -y \
    python3 \
    python3-pip \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /app

COPY requirements.txt .
RUN pip3 install --no-cache-dir -r requirements.txt

COPY . .

CMD ["python3", "train.py"]

# Build
docker build -t ml-gpu .

# Run with GPU access (host needs the NVIDIA Container Toolkit)
docker run --gpus all ml-gpu

# Or specific GPUs
docker run --gpus '"device=0,1"' ml-gpu
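
A hedged sketch of a sanity check at the top of train.py: confirm the container actually sees a GPU before training starts (torch is an assumption here; swap in your framework's equivalent check):

```python
# check_gpu.py — fail fast if the container has no GPU access
def cuda_available() -> bool:
    try:
        import torch  # present only if requirements.txt installs PyTorch
    except ImportError:
        return False
    return torch.cuda.is_available()

print(f"CUDA available: {cuda_available()}")
```

If this prints False inside a container started with --gpus all, the usual culprit is a missing NVIDIA Container Toolkit on the host.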

Docker Compose for ML Workflows

# docker-compose.yml
version: '3.8'

services:
  api:
    build:
      context: .
      dockerfile: Dockerfile
    ports:
      - "8000:8000"
    volumes:
      - ./models:/app/models
    environment:
      - MODEL_PATH=/app/models/model.pkl
      - LOG_LEVEL=INFO
    depends_on:
      - redis
    healthcheck:  # requires curl inside the api image
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s
      timeout: 10s
      retries: 3

  redis:
    image: redis:7-alpine
    ports:
      - "6379:6379"

  worker:
    build: .
    command: celery -A tasks worker --loglevel=info
    depends_on:
      - redis

  mlflow:
    image: ghcr.io/mlflow/mlflow:v2.9.0
    ports:
      - "5000:5000"
    volumes:
      - ./mlruns:/mlruns
    command: mlflow server --host 0.0.0.0 --backend-store-uri sqlite:///mlflow.db

# Commands (Compose v2; use docker-compose with older installs)
docker compose up -d
docker compose logs -f api
docker compose down
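
The `environment:` entries above reach the process as ordinary environment variables. A sketch of reading them with defaults, so the same image also runs outside Compose:

```python
import os

# Values come from Compose's `environment:` block when present;
# the defaults keep a bare `docker run` working too.
MODEL_PATH = os.environ.get("MODEL_PATH", "models/model.pkl")
LOG_LEVEL = os.environ.get("LOG_LEVEL", "INFO")

print(f"model={MODEL_PATH} log_level={LOG_LEVEL}")
```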

Optimizing Docker Images for ML

# .dockerignore
__pycache__
*.pyc
*.pyo
.git
.gitignore
*.md
.env
.venv
venv
notebooks/
tests/
*.ipynb
.pytest_cache
.coverage
htmlcov/
data/raw/
*.log

# Optimized requirements installation
# Install heavy packages first (better caching)
COPY requirements-base.txt .
RUN pip install --no-cache-dir -r requirements-base.txt

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Use slim or alpine base images when possible
FROM python:3.11-slim
# python:3.11-slim is ~150MB vs ~900MB for the full python:3.11 image;
# python:3.11-alpine is ~50MB but musl can break prebuilt ML wheels

# Skip the pip download cache (smaller layers)
RUN pip install --no-cache-dir -r requirements.txt

# Combine RUN commands
RUN apt-get update && apt-get install -y \
    package1 \
    package2 \
    && rm -rf /var/lib/apt/lists/* \
    && apt-get clean

Development vs Production

# Dockerfile.dev
FROM python:3.11-slim

WORKDIR /app

COPY requirements-dev.txt .
RUN pip install --no-cache-dir -r requirements-dev.txt

# Mount code as volume for hot reload
CMD ["uvicorn", "app:app", "--reload", "--host", "0.0.0.0"]

# docker-compose.dev.yml
version: '3.8'
services:
  api:
    build:
      context: .
      dockerfile: Dockerfile.dev
    volumes:
      - .:/app  # Mount for hot reload
      - /app/.venv  # Exclude venv
    ports:
      - "8000:8000"
    environment:
      - DEBUG=true

# Run the development environment
docker compose -f docker-compose.dev.yml up

Serving Models with Docker

# Complete ML API Dockerfile
FROM python:3.11-slim

WORKDIR /app

# Install curl (needed by the HEALTHCHECK below) and Python dependencies
RUN apt-get update && apt-get install -y curl \
    && rm -rf /var/lib/apt/lists/*
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy model and code
COPY models/ ./models/
COPY src/ ./src/
COPY app.py .

# Set environment variables
ENV MODEL_PATH=/app/models/model.joblib
ENV PORT=8000

EXPOSE $PORT

# Health check
HEALTHCHECK --interval=30s --timeout=10s --start-period=5s --retries=3 \
    CMD curl -f http://localhost:$PORT/health || exit 1

# Run with Gunicorn
CMD gunicorn app:app \
    --workers 4 \
    --worker-class uvicorn.workers.UvicornWorker \
    --bind 0.0.0.0:$PORT \
    --timeout 120 \
    --access-logfile - \
    --error-logfile -
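
At startup the app deserializes the model from MODEL_PATH, as set by ENV above. A self-contained sketch using stdlib pickle (StubModel and the temp file are stand-ins; the Dockerfile above ships a joblib-serialized model instead):

```python
import os
import pickle
import tempfile

# StubModel stands in for a real trained estimator
class StubModel:
    def predict(self, rows):
        return [sum(r) for r in rows]

# Simulate the model file the image copies into /app/models
path = os.path.join(tempfile.mkdtemp(), "model.pkl")
with open(path, "wb") as f:
    pickle.dump(StubModel(), f)

# The app reads the path from the environment, never a hard-coded location
os.environ["MODEL_PATH"] = path
with open(os.environ["MODEL_PATH"], "rb") as f:
    model = pickle.load(f)

print(model.predict([[1, 2], [3, 4]]))  # → [3, 7]
```

Loading once at startup (not per request) is what makes the Gunicorn workers above cheap to fan out.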

Container Registry and Deployment

# Build and tag
docker build -t myapp:v1.0 .

# Tag for registry
docker tag myapp:v1.0 registry.example.com/myapp:v1.0

# Push to registry
docker push registry.example.com/myapp:v1.0

# AWS ECR
aws ecr get-login-password --region us-east-1 | \
    docker login --username AWS --password-stdin 123456789.dkr.ecr.us-east-1.amazonaws.com

docker tag myapp:v1.0 123456789.dkr.ecr.us-east-1.amazonaws.com/myapp:v1.0
docker push 123456789.dkr.ecr.us-east-1.amazonaws.com/myapp:v1.0

# Google Container Registry
gcloud auth configure-docker
docker tag myapp:v1.0 gcr.io/my-project/myapp:v1.0
docker push gcr.io/my-project/myapp:v1.0

# Docker Hub
docker login
docker tag myapp:v1.0 username/myapp:v1.0
docker push username/myapp:v1.0

Best Practices

  • Pin versions: Use specific image tags, not :latest
  • Non-root user: Run containers as non-root for security
  • Health checks: Always include health check endpoints
  • Logging: Log to stdout/stderr for container orchestrators
  • Secrets: Use environment variables or secrets management
  • Layer caching: Order Dockerfile commands by change frequency

Master ML Deployment

Our Data Science program covers Docker, Kubernetes, and cloud deployment. Learn to deploy production-ready ML systems.
