Distributed Tracing with Zipkin & Jaeger in Java

The Problem

A user reports: "The checkout is slow." In a monolith, you check one log file. With microservices, the request touches API Gateway → User Service → Inventory Service → Payment Service → Order Service → Notification Service. Which one is slow? Without a shared identifier, you are left searching six separate log files for lines that may or may not belong to the same request.

Distributed tracing assigns a unique ID to each request and tracks it across all services.

// Logs WITHOUT tracing
[user-service]  Processing user 123
[order-service] Creating order
[payment-service] Processing payment
// Which request is which? No idea.

// Logs WITH tracing
[user-service]  [trace-id: abc123] Processing user 123
[order-service] [trace-id: abc123] Creating order
[payment-service] [trace-id: abc123] Processing payment
// All from the same request!
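
Once every line carries a trace ID, reassembling a single request from mixed service logs becomes a simple grouping step. The sketch below is only an illustration: it assumes the "[trace-id: ...]" marker format from the sample lines above, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TraceLogGrouper {

    // Matches the "[trace-id: abc123]" marker used in the sample log lines.
    private static final Pattern TRACE_ID = Pattern.compile("\\[trace-id: ([^\\]]+)\\]");

    /** Groups log lines by trace ID; lines without a trace ID are skipped. */
    static Map<String, List<String>> groupByTraceId(List<String> lines) {
        Map<String, List<String>> byTrace = new LinkedHashMap<>();
        for (String line : lines) {
            Matcher m = TRACE_ID.matcher(line);
            if (m.find()) {
                byTrace.computeIfAbsent(m.group(1), k -> new ArrayList<>()).add(line);
            }
        }
        return byTrace;
    }

    public static void main(String[] args) {
        List<String> logs = List.of(
            "[user-service]  [trace-id: abc123] Processing user 123",
            "[order-service] [trace-id: xyz789] Creating order",
            "[order-service] [trace-id: abc123] Creating order",
            "[payment-service] [trace-id: abc123] Processing payment");
        // Print how many lines belong to each request
        groupByTraceId(logs).forEach((id, group) ->
            System.out.println(id + " -> " + group.size() + " line(s)"));
    }
}
```

In practice a log aggregation tool does this grouping for you; the point is that the trace ID is the join key.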

Key Concepts

Trace

The entire journey of a request. Has a unique trace ID.

Span

A single operation within a trace (e.g., one service call).

Context Propagation

Passing trace ID from service to service via HTTP headers.
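
To make context propagation concrete, here is a minimal sketch of the caller writing the trace context into a header and the callee reading it back. It assumes the W3C Trace Context "traceparent" header (version-traceId-spanId-flags); the class names are hypothetical, a plain Map stands in for real HTTP headers, and the tracing library normally does all of this for you.

```java
import java.util.HashMap;
import java.util.Map;

class TraceparentPropagation {

    static final String HEADER = "traceparent";

    /** Minimal trace context; IDs are lowercase hex strings. */
    static final class TraceContext {
        final String traceId;   // 32 hex characters
        final String spanId;    // 16 hex characters
        final boolean sampled;

        TraceContext(String traceId, String spanId, boolean sampled) {
            this.traceId = traceId;
            this.spanId = spanId;
            this.sampled = sampled;
        }
    }

    /** Caller side: write the context into the outgoing request headers. */
    static void inject(TraceContext ctx, Map<String, String> headers) {
        String flags = ctx.sampled ? "01" : "00";
        headers.put(HEADER, "00-" + ctx.traceId + "-" + ctx.spanId + "-" + flags);
    }

    /** Callee side: read the context back, or return null if absent or malformed. */
    static TraceContext extract(Map<String, String> headers) {
        // Simplification: real HTTP header names are case-insensitive.
        String value = headers.get(HEADER);
        if (value == null) return null;
        String[] parts = value.split("-");
        if (parts.length != 4 || parts[1].length() != 32 || parts[2].length() != 16) return null;
        try {
            // The lowest bit of the flags field is the "sampled" flag.
            boolean sampled = (Integer.parseInt(parts[3], 16) & 1) == 1;
            return new TraceContext(parts[1], parts[2], sampled);
        } catch (NumberFormatException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        Map<String, String> headers = new HashMap<>();
        inject(new TraceContext("4bf92f3577b34da6a3ce929d0e0e4736", "00f067aa0ba902b7", true), headers);
        System.out.println(headers.get(HEADER));

        TraceContext received = extract(headers);
        System.out.println("callee continues trace " + received.traceId);
    }
}
```

The callee keeps the trace ID, creates a new span ID for its own work, and records the extracted span ID as the parent. That parent link is what lets the backend draw the span tree.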

Trace: abc123
├── Span: API Gateway (50ms)
│   └── Span: User Service (20ms)
│       └── Span: Database Query (5ms)
├── Span: Order Service (100ms)
│   ├── Span: Inventory Check (30ms)
│   └── Span: Payment Service (60ms)
└── Span: Notification Service (10ms)

Total: 160ms (some parallel, some sequential)
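
Because spans can overlap, the end-to-end time of a trace is not the sum of its span durations. A rough sketch of the difference follows; the span names, start offsets, and class names are made up for illustration and are not taken from the tree above.

```java
import java.util.List;

class TraceDuration {

    /** A finished span: when it started (relative to the trace) and how long it ran. */
    static final class Span {
        final String name;
        final long startMs;
        final long durationMs;

        Span(String name, long startMs, long durationMs) {
            this.name = name;
            this.startMs = startMs;
            this.durationMs = durationMs;
        }

        long endMs() {
            return startMs + durationMs;
        }
    }

    /** Wall-clock time of the trace: latest end minus earliest start. */
    static long wallClockMs(List<Span> spans) {
        if (spans.isEmpty()) return 0;
        long start = Long.MAX_VALUE;
        long end = Long.MIN_VALUE;
        for (Span s : spans) {
            start = Math.min(start, s.startMs);
            end = Math.max(end, s.endMs());
        }
        return end - start;
    }

    /** Naive sum of durations; overstates the total when spans run in parallel. */
    static long summedMs(List<Span> spans) {
        return spans.stream().mapToLong(s -> s.durationMs).sum();
    }

    public static void main(String[] args) {
        List<Span> spans = List.of(
            new Span("inventory-check", 0, 30),   // runs in parallel with payment
            new Span("payment", 0, 60),
            new Span("notification", 60, 10));    // starts after payment finishes
        System.out.println("sum of spans: " + summedMs(spans) + "ms");
        System.out.println("wall clock:   " + wallClockMs(spans) + "ms");
    }
}
```

With these made-up numbers the spans sum to 100ms but the request takes 70ms, which is why tracing UIs show spans on a timeline rather than as a flat list.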

Setup with Micrometer Tracing

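<!-- pom.xml (Spring Boot 3.x; versions are managed by the Spring Boot BOM) -->
<!-- Actuator supplies the tracing auto-configuration -->
<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-actuator</artifactId>
</dependency>
<dependency>
    <groupId>io.micrometer</groupId>
    <artifactId>micrometer-tracing-bridge-brave</artifactId>
</dependency>
<dependency>
    <groupId>io.zipkin.reporter2</groupId>
    <artifactId>zipkin-reporter-brave</artifactId>
</dependency>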

# application.yml
management:
  tracing:
    sampling:
      probability: 1.0  # 100% of requests (use 0.1 for 10% in production)
  zipkin:
    tracing:
      endpoint: http://localhost:9411/api/v2/spans

logging:
  pattern:
    level: "%5p [${spring.application.name:},%X{traceId:-},%X{spanId:-}]"
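
To build intuition for what a sampling probability such as 0.1 means, here is a small sketch of a sampler that derives the keep/drop decision from the trace ID, so the same trace always gets the same answer. This is an illustration only, not how Brave or Micrometer implement sampling; the class name and the bucket scheme are assumptions.

```java
class TraceIdSampler {

    private static final long BUCKETS = 10_000L;

    // Number of buckets (out of 10,000) whose traces are kept
    private final long threshold;

    TraceIdSampler(double probability) {
        if (probability < 0.0 || probability > 1.0) {
            throw new IllegalArgumentException("probability must be between 0.0 and 1.0");
        }
        this.threshold = Math.round(probability * BUCKETS);
    }

    /** Deterministic decision: the same trace ID always yields the same result. */
    boolean isSampled(long traceId) {
        long bucket = Math.floorMod(traceId, BUCKETS);  // always in 0..9999
        return bucket < threshold;
    }

    public static void main(String[] args) {
        TraceIdSampler sampler = new TraceIdSampler(0.1);
        int kept = 0;
        for (long id = 0; id < 10_000; id++) {
            if (sampler.isSampled(id)) kept++;
        }
        System.out.println("kept " + kept + " of 10000 traces");
    }
}
```

The decision is made once, where the trace starts, and travels with the propagated context so that downstream services do not each sample independently and leave traces with missing spans.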

Run Zipkin

# Using Docker
docker run -d -p 9411:9411 openzipkin/zipkin

# Or download and run
curl -sSL https://zipkin.io/quickstart.sh | bash -s
java -jar zipkin.jar

# Access UI at http://localhost:9411

Automatic Instrumentation

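With the dependencies added, Spring Boot creates spans and propagates trace context for the components below. Some work out of the box; others need an extra module or setting, so check the documentation for your versions:

Spring MVC controllers
RestTemplate / WebClient calls (when built from the auto-configured RestTemplateBuilder / WebClient.Builder)
Feign clients (may need the feign-micrometer module)
Spring Data repositories (database spans may need an add-on such as datasource-micrometer)
Message queues (Kafka, RabbitMQ) (observation may need to be enabled on templates and listeners)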

@RestController
public class OrderController {

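    @Autowired
    private RestTemplate restTemplate;  // must be a bean built via RestTemplateBuilder to be instrumented

    @Autowired
    private OrderService orderService;

    @GetMapping("/orders/{id}")
    public Order getOrder(@PathVariable Long id) {
        // Load the order first; its user ID is needed for the downstream call
        Order order = orderService.findById(id);

        // The trace context is propagated automatically in the request headers
        User user = restTemplate.getForObject(
            "http://user-service/users/{userId}",
            User.class,
            order.getUserId()
        );
        // user-service records its spans under the same trace ID
        return order;
    }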
}

Custom Spans

@Service
public class PaymentService {

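    @Autowired
    private Tracer tracer;  // io.micrometer.tracing.Tracer

    @Autowired
    private PaymentGateway paymentGateway;  // application-specific client (type name assumed)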

    public PaymentResult processPayment(Payment payment) {
        // Create custom span for important operations
        Span span = tracer.nextSpan().name("process-payment").start();
        try (Tracer.SpanInScope ws = tracer.withSpan(span)) {
            // Add useful tags
            span.tag("payment.amount", payment.getAmount().toString());
            span.tag("payment.method", payment.getMethod());

            PaymentResult result = paymentGateway.process(payment);

            span.tag("payment.status", result.getStatus());
            return result;

        } catch (Exception e) {
            span.error(e);  // Record error in trace
            throw e;
        } finally {
            span.end();
        }
    }
}

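Using Annotations

The annotations below are processed through Spring AOP, so they need AspectJ support on the classpath (for example spring-boot-starter-aop). On Spring Boot 3.2 and later, also set management.observations.annotations.enabled=true.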

@Service
public class InventoryService {

    @NewSpan("check-inventory")
    public boolean checkAvailability(
        @SpanTag("product.id") Long productId,
        @SpanTag("quantity") int quantity) {

        return inventoryRepository.checkStock(productId, quantity);
    }
}

Jaeger Alternative

<!-- Use OpenTelemetry for Jaeger -->
<dependency>
    <groupId>io.micrometer</groupId>
    <artifactId>micrometer-tracing-bridge-otel</artifactId>
</dependency>
<dependency>
    <groupId>io.opentelemetry</groupId>
    <artifactId>opentelemetry-exporter-otlp</artifactId>
</dependency>

# Run Jaeger
docker run -d --name jaeger \
  -p 16686:16686 \
  -p 4318:4318 \
  jaegertracing/all-in-one:latest

# application.yml
management:
  otlp:
    tracing:
      endpoint: http://localhost:4318/v1/traces

# Access UI at http://localhost:16686

Correlating Logs with Traces

<!-- Logback pattern in logback-spring.xml (wrapped here for readability) -->
<pattern>
    %d{yyyy-MM-dd HH:mm:ss} [%thread] %-5level %logger{36}
    [traceId=%X{traceId}, spanId=%X{spanId}] - %msg%n
</pattern>

// Now your logs include trace context:
2024-01-15 10:30:45 [http-nio-8080-exec-1] INFO  OrderService
[traceId=abc123, spanId=def456] - Processing order 789
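
The %X{traceId} and %X{spanId} placeholders read from the logging framework's MDC, which is essentially a per-thread map that the tracing library fills in while a span is in scope. The sketch below imitates that mechanism in plain Java to show the idea; MiniMdc is a hypothetical stand-in, not the SLF4J API, and the timestamp and thread name are omitted.

```java
import java.util.HashMap;
import java.util.Map;

class MiniMdc {

    // One context map per thread, like a logging framework's MDC
    private static final ThreadLocal<Map<String, String>> CONTEXT =
        ThreadLocal.withInitial(HashMap::new);

    static void put(String key, String value) {
        CONTEXT.get().put(key, value);
    }

    static void clear() {
        CONTEXT.get().clear();
    }

    /** Formats a log line the way the pattern above would, using the current thread's context. */
    static String format(String level, String logger, String message) {
        Map<String, String> ctx = CONTEXT.get();
        return String.format("%-5s %s [traceId=%s, spanId=%s] - %s",
            level, logger,
            ctx.getOrDefault("traceId", ""),
            ctx.getOrDefault("spanId", ""),
            message);
    }

    public static void main(String[] args) {
        // The tracing library does this when a span enters scope...
        put("traceId", "abc123");
        put("spanId", "def456");
        System.out.println(format("INFO", "OrderService", "Processing order 789"));

        // ...and clears it when the span leaves scope.
        clear();
        System.out.println(format("INFO", "OrderService", "No active span"));
    }
}
```

Because the context is per-thread, work handed off to another thread (an executor or a reactive pipeline) only keeps its trace IDs if the context is carried across explicitly.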

Zipkin vs Jaeger

Feature	Zipkin	Jaeger
Setup	Simpler, single jar	More components
UI	Basic but functional	More features, DAG view
Storage	Memory, MySQL, Cassandra, ES	Memory, Badger, Cassandra, ES (Kafka as an ingestion buffer)
Best For	Getting started quickly	Production at scale

Best Practices

Sample in production: Don't trace 100% of requests - use 1-10% sampling
Add meaningful tags: user.id, order.id, payment.status - things you'll search for
Name spans well: "db-query-users" not "span1"
Set up alerts: Alert on traces over X duration
Correlate with metrics: Link traces to dashboards
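
As a rough illustration of the "alert on traces over X duration" practice, the sketch below picks out trace IDs whose total duration exceeds a threshold. The class name, input shape, and numbers are hypothetical; in a real setup this check would normally live in your tracing backend or monitoring system rather than in application code.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

class SlowTraceFinder {

    /** Returns the IDs of traces slower than the threshold, sorted for stable output. */
    static List<String> slowTraceIds(Map<String, Long> durationsMs, long thresholdMs) {
        return durationsMs.entrySet().stream()
            .filter(e -> e.getValue() > thresholdMs)
            .map(Map.Entry::getKey)
            .sorted()
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Trace ID -> end-to-end duration in milliseconds (made-up values)
        Map<String, Long> durations = Map.of(
            "abc123", 160L,
            "def456", 40L,
            "ghi789", 900L);
        System.out.println("slow traces: " + slowTraceIds(durations, 150));
    }
}
```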
