The Problem
Imagine Service A calls Service B, which calls Service C. Service C is down. What happens? Service B keeps retrying and timing out. Service A waits on B. Requests pile up. Soon your entire system is frozen - one failed service has taken everything down.
A circuit breaker is like an electrical circuit breaker: when things fail too often, it "trips" and stops trying, returning a fallback response instead.
// WITHOUT Circuit Breaker
User user = userService.getUser(id); // Waits 30 seconds... timeout
User user = userService.getUser(id); // Waits 30 seconds again...
// Thread pool exhausted, system crashes

// WITH Circuit Breaker
User user = userService.getUser(id); // Fails
User user = userService.getUser(id); // Fails
User user = userService.getUser(id); // Circuit OPENS!
User user = userService.getUser(id); // Returns fallback immediately
Circuit Breaker States
CLOSED (Normal)
↓ failures exceed threshold
OPEN (Failing fast - returns fallback immediately)
↓ wait duration expires
HALF-OPEN (Testing - allows limited requests)
↓ requests succeed → CLOSED
↓ requests fail → OPEN
Closed
Normal operation. Requests pass through. Failures are counted.
Open
Too many failures. Requests fail immediately with fallback.
Half-Open
Testing recovery. Limited requests allowed to check if service is back.
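These states can also be exercised directly through the Resilience4j core API, which is handy in tests. A minimal standalone sketch (no Spring required; the thresholds mirror the configuration shown later):

import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig;
import java.time.Duration;

CircuitBreakerConfig config = CircuitBreakerConfig.custom()
    .failureRateThreshold(50)                        // open at 50% failures
    .slidingWindowSize(10)
    .minimumNumberOfCalls(5)
    .waitDurationInOpenState(Duration.ofSeconds(30))
    .permittedNumberOfCallsInHalfOpenState(3)
    .build();

CircuitBreaker cb = CircuitBreaker.of("userService", config);
cb.getState(); // CLOSED - normal operation

// Manual transitions, mainly useful in tests:
cb.transitionToOpenState();     // OPEN - calls fail fast
cb.transitionToHalfOpenState(); // HALF_OPEN - limited trial calls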
Resilience4j Setup
<!-- pom.xml -->
<dependency>
    <groupId>io.github.resilience4j</groupId>
    <artifactId>resilience4j-spring-boot3</artifactId>
    <version>2.2.0</version> <!-- not managed by the Spring Boot BOM; use the latest 2.x -->
</dependency>
<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-aop</artifactId>
</dependency>
Basic Circuit Breaker
@Slf4j
@Service
public class UserService {

    @Autowired
    private UserClient userClient;

    // io.github.resilience4j.circuitbreaker.annotation.CircuitBreaker
    @CircuitBreaker(name = "userService", fallbackMethod = "getUserFallback")
    public User getUser(Long id) {
        return userClient.getUser(id); // Calls external service
    }

    // Fallback when circuit is open or call fails
    public User getUserFallback(Long id, Exception ex) {
        log.warn("Circuit breaker triggered for user {}: {}", id, ex.getMessage());
        return new User(id, "Unknown", "Unavailable"); // Default user
    }
}
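When the circuit is open, calls are rejected with a `CallNotPermittedException`, which is routed to the fallback like any other failure. Resilience4j picks the fallback whose exception parameter matches most closely, so an extra overload can react to "circuit open" specifically - a sketch using the same service:

// Invoked only when the circuit is OPEN (most specific exception match wins)
public User getUserFallback(Long id, CallNotPermittedException ex) {
    log.warn("Circuit open for user {}, serving default", id);
    return new User(id, "Unknown", "Unavailable");
}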
Configuration
# application.yml
resilience4j:
  circuitbreaker:
    instances:
      userService:
        # When to open the circuit
        failure-rate-threshold: 50        # Open if 50% of calls fail
        slow-call-rate-threshold: 100     # Or if 100% of calls are slow
        slow-call-duration-threshold: 2s  # What counts as "slow"

        # How many calls to evaluate
        sliding-window-type: COUNT_BASED
        sliding-window-size: 10           # Last 10 calls
        minimum-number-of-calls: 5        # Need at least 5 calls to evaluate

        # Recovery
        wait-duration-in-open-state: 30s  # Wait before trying again
        permitted-number-of-calls-in-half-open-state: 3  # Test calls

        # What counts as failure
        record-exceptions:
          - java.io.IOException
          - java.net.SocketTimeoutException
        ignore-exceptions:
          - com.example.BusinessException
Retry Pattern
@Slf4j
@Service
public class PaymentService {

    @Autowired
    private PaymentClient paymentClient;

    @Retry(name = "paymentService", fallbackMethod = "paymentFallback")
    public PaymentResult processPayment(Payment payment) {
        return paymentClient.process(payment);
    }

    public PaymentResult paymentFallback(Payment payment, Exception ex) {
        log.error("Payment failed after retries: {}", ex.getMessage());
        return PaymentResult.pending("Will retry later");
    }
}
# application.yml
resilience4j:
  retry:
    instances:
      paymentService:
        max-attempts: 3
        wait-duration: 1s
        enable-exponential-backoff: true   # Required for the multiplier to apply
        exponential-backoff-multiplier: 2  # waits 1s, then 2s (3 attempts total)
        retry-exceptions:
          - java.io.IOException
        ignore-exceptions:
          - com.example.InvalidPaymentException
Rate Limiter
@Service
public class ApiService {

    @Autowired
    private ExternalClient externalClient;

    @RateLimiter(name = "apiService", fallbackMethod = "rateLimitFallback")
    public Response callExternalApi(Request request) {
        return externalClient.call(request);
    }

    public Response rateLimitFallback(Request request, Exception ex) {
        throw new TooManyRequestsException("Rate limit exceeded. Try later.");
    }
}
resilience4j:
  ratelimiter:
    instances:
      apiService:
        limit-for-period: 10      # 10 requests
        limit-refresh-period: 1s  # per second
        timeout-duration: 0s      # Don't wait, fail immediately
Bulkhead Pattern
Isolate resources to prevent one slow service from consuming all threads.
@Service
public class OrderService {

    @Autowired
    private OrderProcessor orderProcessor;

    // THREADPOOL bulkheads require an async return type
    @Bulkhead(name = "orderService", type = Bulkhead.Type.THREADPOOL)
    public CompletableFuture<Order> processOrder(Order order) {
        return CompletableFuture.completedFuture(orderProcessor.process(order));
    }
}
resilience4j:
  # Semaphore bulkhead (type = SEMAPHORE, the default)
  bulkhead:
    instances:
      orderService:
        max-concurrent-calls: 10 # Max 10 concurrent requests
        max-wait-duration: 0s

  # Thread pool bulkhead (type = THREADPOOL, as used above)
  thread-pool-bulkhead:
    instances:
      orderService:
        max-thread-pool-size: 10
        core-thread-pool-size: 5
        queue-capacity: 20
Combining Patterns
@Service
public class ProductService {

    @Autowired
    private ProductClient productClient;

    @Autowired
    private ProductCache productCache; // Returns Optional<Product>

    // Order matters! Retry → CircuitBreaker → RateLimiter → Bulkhead
    @Retry(name = "productService")
    @CircuitBreaker(name = "productService", fallbackMethod = "getProductFallback")
    @RateLimiter(name = "productService")
    @Bulkhead(name = "productService")
    public Product getProduct(Long id) {
        return productClient.getProduct(id);
    }

    public Product getProductFallback(Long id, Exception ex) {
        return productCache.get(id) // Try cache first
            .orElse(Product.unavailable(id));
    }
}
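The annotation stack above matches Resilience4j's default aspect order (Retry outermost, Bulkhead innermost). The same nesting can be written out explicitly with the core `Decorators` API, which makes the ordering visible. A sketch, assuming the four instances (`retry`, `circuitBreaker`, `rateLimiter`, `bulkhead`) have already been fetched from their registries:

import io.github.resilience4j.decorators.Decorators;
import java.util.function.Supplier;

// Each with* call wraps everything before it, so apply the innermost
// decorator first and Retry last to mirror the annotation stack.
Supplier<Product> decorated = Decorators
    .ofSupplier(() -> productClient.getProduct(id))
    .withBulkhead(bulkhead)             // innermost
    .withRateLimiter(rateLimiter)
    .withCircuitBreaker(circuitBreaker)
    .withRetry(retry)                   // outermost
    .decorate();

Product product = decorated.get();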
Monitoring with Actuator
<!-- pom.xml -->
<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-actuator</artifactId>
</dependency>
# application.yml
management:
  endpoints:
    web:
      exposure:
        include: health, circuitbreakers, retries, ratelimiters
  health:
    circuitbreakers:
      enabled: true
# Check circuit breaker status
GET /actuator/circuitbreakers

# Response:
{
  "circuitBreakers": {
    "userService": {
      "state": "CLOSED",
      "failureRate": "0%",
      "slowCallRate": "0%",
      "numberOfBufferedCalls": 5,
      "numberOfFailedCalls": 0
    }
  }
}
Best Practices
- Tune thresholds: Start with defaults, adjust based on real traffic patterns
- Meaningful fallbacks: Cached data, default values, or graceful degradation
- Monitor metrics: Track circuit states, failure rates, and response times (see the event-listener sketch after this list)
- Test failure scenarios: Chaos engineering - deliberately break things
- Don't hide all errors: Let some failures surface so you know there's a problem
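For the monitoring point above, Resilience4j's event publisher offers a lightweight hook. A minimal sketch, assuming the `CircuitBreakerRegistry` auto-configured by resilience4j-spring-boot3 (the `CircuitBreakerLogger` component name is illustrative):

import io.github.resilience4j.circuitbreaker.CircuitBreakerRegistry;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.stereotype.Component;

@Component
public class CircuitBreakerLogger {

    private static final Logger log = LoggerFactory.getLogger(CircuitBreakerLogger.class);

    public CircuitBreakerLogger(CircuitBreakerRegistry registry) {
        // Log every state transition (CLOSED → OPEN, OPEN → HALF_OPEN, ...)
        registry.circuitBreaker("userService").getEventPublisher()
            .onStateTransition(event -> log.warn("Circuit '{}' transitioned: {}",
                event.getCircuitBreakerName(), event.getStateTransition()));
    }
}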