The Problem
Imagine Service A calls Service B, which calls Service C. Service C is down. What happens? Service B keeps retrying and timing out. Service A waits on B. Requests pile up. Soon your entire system is frozen - one failed service has taken everything down.
A circuit breaker is like an electrical circuit breaker: when things fail too often, it "trips" and stops trying, returning a fallback response instead.
// WITHOUT Circuit Breaker
User user = userService.getUser(id); // Waits 30 seconds... timeout
User user = userService.getUser(id); // Waits 30 seconds again...
// Thread pool exhausted, system crashes

// WITH Circuit Breaker
User user = userService.getUser(id); // Fails
User user = userService.getUser(id); // Fails
User user = userService.getUser(id); // Circuit OPENS!
User user = userService.getUser(id); // Returns fallback immediately
Circuit Breaker States
CLOSED (Normal)
↓ failures exceed threshold
OPEN (Failing fast - returns fallback immediately)
↓ wait duration expires
HALF-OPEN (Testing - allows limited requests)
↓ requests succeed → CLOSED
↓ requests fail → OPEN
Closed
Normal operation. Requests pass through. Failures are counted.
Open
Too many failures. Requests fail immediately with fallback.
Half-Open
Testing recovery. Limited requests allowed to check if service is back.
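These states can also be exercised directly through the Resilience4j core API, which is handy in tests. A minimal standalone sketch (no Spring required; the thresholds mirror the configuration shown later):

import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig;
import java.time.Duration;

CircuitBreakerConfig config = CircuitBreakerConfig.custom()
    .failureRateThreshold(50)                        // open at 50% failures
    .slidingWindowSize(10)
    .minimumNumberOfCalls(5)
    .waitDurationInOpenState(Duration.ofSeconds(30))
    .permittedNumberOfCallsInHalfOpenState(3)
    .build();

CircuitBreaker cb = CircuitBreaker.of("userService", config);
cb.getState(); // CLOSED - normal operation

// Manual transitions, mainly useful in tests:
cb.transitionToOpenState();     // OPEN - calls fail fast
cb.transitionToHalfOpenState(); // HALF_OPEN - limited trial calls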
Resilience4j Setup
<!-- pom.xml -->
<dependency>
    <groupId>io.github.resilience4j</groupId>
    <artifactId>resilience4j-spring-boot3</artifactId>
    <version>2.2.0</version> <!-- not managed by the Spring Boot BOM; use the latest 2.x -->
</dependency>
<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-aop</artifactId>
</dependency>
Basic Circuit Breaker
@Slf4j
@Service
public class UserService {

    @Autowired
    private UserClient userClient;

    // io.github.resilience4j.circuitbreaker.annotation.CircuitBreaker
    @CircuitBreaker(name = "userService", fallbackMethod = "getUserFallback")
    public User getUser(Long id) {
        return userClient.getUser(id); // Calls external service
    }

    // Fallback when circuit is open or call fails
    public User getUserFallback(Long id, Exception ex) {
        log.warn("Circuit breaker triggered for user {}: {}", id, ex.getMessage());
        return new User(id, "Unknown", "Unavailable"); // Default user
    }
}
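When the circuit is open, calls are rejected with a `CallNotPermittedException`, which is routed to the fallback like any other failure. Resilience4j picks the fallback whose exception parameter matches most closely, so an extra overload can react to "circuit open" specifically - a sketch using the same service:

// Invoked only when the circuit is OPEN (most specific exception match wins)
public User getUserFallback(Long id, CallNotPermittedException ex) {
    log.warn("Circuit open for user {}, serving default", id);
    return new User(id, "Unknown", "Unavailable");
}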
Configuration
# application.yml
resilience4j:
  circuitbreaker:
    instances:
      userService:
        # When to open the circuit
        failure-rate-threshold: 50        # Open if 50% of calls fail
        slow-call-rate-threshold: 100     # Or if 100% of calls are slow
        slow-call-duration-threshold: 2s  # What counts as "slow"

        # How many calls to evaluate
        sliding-window-type: COUNT_BASED
        sliding-window-size: 10           # Last 10 calls
        minimum-number-of-calls: 5        # Need at least 5 calls to evaluate

        # Recovery
        wait-duration-in-open-state: 30s  # Wait before trying again
        permitted-number-of-calls-in-half-open-state: 3  # Test calls

        # What counts as failure
        record-exceptions:
          - java.io.IOException
          - java.net.SocketTimeoutException
        ignore-exceptions:
          - com.example.BusinessException
Retry Pattern
@Slf4j
@Service
public class PaymentService {

    @Autowired
    private PaymentClient paymentClient;

    @Retry(name = "paymentService", fallbackMethod = "paymentFallback")
    public PaymentResult processPayment(Payment payment) {
        return paymentClient.process(payment);
    }

    public PaymentResult paymentFallback(Payment payment, Exception ex) {
        log.error("Payment failed after retries: {}", ex.getMessage());
        return PaymentResult.pending("Will retry later");
    }
}
# application.yml
resilience4j:
  retry:
    instances:
      paymentService:
        max-attempts: 3
        wait-duration: 1s
        enable-exponential-backoff: true   # Required for the multiplier to apply
        exponential-backoff-multiplier: 2  # waits 1s, then 2s (3 attempts total)
        retry-exceptions:
          - java.io.IOException
        ignore-exceptions:
          - com.example.InvalidPaymentException
Rate Limiter
@Service
public class ApiService {

    @Autowired
    private ExternalClient externalClient;

    @RateLimiter(name = "apiService", fallbackMethod = "rateLimitFallback")
    public Response callExternalApi(Request request) {
        return externalClient.call(request);
    }

    public Response rateLimitFallback(Request request, Exception ex) {
        throw new TooManyRequestsException("Rate limit exceeded. Try later.");
    }
}
resilience4j:
  ratelimiter:
    instances:
      apiService:
        limit-for-period: 10      # 10 requests
        limit-refresh-period: 1s  # per second
        timeout-duration: 0s      # Don't wait, fail immediately
Bulkhead Pattern
Isolate resources to prevent one slow service from consuming all threads.
@Service
public class OrderService {

    @Autowired
    private OrderProcessor orderProcessor;

    // THREADPOOL bulkheads require an async return type
    @Bulkhead(name = "orderService", type = Bulkhead.Type.THREADPOOL)
    public CompletableFuture<Order> processOrder(Order order) {
        return CompletableFuture.completedFuture(orderProcessor.process(order));
    }
}
resilience4j:
  # Semaphore bulkhead (type = SEMAPHORE, the default)
  bulkhead:
    instances:
      orderService:
        max-concurrent-calls: 10 # Max 10 concurrent requests
        max-wait-duration: 0s

  # Thread pool bulkhead (type = THREADPOOL, as used above)
  thread-pool-bulkhead:
    instances:
      orderService:
        max-thread-pool-size: 10
        core-thread-pool-size: 5
        queue-capacity: 20
Combining Patterns
@Service
public class ProductService {

    @Autowired
    private ProductClient productClient;

    @Autowired
    private ProductCache productCache; // Returns Optional<Product>

    // Order matters! Retry → CircuitBreaker → RateLimiter → Bulkhead
    @Retry(name = "productService")
    @CircuitBreaker(name = "productService", fallbackMethod = "getProductFallback")
    @RateLimiter(name = "productService")
    @Bulkhead(name = "productService")
    public Product getProduct(Long id) {
        return productClient.getProduct(id);
    }

    public Product getProductFallback(Long id, Exception ex) {
        return productCache.get(id) // Try cache first
            .orElse(Product.unavailable(id));
    }
}
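The annotation stack above matches Resilience4j's default aspect order (Retry outermost, Bulkhead innermost). The same nesting can be written out explicitly with the core `Decorators` API, which makes the ordering visible. A sketch, assuming the four instances (`retry`, `circuitBreaker`, `rateLimiter`, `bulkhead`) have already been fetched from their registries:

import io.github.resilience4j.decorators.Decorators;
import java.util.function.Supplier;

// Each with* call wraps everything before it, so apply the innermost
// decorator first and Retry last to mirror the annotation stack.
Supplier<Product> decorated = Decorators
    .ofSupplier(() -> productClient.getProduct(id))
    .withBulkhead(bulkhead)             // innermost
    .withRateLimiter(rateLimiter)
    .withCircuitBreaker(circuitBreaker)
    .withRetry(retry)                   // outermost
    .decorate();

Product product = decorated.get();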
Monitoring with Actuator
<!-- pom.xml -->
<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-actuator</artifactId>
</dependency>
# application.yml
management:
  endpoints:
    web:
      exposure:
        include: health, circuitbreakers, retries, ratelimiters
  health:
    circuitbreakers:
      enabled: true
# Check circuit breaker status
GET /actuator/circuitbreakers

# Response:
{
  "circuitBreakers": {
    "userService": {
      "state": "CLOSED",
      "failureRate": "0%",
      "slowCallRate": "0%",
      "numberOfBufferedCalls": 5,
      "numberOfFailedCalls": 0
    }
  }
}
Best Practices
- Tune thresholds: Start with defaults, adjust based on real traffic patterns
- Meaningful fallbacks: Cached data, default values, or graceful degradation
- Monitor metrics: Track circuit states, failure rates, and response times (see the event-listener sketch after this list)
- Test failure scenarios: Chaos engineering - deliberately break things
- Don't hide all errors: Let some failures surface so you know there's a problem
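For the monitoring point above, Resilience4j's event publisher offers a lightweight hook. A minimal sketch, assuming the `CircuitBreakerRegistry` auto-configured by resilience4j-spring-boot3 (the `CircuitBreakerLogger` component name is illustrative):

import io.github.resilience4j.circuitbreaker.CircuitBreakerRegistry;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.stereotype.Component;

@Component
public class CircuitBreakerLogger {

    private static final Logger log = LoggerFactory.getLogger(CircuitBreakerLogger.class);

    public CircuitBreakerLogger(CircuitBreakerRegistry registry) {
        // Log every state transition (CLOSED → OPEN, OPEN → HALF_OPEN, ...)
        registry.circuitBreaker("userService").getEventPublisher()
            .onStateTransition(event -> log.warn("Circuit '{}' transitioned: {}",
                event.getCircuitBreakerName(), event.getStateTransition()));
    }
}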