Circuit Breaker
A per-upstream circuit breaker that prevents cascading failures and fails fast when an upstream service is unhealthy.
Overview
Based on pkg/v2/infrastructure/circuitbreaker/, the circuit breaker provides:
- Three States: Closed, Open, Half-Open with automatic transitions
- Per-Upstream: Separate circuit breakers for each upstream service
- Configurable Thresholds: Failure count, success count, timeout
- Metrics: Prometheus metrics for state transitions and request outcomes
- Thread-Safe: Atomic counters on the hot path, with a mutex guarding state transitions
Circuit Breaker States
State Machine
┌──────────┐
│  CLOSED  │  Normal operation
│          │  All requests pass through
│          │  Failures counted
└────┬─────┘
     │ Failures >= Threshold
     ↓
┌──────────┐
│   OPEN   │  Fast-fail mode
│          │  Reject new requests immediately
│          │  Wait for timeout
└────┬─────┘
     │ Timeout expired
     ↓
┌──────────┐
│ HALF-OPEN│  Recovery testing
│          │  Limited requests allowed
│          │  Test if service recovered
└────┬─────┘
     │ Successes >= Threshold → CLOSED
     │ Any Failure → OPEN
State Descriptions
CLOSED (Normal Operation):
- All requests pass through
- Failures are counted
- When failures >= threshold → transition to OPEN
- Successes reset failure counter
OPEN (Fast-Fail):
- All requests fail immediately with ErrCircuitOpen
- No requests sent to upstream
- Wait for configured timeout
- After timeout → transition to HALF-OPEN
HALF-OPEN (Recovery Testing):
- Limited number of concurrent requests allowed
- Success increments success counter
- When successes >= threshold → transition to CLOSED
- Any failure → transition back to OPEN
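The transitions described above can be sketched as a small state type. This is an illustrative sketch only, not the contents of state.go: the numeric values are chosen to match the circuit_breaker_state gauge values listed under Metrics, the string labels match the from/to labels of the state-change counter, and CanTransition is a hypothetical helper.

```go
package main

import "fmt"

// State is an assumed representation of the three breaker states.
// The numeric values mirror the circuit_breaker_state gauge values.
type State int32

const (
	StateClosed   State = iota // 0: normal operation
	StateOpen                  // 1: fast-fail
	StateHalfOpen              // 2: recovery testing
)

// String returns a label suitable for logs and metric labels.
func (s State) String() string {
	switch s {
	case StateClosed:
		return "closed"
	case StateOpen:
		return "open"
	case StateHalfOpen:
		return "half_open"
	default:
		return "unknown"
	}
}

// CanTransition (hypothetical helper) reports whether the state machine
// described above allows moving from one state to another.
func CanTransition(from, to State) bool {
	switch from {
	case StateClosed:
		return to == StateOpen
	case StateOpen:
		return to == StateHalfOpen
	case StateHalfOpen:
		return to == StateClosed || to == StateOpen
	}
	return false
}

func main() {
	fmt.Println(StateHalfOpen, CanTransition(StateHalfOpen, StateClosed)) // half_open true
	fmt.Println(CanTransition(StateClosed, StateHalfOpen))                // false
}
```

Note that CLOSED never moves directly to HALF-OPEN: recovery testing is only reachable through OPEN once the timeout has expired.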
Implementation
Based on breaker_impl.go:
CircuitBreaker Structure
type circuitBreaker struct {
	name             string
	config           *Config
	mu               sync.Mutex   // Guards state transitions and openedAt
	state            atomic.Value // Holds a State
	failures         atomic.Int32
	successes        atomic.Int32
	halfOpenRequests atomic.Int32
	openedAt         time.Time
}
Configuration
type Config struct {
	FailureThreshold int32         // Failures before opening (default: 5)
	SuccessThreshold int32         // Successes to close from half-open (default: 2)
	Timeout          time.Duration // Time before half-open (default: 60s)
	HalfOpenRequests int32         // Concurrent requests in half-open (default: 3)
}
Default Configuration
func DefaultConfig() *Config {
	return &Config{
		FailureThreshold: 5,
		SuccessThreshold: 2,
		Timeout:          60 * time.Second,
		HalfOpenRequests: 3,
	}
}
Circuit Breaker Operations
Execute
Run a function through the circuit breaker with Call:
cb := NewCircuitBreaker("todos-bff", config)
err := cb.Call(ctx, func() error {
	resp, err := httpClient.Do(req)
	if err != nil {
		return err // Network error → recorded as failure
	}
	defer resp.Body.Close() // Release the connection when the callback returns
	if resp.StatusCode >= 500 {
		return &ProxyError{StatusCode: resp.StatusCode} // 5xx → recorded as failure
	}
	return nil // Success
})
if errors.Is(err, ErrCircuitOpen) {
	// Circuit breaker is open: the request was rejected without calling upstream
}
Call Flow
1. Check if can proceed
   ├─ CLOSED → Allow
   ├─ OPEN → Check timeout
   │  ├─ Timeout expired → Transition to HALF-OPEN, Allow
   │  └─ Timeout not expired → Reject with ErrCircuitOpen
   └─ HALF-OPEN → Check concurrent limit
      ├─ Under limit → Increment counter, Allow
      └─ At limit → Reject with ErrCircuitOpen
2. Execute function
   └─ Call provided function
3. Record result
   ├─ Success → RecordSuccess()
   └─ Error → RecordFailure()
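Step 1 of the flow above can be sketched as a single admission check. This is a simplified illustration, not the code in breaker_impl.go: sketchBreaker and allow are hypothetical names, everything is guarded by one mutex instead of atomics, the current time is passed in so the behavior is easy to follow, and the request that triggers the OPEN → HALF-OPEN transition is assumed to count as the first probe.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

var ErrCircuitOpen = errors.New("circuit breaker is open")

type State int32

const (
	StateClosed State = iota
	StateOpen
	StateHalfOpen
)

// sketchBreaker is a simplified, mutex-only stand-in for the real breaker.
type sketchBreaker struct {
	mu               sync.Mutex
	state            State
	openedAt         time.Time
	timeout          time.Duration
	halfOpenLimit    int32
	halfOpenInFlight int32
}

// allow implements step 1 of the call flow for a given point in time.
func (b *sketchBreaker) allow(now time.Time) error {
	b.mu.Lock()
	defer b.mu.Unlock()

	switch b.state {
	case StateClosed:
		return nil
	case StateOpen:
		if now.Sub(b.openedAt) < b.timeout {
			return ErrCircuitOpen
		}
		// Timeout expired: move to half-open and admit this request as a probe.
		b.state = StateHalfOpen
		b.halfOpenInFlight = 1
		return nil
	default: // StateHalfOpen
		if b.halfOpenInFlight >= b.halfOpenLimit {
			return ErrCircuitOpen
		}
		b.halfOpenInFlight++
		return nil
	}
}

func main() {
	opened := time.Now()
	b := &sketchBreaker{state: StateOpen, openedAt: opened, timeout: 60 * time.Second, halfOpenLimit: 2}

	fmt.Println(b.allow(opened.Add(10 * time.Second))) // still open: rejected
	fmt.Println(b.allow(opened.Add(61 * time.Second))) // timeout expired: first probe admitted
	fmt.Println(b.allow(opened.Add(62 * time.Second))) // second probe admitted
	fmt.Println(b.allow(opened.Add(63 * time.Second))) // probe limit reached: rejected
}
```

Steps 2 and 3 then run the supplied function and report the outcome through RecordSuccess or RecordFailure.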
RecordFailure
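func (cb *circuitBreaker) RecordFailure() {
	state := cb.GetState()
	switch state {
	case StateClosed:
		failures := cb.failures.Add(1)
		if failures >= cb.config.FailureThreshold {
			// Threshold reached: open the circuit and start the timeout
			cb.mu.Lock()
			cb.state.Store(StateOpen)
			cb.openedAt = time.Now()
			cb.mu.Unlock()
		}
	case StateHalfOpen:
		// Any failure during recovery testing reopens the circuit
		cb.mu.Lock()
		cb.state.Store(StateOpen)
		cb.openedAt = time.Now()
		cb.halfOpenRequests.Store(0)
		cb.successes.Store(0) // Discard partial progress so the next half-open window starts from zero
		cb.mu.Unlock()
	}
}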
RecordSuccess
func (cb *circuitBreaker) RecordSuccess() {
	state := cb.GetState()
	switch state {
	case StateClosed:
		cb.failures.Store(0) // Reset failure count
	case StateHalfOpen:
		cb.halfOpenRequests.Add(-1) // Request completed
		successes := cb.successes.Add(1)
		if successes >= cb.config.SuccessThreshold {
			cb.mu.Lock()
			cb.state.Store(StateClosed)
			cb.failures.Store(0)
			cb.successes.Store(0)
			cb.mu.Unlock()
		}
	}
}
Circuit Breaker Manager
Based on manager.go:
Manager Structure
type Manager interface {
	Execute(ctx context.Context, name string, fn func() error) error
	GetBreaker(name string) CircuitBreaker
	GetAllBreakers() map[string]CircuitBreaker
}
type manager struct {
	breakers sync.Map // name -> CircuitBreaker
	config   *Config
}
Per-Upstream Circuit Breakers
// Create manager with default config
mgr := NewManager(DefaultConfig())
// Execute request through circuit breaker for specific upstream
err := mgr.Execute(ctx, "http://todos-bff.svc:8080", func() error {
	resp, err := httpClient.Do(req)
	// ...
	return err
})
// Separate circuit breaker for each upstream
err = mgr.Execute(ctx, "http://analytics.svc:8080", func() error {
	// Different circuit breaker instance
	return nil
})
Benefits:
- Isolate failures per upstream
- One slow upstream doesn't affect others
- Automatic breaker creation per upstream URL
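Automatic creation per upstream URL can be sketched with sync.Map's LoadOrStore, which keeps exactly one instance per key even when two goroutines request the same upstream at once. This is a sketch under assumptions, not the code in manager.go: registry, breaker, and get are hypothetical names standing in for the manager, the CircuitBreaker implementation, and the lookup path.

```go
package main

import (
	"fmt"
	"sync"
)

// breaker is a placeholder for the real CircuitBreaker implementation.
type breaker struct{ name string }

// registry is a hypothetical stand-in for the manager's breaker map.
type registry struct {
	breakers sync.Map // name -> *breaker
}

// get returns the breaker for name, creating it on first use.
func (r *registry) get(name string) *breaker {
	// Fast path: the breaker already exists.
	if v, ok := r.breakers.Load(name); ok {
		return v.(*breaker)
	}
	// LoadOrStore returns the existing value if another goroutine stored one first,
	// so all callers end up sharing a single instance per name.
	v, _ := r.breakers.LoadOrStore(name, &breaker{name: name})
	return v.(*breaker)
}

func main() {
	r := &registry{}
	a := r.get("http://todos-bff.svc:8080")
	b := r.get("http://todos-bff.svc:8080")
	c := r.get("http://analytics.svc:8080")
	fmt.Println(a == b, a == c) // true false
}
```

The initial Load avoids allocating a new breaker on every call once the entry exists.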
Proxy Integration
From pkg/v2/application/proxy/proxy_service.go:
Hybrid Approach
err = s.breakerManager.Execute(reqCtx, upstreamURL, func() error {
	var execErr error
	resp, execErr = s.client.Do(upstreamReq)
	if execErr != nil {
		return execErr // Network error → circuit breaker records failure
	}
	// Hybrid: 5xx treated as failure for CB but response returned to caller
	if resp.StatusCode >= 500 {
		return &ProxyError{
			StatusCode: resp.StatusCode,
			Code:       "UPSTREAM_ERROR",
		}
	}
	return nil
})
Hybrid Benefits:
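- Network errors: Counted as failures; once the failure threshold is reached the circuit breaker opens and requests fail fast
- 5xx responses: Counted as failures, but the response is still passed to the caller because resp is assigned outside the callback
- Caller can handle 5xx appropriately (retry, fallback, error message)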
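One way the caller might tell these outcomes apart after Execute returns is sketched below. It is illustrative only: classify is a hypothetical helper, the Error method on ProxyError is assumed (the source shows only the StatusCode and Code fields), and the returned strings stand in for whatever the proxy actually does in each case.

```go
package main

import (
	"errors"
	"fmt"
)

// ProxyError mirrors the fields used in the snippet above; the Error method is assumed.
type ProxyError struct {
	StatusCode int
	Code       string
}

func (e *ProxyError) Error() string {
	return fmt.Sprintf("%s: upstream returned %d", e.Code, e.StatusCode)
}

var ErrCircuitOpen = errors.New("circuit breaker is open")

// classify (hypothetical) maps the error returned by the breaker to a caller action.
func classify(err error) string {
	var pe *ProxyError
	switch {
	case err == nil:
		return "forward response"
	case errors.Is(err, ErrCircuitOpen):
		return "fast-fail: no upstream call was made"
	case errors.As(err, &pe):
		// The upstream answered, so a response is available to pass through.
		return fmt.Sprintf("forward upstream %d response", pe.StatusCode)
	default:
		return "network error: no response available"
	}
}

func main() {
	fmt.Println(classify(nil))
	fmt.Println(classify(ErrCircuitOpen))
	fmt.Println(classify(&ProxyError{StatusCode: 503, Code: "UPSTREAM_ERROR"}))
	fmt.Println(classify(errors.New("dial tcp: connection refused")))
}
```

Using errors.Is and errors.As rather than direct comparison keeps the checks working if an intermediate layer wraps the error.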
Metrics
Based on metrics.go:
Prometheus Metrics
State Gauge:
circuit_breaker_state{name="http://todos-bff.svc:8080"} 0 # CLOSED
circuit_breaker_state{name="http://todos-bff.svc:8080"} 1 # OPEN
circuit_breaker_state{name="http://todos-bff.svc:8080"} 2 # HALF-OPEN
Request Counter:
circuit_breaker_requests_total{name="...", result="success"} 1250
circuit_breaker_requests_total{name="...", result="failure"} 50
circuit_breaker_requests_total{name="...", result="rejected"} 10
State Transitions Counter:
circuit_breaker_state_changes_total{name="...", from="closed", to="open"} 3
circuit_breaker_state_changes_total{name="...", from="open", to="half_open"} 3
circuit_breaker_state_changes_total{name="...", from="half_open", to="closed"} 2
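As a worked example, the rejection rate can be derived from the three request counters. The sketch below uses the sample values from the Request Counter listing; rejectionRate is a hypothetical helper, and in practice this ratio would more likely be computed in the monitoring system over a time window than in application code.

```go
package main

import "fmt"

// rejectionRate returns rejected / (success + failure + rejected),
// or 0 when there has been no traffic.
func rejectionRate(success, failure, rejected float64) float64 {
	total := success + failure + rejected
	if total == 0 {
		return 0
	}
	return rejected / total
}

func main() {
	// Sample counter values from the Request Counter listing above.
	rate := rejectionRate(1250, 50, 10)
	fmt.Printf("%.2f%%\n", rate*100) // 0.76%
}
```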
Configuration Examples
Conservative (Slower to Open)
config := &Config{
	FailureThreshold: 10,                // More failures before opening
	SuccessThreshold: 5,                 // More successes to close
	Timeout:          120 * time.Second, // Longer wait before testing
	HalfOpenRequests: 5,                 // More test requests
}
Use Case: Stable services with occasional transient failures
Aggressive (Faster to Open)
config := &Config{
	FailureThreshold: 3,                // Open quickly
	SuccessThreshold: 1,                // Close quickly if recovered
	Timeout:          30 * time.Second, // Test recovery soon
	HalfOpenRequests: 2,                // Minimal test requests
}
Use Case: Unstable services, prefer fast-fail over retries
Balanced (Default)
config := &Config{
	FailureThreshold: 5,
	SuccessThreshold: 2,
	Timeout:          60 * time.Second,
	HalfOpenRequests: 3,
}
Use Case: Most production scenarios
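When tuning these values, a sanity check at startup can catch settings that would make the breaker misbehave, such as a zero timeout or a zero threshold. The validate function below is a hypothetical helper, not part of the documented package; the Config type is repeated so the sketch is self-contained.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

type Config struct {
	FailureThreshold int32
	SuccessThreshold int32
	Timeout          time.Duration
	HalfOpenRequests int32
}

// validate (hypothetical) rejects configurations with non-positive settings.
func validate(c *Config) error {
	switch {
	case c == nil:
		return errors.New("config is nil")
	case c.FailureThreshold < 1:
		return fmt.Errorf("FailureThreshold must be >= 1, got %d", c.FailureThreshold)
	case c.SuccessThreshold < 1:
		return fmt.Errorf("SuccessThreshold must be >= 1, got %d", c.SuccessThreshold)
	case c.Timeout <= 0:
		return fmt.Errorf("Timeout must be positive, got %s", c.Timeout)
	case c.HalfOpenRequests < 1:
		return fmt.Errorf("HalfOpenRequests must be >= 1, got %d", c.HalfOpenRequests)
	}
	return nil
}

func main() {
	aggressive := &Config{FailureThreshold: 3, SuccessThreshold: 1, Timeout: 30 * time.Second, HalfOpenRequests: 2}
	fmt.Println(validate(aggressive))                   // <nil>
	fmt.Println(validate(&Config{FailureThreshold: 5})) // rejected: SuccessThreshold is zero
}
```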
Best Practices
Configuration
✅ DO:
- Use separate circuit breakers per upstream
- Configure thresholds based on upstream SLA
- Monitor circuit breaker state transitions
- Tune based on observed failure patterns
❌ DON'T:
- Use same circuit breaker for multiple upstreams
- Set failure threshold too low (causes false positives)
- Set timeout too long (delayed recovery)
- Ignore circuit breaker metrics
Error Classification
✅ DO:
- Treat network errors as failures
- Treat 5xx responses as failures
- Treat timeouts as failures
- Reset failure count on success
❌ DON'T:
- Treat 4xx as failures (they are client errors, not upstream faults)
- Treat slow responses as success (enforce timeouts so they count as failures)
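The classification rules above fit in one small predicate. This is a minimal sketch, assuming timeouts surface as a non-nil error (for example context.DeadlineExceeded from a request context with a deadline); countsAsFailure and errConnRefused are hypothetical names used only for illustration.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"net/http"
)

// errConnRefused is a stand-in for a network-level error.
var errConnRefused = errors.New("dial tcp: connection refused")

// countsAsFailure (hypothetical) reports whether an outcome should be
// recorded as a circuit breaker failure.
func countsAsFailure(statusCode int, err error) bool {
	if err != nil {
		// Network errors and timeouts: no usable response from the upstream.
		return true
	}
	// 5xx indicates an upstream fault; 4xx and below do not count.
	return statusCode >= http.StatusInternalServerError
}

func main() {
	fmt.Println(countsAsFailure(0, errConnRefused))           // true
	fmt.Println(countsAsFailure(0, context.DeadlineExceeded)) // true
	fmt.Println(countsAsFailure(http.StatusBadGateway, nil))  // true
	fmt.Println(countsAsFailure(http.StatusNotFound, nil))    // false
	fmt.Println(countsAsFailure(http.StatusOK, nil))          // false
}
```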
Recovery Testing
✅ DO:
- Limit concurrent requests in half-open
- Transition to closed quickly on success
- Transition to open immediately on failure
- Log state transitions for debugging
❌ DON'T:
- Allow unlimited requests in half-open
- Require too many successes to close
- Stay in half-open too long
Monitoring
Key Metrics:
- Circuit breaker state per upstream
- Request rejection rate
- State transition frequency
- Time spent in each state
Alerts:
- Circuit breaker open for > 5 minutes
- Frequent open/close oscillation (indicates threshold tuning needed)
- High request rejection rate
Code References
| Component | File | Purpose |
|---|---|---|
| CircuitBreaker | breaker_impl.go | Core circuit breaker logic |
| Manager | manager.go | Per-upstream breaker management |
| Config | config.go | Configuration structures |
| State | state.go | State enum and helpers |
| Metrics | metrics.go | Prometheus metrics |
Related Topics
- HTTP Proxy & Routing - Proxy integration
- Platform Architecture - Infrastructure layer
- Rate Limiting - Complementary protection mechanism