Circuit Breaker

Per-upstream circuit breaker implementation to prevent cascading failures and provide fast-fail behavior for failing services.

Overview

Based on pkg/v2/infrastructure/circuitbreaker/, the circuit breaker provides:

  • Three States: Closed, Open, Half-Open with automatic transitions
  • Per-Upstream: Separate circuit breakers for each upstream service
  • Configurable Thresholds: Failure count, success count, timeout
  • Metrics: Prometheus metrics for state transitions and request outcomes
  • Thread-Safe: Atomic counters on the hot path, with a mutex guarding state transitions

Circuit Breaker States

State Machine

┌──────────┐
│  CLOSED  │  Normal operation
│          │  All requests pass through
│          │  Failures counted
└────┬─────┘
     │ Failures >= Threshold
     ▼
┌──────────┐
│   OPEN   │  Fast-fail mode
│          │  Reject new requests immediately
│          │  Wait for timeout
└────┬─────┘
     │ Timeout expired
     ▼
┌──────────┐
│ HALF-OPEN│  Recovery testing
│          │  Limited requests allowed
│          │  Test if service recovered
└────┬─────┘
     │ Successes >= Threshold → CLOSED
     │ Any Failure → OPEN

State Descriptions

CLOSED (Normal Operation):

  • All requests pass through
  • Failures are counted
  • When failures >= threshold → transition to OPEN
  • Successes reset failure counter

OPEN (Fast-Fail):

  • All requests fail immediately with ErrCircuitOpen
  • No requests sent to upstream
  • Wait for configured timeout
  • After timeout → transition to HALF-OPEN

HALF-OPEN (Recovery Testing):

  • Limited number of concurrent requests allowed
  • Success increments success counter
  • When successes >= threshold → transition to CLOSED
  • Any failure → transition back to OPEN
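
The State values used throughout this page come from state.go. As a minimal sketch (the real helpers may differ), the enum could look like the following; the string values are chosen to match the labels that appear in the metrics section below:

type State int32

const (
    StateClosed   State = iota // 0: normal operation
    StateOpen                  // 1: fast-fail
    StateHalfOpen              // 2: recovery testing
)

func (s State) String() string {
    switch s {
    case StateClosed:
        return "closed"
    case StateOpen:
        return "open"
    case StateHalfOpen:
        return "half_open"
    default:
        return "unknown"
    }
}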

Implementation

Based on breaker_impl.go:

CircuitBreaker Structure

type circuitBreaker struct {
    name   string
    config *Config

    state            atomic.Value // State
    failures         atomic.Int32
    successes        atomic.Int32
    halfOpenRequests atomic.Int32

    mu       sync.Mutex // guards state transitions and openedAt
    openedAt time.Time
}

Configuration

type Config struct {
    FailureThreshold int32         // Failures before opening (default: 5)
    SuccessThreshold int32         // Successes to close from half-open (default: 2)
    Timeout          time.Duration // Time before half-open (default: 60s)
    HalfOpenRequests int32         // Concurrent requests in half-open (default: 3)
}

Default Configuration

func DefaultConfig() *Config {
    return &Config{
        FailureThreshold: 5,
        SuccessThreshold: 2,
        Timeout:          60 * time.Second,
        HalfOpenRequests: 3,
    }
}

Circuit Breaker Operations

Execute

Executing a function through the circuit breaker:

cb := NewCircuitBreaker("todos-bff", config)

err := cb.Call(ctx, func() error {
    resp, err := httpClient.Do(req)
    if err != nil {
        return err // Network error → recorded as failure
    }
    if resp.StatusCode >= 500 {
        return &ProxyError{StatusCode: resp.StatusCode} // 5xx → recorded as failure
    }
    return nil // Success
})

if err == ErrCircuitOpen {
    // Circuit breaker is open, fast-fail
}

Call Flow

1. Check whether the request can proceed
   ├─ CLOSED → Allow
   ├─ OPEN → Check timeout
   │   ├─ Timeout expired → Transition to HALF-OPEN, Allow
   │   └─ Timeout not expired → Reject with ErrCircuitOpen
   └─ HALF-OPEN → Check concurrent limit
       ├─ Under limit → Increment counter, Allow
       └─ At limit → Reject with ErrCircuitOpen

2. Execute the function
   └─ Call the provided function

3. Record the result
   ├─ Success → RecordSuccess()
   └─ Error → RecordFailure()
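
A sketch of how step 1 could be implemented on the circuitBreaker struct shown above. The method name allowRequest is illustrative; the fields and config values are the ones already introduced:

func (cb *circuitBreaker) allowRequest() error {
    switch cb.GetState() {
    case StateClosed:
        return nil // Allow

    case StateOpen:
        cb.mu.Lock()
        defer cb.mu.Unlock()
        if time.Since(cb.openedAt) >= cb.config.Timeout {
            // Timeout expired → start recovery testing
            cb.state.Store(StateHalfOpen)
            cb.successes.Store(0)
            cb.halfOpenRequests.Store(1) // this request takes the first half-open slot
            return nil
        }
        return ErrCircuitOpen // Still open → fast-fail

    case StateHalfOpen:
        if cb.halfOpenRequests.Add(1) > cb.config.HalfOpenRequests {
            cb.halfOpenRequests.Add(-1) // Over the limit, undo the reservation
            return ErrCircuitOpen
        }
        return nil
    }
    return nil
}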

RecordFailure

func (cb *circuitBreaker) RecordFailure() {
    state := cb.GetState()

    switch state {
    case StateClosed:
        failures := cb.failures.Add(1)
        if failures >= cb.config.FailureThreshold {
            cb.mu.Lock()
            cb.state.Store(StateOpen)
            cb.openedAt = time.Now()
            cb.mu.Unlock()
        }

    case StateHalfOpen:
        cb.mu.Lock()
        cb.state.Store(StateOpen)
        cb.openedAt = time.Now()
        cb.halfOpenRequests.Store(0)
        cb.mu.Unlock()
    }
}

RecordSuccess

func (cb *circuitBreaker) RecordSuccess() {
    state := cb.GetState()

    switch state {
    case StateClosed:
        cb.failures.Store(0) // Reset failure count

    case StateHalfOpen:
        cb.halfOpenRequests.Add(-1) // Request completed
        successes := cb.successes.Add(1)

        if successes >= cb.config.SuccessThreshold {
            cb.mu.Lock()
            cb.state.Store(StateClosed)
            cb.failures.Store(0)
            cb.successes.Store(0)
            cb.mu.Unlock()
        }
    }
}

Circuit Breaker Manager

Based on manager.go:

Manager Structure

type Manager interface {
    Execute(ctx context.Context, name string, fn func() error) error
    GetBreaker(name string) CircuitBreaker
    GetAllBreakers() map[string]CircuitBreaker
}

type manager struct {
    breakers sync.Map // name -> CircuitBreaker
    config   *Config
}

Per-Upstream Circuit Breakers

// Create manager with default config
mgr := NewManager(DefaultConfig())

// Execute request through circuit breaker for specific upstream
err := mgr.Execute(ctx, "http://todos-bff.svc:8080", func() error {
    resp, err := httpClient.Do(req)
    // ... classify resp / err as in the earlier example
    return err
})

// Separate circuit breaker for each upstream
err = mgr.Execute(ctx, "http://analytics.svc:8080", func() error {
    // Different circuit breaker instance
    return nil
})

Benefits:

  • Isolate failures per upstream
  • One slow upstream doesn't affect others
  • Automatic breaker creation per upstream URL
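
A minimal sketch of how the manager could create breakers lazily per upstream using the sync.Map shown above (the exact logic in manager.go may differ):

func (m *manager) Execute(ctx context.Context, name string, fn func() error) error {
    return m.GetBreaker(name).Call(ctx, fn)
}

func (m *manager) GetBreaker(name string) CircuitBreaker {
    // Fast path: a breaker already exists for this upstream
    if existing, ok := m.breakers.Load(name); ok {
        return existing.(CircuitBreaker)
    }
    // Slow path: create and register on first use; LoadOrStore keeps exactly one instance
    cb, _ := m.breakers.LoadOrStore(name, NewCircuitBreaker(name, m.config))
    return cb.(CircuitBreaker)
}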

Proxy Integration

From pkg/v2/application/proxy/proxy_service.go:

Hybrid Approach

err = s.breakerManager.Execute(reqCtx, upstreamURL, func() error {
    var execErr error
    resp, execErr = s.client.Do(upstreamReq)

    if execErr != nil {
        return execErr // Network error → circuit breaker records failure
    }

    // Hybrid: 5xx treated as failure for CB but response returned to caller
    if resp.StatusCode >= 500 {
        return &ProxyError{
            StatusCode: resp.StatusCode,
            Code:       "UPSTREAM_ERROR",
        }
    }

    return nil
})

Hybrid Benefits:

  • Network errors: Circuit breaker opens, requests fail fast
  • 5xx responses: Circuit breaker records failure, but response passed to caller
  • Caller can handle 5xx appropriately (retry, fallback, error message)
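
On the caller side, one way these outcomes could be handled is sketched below. Mapping ErrCircuitOpen to 503 and the writeError/copyResponse helpers are assumptions for illustration, not necessarily what proxy_service.go does:

switch {
case errors.Is(err, ErrCircuitOpen):
    // Fast-fail: the upstream was never contacted
    writeError(w, http.StatusServiceUnavailable, "UPSTREAM_UNAVAILABLE") // hypothetical helper
case err != nil && resp != nil:
    // Hybrid path: 5xx was recorded as a failure, but the response is still forwarded
    copyResponse(w, resp) // hypothetical helper
case err != nil:
    // Network error: nothing to forward
    writeError(w, http.StatusBadGateway, "UPSTREAM_ERROR")
default:
    copyResponse(w, resp)
}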

Metrics

Based on metrics.go:

Prometheus Metrics

State Gauge:

circuit_breaker_state{name="http://todos-bff.svc:8080"} 0  # CLOSED
circuit_breaker_state{name="http://todos-bff.svc:8080"} 1  # OPEN
circuit_breaker_state{name="http://todos-bff.svc:8080"} 2  # HALF-OPEN

Request Counter:

circuit_breaker_requests_total{name="...", result="success"} 1250
circuit_breaker_requests_total{name="...", result="failure"} 50
circuit_breaker_requests_total{name="...", result="rejected"} 10

State Transitions Counter:

circuit_breaker_state_changes_total{name="...", from="closed", to="open"} 3
circuit_breaker_state_changes_total{name="...", from="open", to="half_open"} 3
circuit_breaker_state_changes_total{name="...", from="half_open", to="closed"} 2
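
A sketch of how these series might be registered with the Prometheus Go client (github.com/prometheus/client_golang); the metric names match the ones above, but the actual wiring in metrics.go may differ:

var (
    stateGauge = prometheus.NewGaugeVec(prometheus.GaugeOpts{
        Name: "circuit_breaker_state",
        Help: "Current circuit breaker state (0=closed, 1=open, 2=half-open)",
    }, []string{"name"})

    requestsTotal = prometheus.NewCounterVec(prometheus.CounterOpts{
        Name: "circuit_breaker_requests_total",
        Help: "Requests by result (success, failure, rejected)",
    }, []string{"name", "result"})

    stateChangesTotal = prometheus.NewCounterVec(prometheus.CounterOpts{
        Name: "circuit_breaker_state_changes_total",
        Help: "Circuit breaker state transitions",
    }, []string{"name", "from", "to"})
)

func init() {
    prometheus.MustRegister(stateGauge, requestsTotal, stateChangesTotal)
}

// Example: on a closed → open transition
// stateGauge.WithLabelValues(name).Set(float64(StateOpen))
// stateChangesTotal.WithLabelValues(name, "closed", "open").Inc()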

Configuration Examples

Conservative (Slower to Open)

config := &Config{
    FailureThreshold: 10,                // More failures before opening
    SuccessThreshold: 5,                 // More successes to close
    Timeout:          120 * time.Second, // Longer wait before testing
    HalfOpenRequests: 5,                 // More test requests
}

Use Case: Stable services with occasional transient failures

Aggressive (Faster to Open)

config := &Config{
    FailureThreshold: 3,                // Open quickly
    SuccessThreshold: 1,                // Close quickly if recovered
    Timeout:          30 * time.Second, // Test recovery soon
    HalfOpenRequests: 2,                // Minimal test requests
}

Use Case: Unstable services, prefer fast-fail over retries

Balanced (Default)

config := &Config{
    FailureThreshold: 5,
    SuccessThreshold: 2,
    Timeout:          60 * time.Second,
    HalfOpenRequests: 3,
}

Use Case: Most production scenarios

Best Practices

Configuration

DO:

  • Use separate circuit breakers per upstream
  • Configure thresholds based on upstream SLA
  • Monitor circuit breaker state transitions
  • Tune based on observed failure patterns

DON'T:

  • Use same circuit breaker for multiple upstreams
  • Set failure threshold too low (causes false positives)
  • Set timeout too long (delayed recovery)
  • Ignore circuit breaker metrics

Error Classification

DO:

  • Treat network errors as failures
  • Treat 5xx responses as failures
  • Treat timeouts as failures
  • Reset failure count on success

DON'T:

  • Treat 4xx as failures (client errors)
  • Treat slow responses as success (use timeouts)
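
These rules can be captured in a small helper; this is a sketch, with classifyFailure as an assumed name:

// classifyFailure reports whether an outcome should count against the breaker.
// Network errors and timeouts count, as do 5xx responses; 4xx client errors do not.
func classifyFailure(resp *http.Response, err error) bool {
    if err != nil {
        return true // Network error or timeout (e.g. context deadline exceeded)
    }
    return resp.StatusCode >= 500
}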

Recovery Testing

DO:

  • Limit concurrent requests in half-open
  • Transition to closed quickly on success
  • Transition to open immediately on failure
  • Log state transitions for debugging

DON'T:

  • Allow unlimited requests in half-open
  • Require too many successes to close
  • Stay in half-open too long

Monitoring

Key Metrics:

  • Circuit breaker state per upstream
  • Request rejection rate
  • State transition frequency
  • Time spent in each state

Alerts:

  • Circuit breaker open for > 5 minutes
  • Frequent open/close oscillation (indicates threshold tuning needed)
  • High request rejection rate

Code References

Component        File               Purpose
CircuitBreaker   breaker_impl.go    Core circuit breaker logic
Manager          manager.go         Per-upstream breaker management
Config           config.go          Configuration structures
State            state.go           State enum and helpers
Metrics          metrics.go         Prometheus metrics