Health Monitoring & Recovery

Easy AppServer implements comprehensive health monitoring for deployed applications with automatic recovery, graceful degradation, and operational visibility through multiple health check mechanisms.

Overview

The platform monitors health at multiple levels: HTTP endpoints for the appserver itself, Docker container health for deployed apps, and stream activity tracking for long-lived streams (activities and hooks). When unhealthy conditions are detected, automated recovery procedures attempt to restore service.

HTTP Health Endpoints

Code Reference: pkg/v2/presentation/http/handlers/health.go:11

/health - Liveness Probe

Indicates whether the appserver process is alive:

GET /health

Response 200 OK:
{
  "status": "healthy",
  "uptime": 3600.5
}

Use Cases:

  • Kubernetes liveness probe
  • Load balancer health checks
  • Monitoring system alerts

Implementation:

func (h *HealthHandler) Health(w http.ResponseWriter, r *http.Request) {
    response := map[string]interface{}{
        "status": "healthy",
        "uptime": time.Since(h.startTime).Seconds(),
    }
    w.WriteHeader(http.StatusOK)
    json.NewEncoder(w).Encode(response)
}

/ready - Readiness Probe

Indicates whether the appserver can handle traffic:

GET /ready

Response 200 OK:
{
  "status": "ready"
}

Planned response with per-dependency checks (future enhancement):

{
  "status": "ready",
  "checks": {
    "database": "ok",
    "rabbitmq": "ok",
    "openfga": "ok",
    "redis": "ok"
  }
}
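
A sketch of how such per-dependency checks could be aggregated once they exist; the ping functions (pingDatabase, pingRabbitMQ, pingOpenFGA, pingRedis) are hypothetical and not part of the current HealthHandler:

func (h *HealthHandler) Ready(w http.ResponseWriter, r *http.Request) {
    // Hypothetical dependency pings; each returns an error when the
    // dependency is unreachable.
    pings := map[string]func(context.Context) error{
        "database": h.pingDatabase,
        "rabbitmq": h.pingRabbitMQ,
        "openfga":  h.pingOpenFGA,
        "redis":    h.pingRedis,
    }

    checks := make(map[string]string, len(pings))
    code := http.StatusOK
    for name, ping := range pings {
        if err := ping(r.Context()); err != nil {
            checks[name] = "unavailable"
            code = http.StatusServiceUnavailable
            continue
        }
        checks[name] = "ok"
    }

    status := "ready"
    if code != http.StatusOK {
        status = "not ready"
    }

    w.Header().Set("Content-Type", "application/json")
    w.WriteHeader(code)
    json.NewEncoder(w).Encode(map[string]interface{}{
        "status": status,
        "checks": checks,
    })
}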

Use Cases:

  • Kubernetes readiness probe
  • Rolling deployment coordination
  • Traffic routing decisions

Docker Health Monitor

Monitors deployed Docker containers and performs auto-recovery:

Code Reference: pkg/v2/application/orchestration/health_monitor.go:16

Monitor Loop

type HealthMonitor struct {
    deploymentRepo repository.DeploymentRepository
    dockerClient   docker.Client
    eventBus       event.Bus
    config         config.DockerConfig
    logger         telemetry.Logger
}

func (h *HealthMonitor) Start(ctx context.Context) {
    ticker := time.NewTicker(h.config.HealthCheckInterval)
    for {
        select {
        case <-ticker.C:
            h.performHealthChecks(ctx)
        case <-ctx.Done():
            return
        }
    }
}

Configuration:

Health Check Interval: 30 seconds (default)
Max Restarts: 3 per deployment
Restart Delay: Exponential backoff (2s, 4s, 8s)
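
The documented backoff schedule can be derived directly from the restart count, as in this illustrative helper (not a function from the codebase):

// restartDelay returns the documented exponential backoff (2s, 4s, 8s)
// for restart attempts 1..3. Illustrative only.
func restartDelay(attempt int) time.Duration {
    base := 2 * time.Second
    return base << (attempt - 1)
}

// Usage inside the monitor loop (sketch):
//   time.Sleep(restartDelay(deployment.RestartCount + 1))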

Health Check Process

Health Monitor Cycle:
├─ Query healthy deployments from repository
├─ Inspect each container via Docker API
├─ Check container health status
│ ├─ Healthy: No action
│ ├─ Unhealthy: Mark as unhealthy, attempt restart
│ └─ Exited/Stopped: Restart immediately
├─ Check unhealthy deployments for recovery
├─ Update deployment states
└─ Publish health events
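
A hedged sketch of one such cycle; the repository and Docker client method names (ListHealthy, InspectContainerState) are assumptions for illustration, not the actual interfaces in health_monitor.go:

func (h *HealthMonitor) performHealthChecks(ctx context.Context) {
    deployments, err := h.deploymentRepo.ListHealthy(ctx) // assumed method
    if err != nil {
        h.logger.Error("listing deployments failed", "error", err)
        return
    }

    for _, d := range deployments {
        state, err := h.dockerClient.InspectContainerState(ctx, d.ContainerID) // assumed method
        if err != nil || !state.Healthy() {
            h.logger.Warn("container unhealthy", "deployment", d.ID)
            h.recover(ctx, d) // mark unhealthy, restart, publish events (see Auto-Recovery)
            continue
        }
        // Healthy: no action required.
    }
}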

Container Health States

Healthy States:
- running (health check passing)
- running (no health check configured)

Unhealthy States:
- unhealthy (health check failing)
- exited
- dead
- restarting (stuck in a restart loop)
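
A small illustrative classifier over Docker's inspect output (the status and health strings match Docker's API; the helper itself is not from the codebase):

// isHealthy maps a container's status and health-check state to the
// healthy/unhealthy buckets listed above.
func isHealthy(status, health string) bool {
    switch status {
    case "running":
        // Healthy if the health check passes or none is configured.
        return health == "healthy" || health == "" || health == "none"
    case "exited", "dead", "restarting":
        // A real check would distinguish a transient restart from a stuck one.
        return false
    default:
        return false
    }
}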

Auto-Recovery

When a container becomes unhealthy, the monitor runs the following recovery sequence (a code sketch follows the lists below):

1. Mark deployment as StateUnhealthy
2. Check restart count < MaxRestarts
3. Stop existing container
4. Remove container
5. Create new container from same image
6. Start container
7. Increment restart count
8. Update deployment state to StateHealthy
9. Publish deployment.recovered event

If the maximum restart count is exceeded:

1. Mark deployment as StateFailed
2. Stop recovery attempts
3. Publish deployment.failed event
4. Alert operators
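
A code sketch of both paths; the Deployment type, config field, and the Docker client, repository, and event bus method names are assumptions for illustration:

func (h *HealthMonitor) recover(ctx context.Context, d *Deployment) {
    d.State = StateUnhealthy

    if d.RestartCount >= h.config.MaxRestarts { // MaxRestarts field assumed
        d.State = StateFailed
        h.deploymentRepo.Update(ctx, d)                    // assumed method
        h.eventBus.Publish(ctx, "deployment.failed", d.ID) // assumed signature
        return
    }

    // Replace the container with a fresh one from the same image.
    h.dockerClient.StopContainer(ctx, d.ContainerID)        // assumed method
    h.dockerClient.RemoveContainer(ctx, d.ContainerID)      // assumed method
    newID, err := h.dockerClient.RunContainer(ctx, d.Image) // assumed method
    if err != nil {
        h.logger.Error("restart failed", "deployment", d.ID, "error", err)
        return
    }

    d.ContainerID = newID
    d.RestartCount++
    d.State = StateHealthy
    h.deploymentRepo.Update(ctx, d)                       // assumed method
    h.eventBus.Publish(ctx, "deployment.recovered", d.ID) // assumed signature
}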

Activity Heartbeat Monitor

Long-lived activity handlers track liveness through stream activity:

Code Reference: pkg/v2/application/activity/activity_service.go:108

Heartbeat Mechanism

func (as *ActivityService) StartHeartbeatMonitor(
    ctx context.Context,
    cleanupInterval time.Duration,
    staleTimeout time.Duration,
) {
    as.streamManager.StartHeartbeatMonitor(ctx, cleanupInterval, staleTimeout)
}

Configuration:

Cleanup Interval: 60 seconds (server checks)
Stale Timeout: 2 minutes (handler considered dead)
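
Wiring the documented defaults into the call above might look like this (the call site and variable name are assumptions):

activityService.StartHeartbeatMonitor(ctx, 60*time.Second, 2*time.Minute)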

Handler Tracking

type ActivityHandler struct {
    ID            string
    ActivityName  string
    AppName       string
    Stream        interface{}
    RegisteredAt  time.Time
    LastHeartbeat time.Time // Updated when responses are sent
}

Stale Handler Cleanup

Heartbeat Monitor:
├─ Every 60 seconds:
│ ├─ Iterate all registered handlers
│ ├─ Check time since LastHeartbeat
│ ├─ If > 2 minutes:
│ │ ├─ Mark as stale
│ │ ├─ Unregister handler
│ │ ├─ Close stream
│ │ └─ Log warning
│ └─ Continue
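
A sketch of that cleanup pass; the stream manager type, its fields, and the handler registry shape are assumptions for illustration:

func (sm *StreamManager) cleanupStaleHandlers(staleTimeout time.Duration) {
    sm.mu.Lock()
    defer sm.mu.Unlock()

    now := time.Now()
    for id, h := range sm.handlers {
        if now.Sub(h.LastHeartbeat) <= staleTimeout {
            continue
        }
        sm.logger.Warn("removing stale activity handler",
            "handler", id, "activity", h.ActivityName, "app", h.AppName)
        delete(sm.handlers, id) // unregister; the stream is closed separately
    }
}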

Implementation Status

Liveness Tracking: Handler LastHeartbeat timestamps are updated when handlers send responses on their streams. There is no dedicated heartbeat RPC (UpdateHandlerHeartbeat is server-internal at pkg/v2/application/activity/activity_service.go:88-95 with no client-facing gRPC endpoint in easy.proto/v2/protos/services.proto:265-313).

Handlers remain active as long as they process requests and send responses. Inactive handlers exceeding the stale timeout are automatically cleaned up.

Hook Stream Health

Hook listeners track liveness through trigger responses:

Liveness Tracking

Hook Stream:
├─ Server tracks LastHeartbeat per listener
├─ Timestamp updated when trigger responses received
├─ If > 2min since LastHeartbeat:
│ ├─ Consider stream dead
│ ├─ Unregister listener
│ └─ Clean up resources

Implementation Status

Hook heartbeats work via trigger response activity (pkg/v2/application/hooks/hooks_service.go:150-230), not dedicated heartbeat messages. Listeners stay alive by responding to hook triggers. There is no separate keep-alive mechanism.

Disconnection Detection

Stream Monitoring:
├─ Detect stream errors (network, EOF)
├─ Unregister listener immediately
├─ Publish listener.disconnected event
└─ Client responsible for reconnection
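
A server-side sketch of this detection; the service type and its method names are assumptions, and gRPC surfaces a broken or closed stream as an error (including io.EOF) from Recv:

func (hs *HooksService) receiveLoop(ctx context.Context, listenerID string, stream TriggerStream) error {
    for {
        resp, err := stream.Recv()
        if err != nil {
            // Network error or EOF: the listener is gone.
            hs.unregisterListener(listenerID)                             // assumed method
            hs.eventBus.Publish(ctx, "listener.disconnected", listenerID) // assumed signature
            return err
        }
        hs.handleTriggerResponse(ctx, listenerID, resp) // also refreshes LastHeartbeat
    }
}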

Graceful Degradation

When dependencies become unavailable:

Database Unavailable

Impact:
- Queries fail
- Apps cannot be installed/uninstalled
- Settings cannot be updated

Degradation:
- Return cached data where possible
- Accept writes to queue (future)
- Return 503 Service Unavailable for mutations

RabbitMQ Unavailable

Impact:
- Events not published
- Cache invalidation delayed
- Subscriptions not working

Degradation:
- Fall back to in-memory event bus (local only)
- Disable distributed features
- Log warnings

OpenFGA Unavailable

Impact:
- Permission checks fail
- Cannot authorize operations

Degradation:
- Deny all permission checks (fail-secure)
- Return 503 for protected operations
- Alert operators
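
A fail-secure sketch of that behavior; the authorizer type and OpenFGA client call are illustrative, not the actual implementation:

func (a *Authorizer) Check(ctx context.Context, user, relation, object string) (bool, error) {
    allowed, err := a.openfga.Check(ctx, user, relation, object) // assumed client call
    if err != nil {
        a.logger.Error("openfga unavailable, denying check", "error", err)
        return false, err // deny by default; callers map this to 503
    }
    return allowed, nil
}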

Redis Unavailable

Impact:
- L2 permission cache unavailable
- Slower permission checks

Degradation:
- Use only L1 (in-memory) cache
- Query OpenFGA directly more often
- Acceptable performance degradation
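
A sketch of the fallback order (L1 in-memory, then Redis if reachable, then OpenFGA as the source of truth); all type and method names are illustrative:

func (c *PermissionCache) Get(ctx context.Context, key string) (bool, error) {
    if v, ok := c.l1.Get(key); ok { // in-memory, always available
        return v, nil
    }
    if c.redisAvailable() { // skip L2 entirely when Redis is down
        if v, ok, err := c.l2.Get(ctx, key); err == nil && ok {
            c.l1.Set(key, v)
            return v, nil
        }
    }
    v, err := c.queryOpenFGA(ctx, key) // slower direct check
    if err != nil {
        return false, err
    }
    c.l1.Set(key, v)
    return v, nil
}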

Metrics & Observability

Exported Metrics

Planned (not yet exported):

Appserver Metrics:
- appserver_http_requests_total
- appserver_http_request_duration_seconds
- appserver_grpc_requests_total
- appserver_grpc_request_duration_seconds
- appserver_cache_hits_total
- appserver_cache_misses_total

Health Metrics:
- appserver_deployment_health_checks_total
- appserver_deployment_restarts_total
- appserver_activity_handlers_active
- appserver_hook_listeners_active
- appserver_stale_handlers_cleaned_total

Health Events

Published via event bus:

Events:
- deployment.unhealthy
- deployment.recovered
- deployment.failed
- handler.stale
- handler.reconnected
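
A sketch of consuming these events for alerting; the Subscribe signature and the Alerter client are assumptions, and only the event names come from the list above:

func wireHealthAlerts(bus event.Bus, alerts Alerter) { // Alerter is hypothetical
    bus.Subscribe("deployment.failed", func(ctx context.Context, payload []byte) {
        // Max restarts exceeded: page the on-call operator.
        alerts.Notify(ctx, "deployment failed", payload)
    })
}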

Best Practices

For Platform Operators

Configure Appropriate Intervals:

docker:
  healthCheckInterval: 30s  # Balance responsiveness vs load
  maxRestarts: 3            # Prevent infinite restart loops

activity:
  cleanupInterval: 60s      # How often to check for stale handlers
  staleTimeout: 2m          # Inactivity period before cleanup

Monitor Health Dashboards:

- Deployment health status
- Restart frequency (alert if > 1/hour/app)
- Stale handler cleanup rate
- Dependency availability

Set Up Alerts:

- Deployment failed (max restarts exceeded)
- Multiple deployments unhealthy
- Database/RabbitMQ unavailable
- High stale handler rate

For App Developers

Implement Health Checks in Containers:

HEALTHCHECK --interval=30s --timeout=3s --start-period=40s --retries=3 \
  CMD curl -f http://localhost:8080/health || exit 1
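
The endpoint that HEALTHCHECK curls can be minimal; a sketch for a Go app listening on port 8080 (adapt to your app's language and framework):

package main

import (
    "fmt"
    "log"
    "net/http"
)

func main() {
    // Minimal endpoint for the Dockerfile HEALTHCHECK above.
    http.HandleFunc("/health", func(w http.ResponseWriter, r *http.Request) {
        w.WriteHeader(http.StatusOK)
        fmt.Fprintln(w, "ok")
    })
    log.Fatal(http.ListenAndServe(":8080", nil))
}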

Maintain Stream Activity:

Activity and hook handlers stay alive through regular stream activity (sending responses). Handlers that remain idle for extended periods (exceeding the stale timeout) are automatically cleaned up. Ensure handlers process requests regularly or implement reconnection logic if needed.

Handle Reconnections:

function connectActivityHandler() {
    const stream = activityClient.RegisterActivity();

    stream.on('error', (err) => {
        logger.error('Stream error', err);
        setTimeout(connectActivityHandler, 5000); // Retry after 5s
    });

    stream.on('end', () => {
        logger.info('Stream ended');
        setTimeout(connectActivityHandler, 1000); // Reconnect after 1s
    });
}

Troubleshooting

Deployment Keeps Restarting

Problem: Container restarts repeatedly

Diagnosis:
1. Check container logs: docker logs <container-id>
2. Check health check definition in Dockerfile
3. Review application startup time
4. Check resource limits (CPU, memory)

Solutions:
- Increase health check start period
- Fix application startup issues
- Adjust resource limits
- Review health check endpoint

Handler Marked as Stale

Problem: Activity/Hook handler cleaned up as stale

Diagnosis:
1. Check handler is sending responses regularly
2. Review network connectivity
3. Check server stale timeout config

Solutions:
- Ensure the handler processes requests and sends responses
- Fix network issues preventing stream activity
- Increase the stale timeout if legitimate processing delays are expected
- Implement reconnection logic for long idle periods

Deployment Marked as Failed

Problem: Deployment entered StateFailed after max restarts

Diagnosis:
1. Review deployment restart history
2. Check application logs
3. Identify root cause of failures

Solutions:
- Fix the underlying application issue
- Increase max restarts if the failures were transient
- Manually redeploy after the fix

Further Reading