Health Monitoring & Recovery
Easy AppServer provides comprehensive health monitoring for deployed applications through multiple health check mechanisms, with automatic recovery, graceful degradation, and operational visibility.
Overview
The platform monitors health at multiple levels: HTTP endpoints for the appserver itself, Docker container health for deployed apps, and stream activity tracking for long-lived streams (activities and hooks). When unhealthy conditions are detected, automated recovery procedures attempt to restore service.
HTTP Health Endpoints
Code Reference: pkg/v2/presentation/http/handlers/health.go:11
/health - Liveness Probe
Indicates whether the appserver process is alive:
GET /health
Response 200 OK:
{
  "status": "healthy",
  "uptime": 3600.5
}
Use Cases:
- Kubernetes liveness probe
- Load balancer health checks
- Monitoring system alerts
Implementation:
func (h *HealthHandler) Health(w http.ResponseWriter, r *http.Request) {
    response := map[string]interface{}{
        "status": "healthy",
        "uptime": time.Since(h.startTime).Seconds(),
    }
    w.WriteHeader(http.StatusOK)
    json.NewEncoder(w).Encode(response)
}
/ready - Readiness Probe
Indicates whether the appserver can handle traffic:
GET /ready
Response 200 OK:
{
  "status": "ready"
}
Response 200 OK (future enhancement, with dependency checks):
{
  "status": "ready",
  "checks": {
    "database": "ok",
    "rabbitmq": "ok",
    "openfga": "ok",
    "redis": "ok"
  }
}
Use Cases:
- Kubernetes readiness probe
- Rolling deployment coordination
- Traffic routing decisions
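The dependency checks shown in the future-enhancement response above could be wired roughly as follows. This is a minimal sketch, not the current handler: the Checker interface, ReadyHandler type, and field names are assumptions.
package handlers

import (
    "context"
    "encoding/json"
    "net/http"
)

// Checker is a hypothetical probe for a single dependency (database, rabbitmq, ...).
type Checker interface {
    Name() string
    Check(ctx context.Context) error
}

// ReadyHandler aggregates dependency checks into the /ready response.
type ReadyHandler struct {
    checkers []Checker
}

func (h *ReadyHandler) Ready(w http.ResponseWriter, r *http.Request) {
    checks := map[string]string{}
    status, code := "ready", http.StatusOK

    for _, c := range h.checkers {
        if err := c.Check(r.Context()); err != nil {
            checks[c.Name()] = err.Error()
            status, code = "not_ready", http.StatusServiceUnavailable
        } else {
            checks[c.Name()] = "ok"
        }
    }

    w.Header().Set("Content-Type", "application/json")
    w.WriteHeader(code)
    json.NewEncoder(w).Encode(map[string]interface{}{
        "status": status,
        "checks": checks,
    })
}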
Docker Health Monitor
Monitors deployed Docker containers and performs auto-recovery:
Code Reference: pkg/v2/application/orchestration/health_monitor.go:16
Monitor Loop
type HealthMonitor struct {
    deploymentRepo repository.DeploymentRepository
    dockerClient   docker.Client
    eventBus       event.Bus
    config         config.DockerConfig
    logger         telemetry.Logger
}
func (h *HealthMonitor) Start(ctx context.Context) {
    ticker := time.NewTicker(h.config.HealthCheckInterval)
    for {
        select {
        case <-ticker.C:
            h.performHealthChecks(ctx)
        case <-ctx.Done():
            return
        }
    }
}
Configuration:
Health Check Interval: 30 seconds (default)
Max Restarts: 3 per deployment
Restart Delay: Exponential backoff (2s, 4s, 8s)
Health Check Process
Health Monitor Cycle:
├─ Query healthy deployments from repository
├─ Inspect each container via Docker API
├─ Check container health status
│ ├─ Healthy: No action
│ ├─ Unhealthy: Mark as unhealthy, attempt restart
│ └─ Exited/Stopped: Restart immediately
├─ Check unhealthy deployments for recovery
├─ Update deployment states
└─ Publish health events
Container Health States
Healthy States:
- running (with healthy health check)
- running (no health check configured)
Unhealthy States:
- unhealthy (health check failing)
- exited
- dead
- restarting (stuck)
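Putting the cycle and the state classification together, a single monitoring pass might look like the sketch below. The containerStatus struct, inspector interface, and recoverFn callback are simplified stand-ins for the real Docker client and repository types in pkg/v2.
package orchestration

import (
    "context"
    "log"
)

// containerStatus is a simplified view of what a Docker inspect call returns.
type containerStatus struct {
    State  string // "running", "exited", "dead", "restarting"
    Health string // "healthy", "unhealthy", or "" when no health check is configured
}

// isHealthy implements the state classification listed above.
func isHealthy(s containerStatus) bool {
    // Running with a passing health check, or running with no health check at all.
    return s.State == "running" && (s.Health == "healthy" || s.Health == "")
}

type inspector interface {
    Inspect(ctx context.Context, containerID string) (containerStatus, error)
}

// performHealthChecks runs one monitoring pass over the given containers.
func performHealthChecks(ctx context.Context, containerIDs []string, docker inspector,
    recoverFn func(ctx context.Context, containerID string)) {
    for _, id := range containerIDs {
        status, err := docker.Inspect(ctx, id)
        if err != nil {
            log.Printf("inspect failed for %s: %v", id, err)
            continue
        }
        if isHealthy(status) {
            continue // healthy: no action
        }
        // Unhealthy, exited, dead, or stuck restarting: hand off to recovery.
        recoverFn(ctx, id)
    }
}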
Auto-Recovery
When a container becomes unhealthy:
1. Mark deployment as StateUnhealthy
2. Check restart count < MaxRestarts
3. Stop existing container
4. Remove container
5. Create new container from same image
6. Start container
7. Increment restart count
8. Update deployment state to StateHealthy
9. Publish deployment.recovered event
If max restarts are exceeded:
1. Mark deployment as StateFailed
2. Stop recovery attempts
3. Publish deployment.failed event
4. Alert operators
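A condensed sketch of this recovery flow, including the restart cap and the 2s/4s/8s backoff, is shown below. The deployment struct, dockerClient interface, and state constants are illustrative simplifications; the real health monitor also persists state and publishes deployment.recovered / deployment.failed events via the event bus.
package orchestration

import (
    "context"
    "fmt"
    "time"
)

const (
    StateHealthy   = "healthy"
    StateUnhealthy = "unhealthy"
    StateFailed    = "failed"
)

type deployment struct {
    ID           string
    ContainerID  string
    Image        string
    RestartCount int
    State        string
}

type dockerClient interface {
    Stop(ctx context.Context, containerID string) error
    Remove(ctx context.Context, containerID string) error
    CreateAndStart(ctx context.Context, image string) (containerID string, err error)
}

const maxRestarts = 3

// recoverDeployment replaces an unhealthy deployment's container, respecting
// the restart cap and the exponential backoff schedule (2s, 4s, 8s).
func recoverDeployment(ctx context.Context, d *deployment, dc dockerClient) error {
    d.State = StateUnhealthy

    if d.RestartCount >= maxRestarts {
        d.State = StateFailed // operators are alerted via a deployment.failed event
        return fmt.Errorf("deployment %s exceeded max restarts", d.ID)
    }

    // Exponential backoff: 2s for the first restart, then 4s, then 8s.
    delay := time.Duration(1<<d.RestartCount) * 2 * time.Second
    select {
    case <-time.After(delay):
    case <-ctx.Done():
        return ctx.Err()
    }

    // Replace the container: stop, remove, recreate from the same image, start.
    if err := dc.Stop(ctx, d.ContainerID); err != nil {
        return err
    }
    if err := dc.Remove(ctx, d.ContainerID); err != nil {
        return err
    }
    newID, err := dc.CreateAndStart(ctx, d.Image)
    if err != nil {
        return err
    }

    d.ContainerID = newID
    d.RestartCount++
    d.State = StateHealthy // a deployment.recovered event would be published here
    return nil
}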
Activity Heartbeat Monitor
Long-lived activity handlers track liveness through stream activity:
Code Reference: pkg/v2/application/activity/activity_service.go:108
Heartbeat Mechanism
func (as *ActivityService) StartHeartbeatMonitor(
    ctx context.Context,
    cleanupInterval time.Duration,
    staleTimeout time.Duration,
) {
    as.streamManager.StartHeartbeatMonitor(ctx, cleanupInterval, staleTimeout)
}
Configuration:
Cleanup Interval: 60 seconds (how often the server sweeps for stale handlers)
Stale Timeout: 2 minutes (inactivity after which a handler is considered dead)
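For example, starting the monitor with these defaults might look like the following usage sketch (it assumes as is an already-constructed *ActivityService and ctx is a long-lived server context):
// Sweep every 60 seconds; treat handlers idle for more than 2 minutes as stale.
as.StartHeartbeatMonitor(ctx, 60*time.Second, 2*time.Minute)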
Handler Tracking
type ActivityHandler struct {
    ID            string
    ActivityName  string
    AppName       string
    Stream        interface{}
    RegisteredAt  time.Time
    LastHeartbeat time.Time // Updated when responses are sent
}
Stale Handler Cleanup
Heartbeat Monitor:
├─ Every 60 seconds:
│ ├─ Iterate all registered handlers
│ ├─ Check time since LastHeartbeat
│ ├─ If > 2 minutes:
│ │ ├─ Mark as stale
│ │ ├─ Unregister handler
│ │ ├─ Close stream
│ │ └─ Log warning
│ └─ Continue
Liveness Tracking: A handler's LastHeartbeat timestamp is updated whenever it sends a response on its stream. There is no dedicated heartbeat RPC: UpdateHandlerHeartbeat is server-internal (pkg/v2/application/activity/activity_service.go:88-95) and has no client-facing gRPC endpoint in easy.proto/v2/protos/services.proto:265-313.
Handlers remain active as long as they process requests and send responses. Inactive handlers exceeding the stale timeout are automatically cleaned up.
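The sweep described above can be sketched as follows. The handler and registry types here are simplified; the real stream manager tracks additional per-handler state and logs through the platform's telemetry logger.
package activity

import (
    "context"
    "log"
    "sync"
    "time"
)

// handler is a simplified view of a registered activity handler.
type handler struct {
    id            string
    lastHeartbeat time.Time
    closeStream   func() // closes the underlying gRPC stream
}

type registry struct {
    mu       sync.Mutex
    handlers map[string]*handler
}

// sweepStale unregisters every handler whose last activity is older than staleTimeout.
func (r *registry) sweepStale(staleTimeout time.Duration) {
    r.mu.Lock()
    defer r.mu.Unlock()
    now := time.Now()
    for id, h := range r.handlers {
        if now.Sub(h.lastHeartbeat) > staleTimeout {
            log.Printf("handler %s stale (last activity %s ago), cleaning up", id, now.Sub(h.lastHeartbeat))
            h.closeStream()        // close the stream
            delete(r.handlers, id) // unregister
        }
    }
}

// runHeartbeatMonitor runs the sweep on every cleanup interval until ctx is done.
func (r *registry) runHeartbeatMonitor(ctx context.Context, cleanupInterval, staleTimeout time.Duration) {
    ticker := time.NewTicker(cleanupInterval)
    defer ticker.Stop()
    for {
        select {
        case <-ticker.C:
            r.sweepStale(staleTimeout)
        case <-ctx.Done():
            return
        }
    }
}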
Hook Stream Health
Hook listeners track liveness through trigger responses:
Liveness Tracking
Hook Stream:
├─ Server tracks LastHeartbeat per listener
├─ Timestamp updated when trigger responses received
├─ If > 2min since LastHeartbeat:
│ ├─ Consider stream dead
│ ├─ Unregister listener
│ └─ Clean up resources
Hook heartbeats work via trigger response activity (pkg/v2/application/hooks/hooks_service.go:150-230), not dedicated heartbeat messages. Listeners stay alive by responding to hook triggers. There is no separate keep-alive mechanism.
Disconnection Detection
Stream Monitoring:
├─ Detect stream errors (network, EOF)
├─ Unregister listener immediately
├─ Publish listener.disconnected event
└─ Client responsible for reconnection
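A server-side receive loop covering both behaviors, refreshing LastHeartbeat on each trigger response and unregistering on stream errors, might look like this sketch. The listenerStream interface, triggerResponse type, and callback signatures are assumptions, not the actual hooks service API.
package hooks

import (
    "errors"
    "io"
    "log"
    "time"
)

// triggerResponse is a simplified stand-in for the hook trigger response message.
type triggerResponse struct {
    HookName string
}

// listenerStream is a simplified stand-in for the server side of the hook stream.
type listenerStream interface {
    Recv() (*triggerResponse, error)
}

type listener struct {
    id            string
    lastHeartbeat time.Time
}

// receiveLoop consumes trigger responses until the stream breaks, refreshing
// the listener's heartbeat on each response and cleaning up on disconnect.
func receiveLoop(stream listenerStream, l *listener, unregister func(id string), publish func(event, id string)) {
    for {
        resp, err := stream.Recv()
        if err != nil {
            if errors.Is(err, io.EOF) {
                log.Printf("listener %s closed its stream", l.id)
            } else {
                log.Printf("listener %s stream error: %v", l.id, err)
            }
            unregister(l.id)                       // stop routing triggers to this listener
            publish("listener.disconnected", l.id) // the client is responsible for reconnecting
            return
        }
        // Any trigger response counts as liveness: refresh the heartbeat.
        l.lastHeartbeat = time.Now()
        log.Printf("listener %s responded to hook %s", l.id, resp.HookName)
    }
}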
Graceful Degradation
When dependencies become unavailable:
Database Unavailable
Impact:
- Queries fail
- Apps cannot be installed/uninstalled
- Settings cannot be updated
Degradation:
- Return cached data where possible
- Accept writes to queue (future)
- Return 503 Service Unavailable for mutations
RabbitMQ Unavailable
Impact:
- Events not published
- Cache invalidation delayed
- Subscriptions not working
Degradation:
- Fall back to in-memory event bus (local only)
- Disable distributed features
- Log warnings
OpenFGA Unavailable
Impact:
- Permission checks fail
- Cannot authorize operations
Degradation:
- Deny all permission checks (fail-secure)
- Return 503 for protected operations
- Alert operators
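The fail-secure behavior can be illustrated with a small sketch. The Authorizer interface is hypothetical; the key point is that any error from OpenFGA results in a denial that the HTTP layer can map to 503, never an implicit allow.
package authz

import (
    "context"
    "errors"
)

// Authorizer is a hypothetical wrapper around the OpenFGA check API.
type Authorizer interface {
    Check(ctx context.Context, user, relation, object string) (bool, error)
}

// ErrAuthzUnavailable signals that the caller should respond with 503.
var ErrAuthzUnavailable = errors.New("authorization service unavailable")

// CheckFailSecure denies on any error from the authorization backend instead of falling open.
func CheckFailSecure(ctx context.Context, a Authorizer, user, relation, object string) (bool, error) {
    allowed, err := a.Check(ctx, user, relation, object)
    if err != nil {
        // Fail-secure: timeouts and connection errors become a denial, and the
        // HTTP layer translates ErrAuthzUnavailable into 503 Service Unavailable.
        return false, ErrAuthzUnavailable
    }
    return allowed, nil
}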
Redis Unavailable
Impact:
- L2 permission cache unavailable
- Slower permission checks
Degradation:
- Use only L1 (in-memory) cache
- Query OpenFGA directly more often
- Acceptable performance degradation
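A sketch of this fallback, with hypothetical cacheLayer and fgaClient interfaces: checks try the in-memory L1 cache first, treat any Redis (L2) error as a miss, and fall back to querying OpenFGA directly.
package permissions

import "context"

// cacheLayer is a hypothetical cache abstraction shared by L1 (in-memory) and L2 (Redis).
type cacheLayer interface {
    Get(ctx context.Context, key string) (allowed bool, found bool, err error)
    Set(ctx context.Context, key string, allowed bool) error
}

// fgaClient is a hypothetical, minimal OpenFGA check client.
type fgaClient interface {
    Check(ctx context.Context, key string) (bool, error)
}

type permissionCache struct {
    l1  cacheLayer // in-memory, always available
    l2  cacheLayer // Redis; may be unreachable
    fga fgaClient
}

// Check tries L1, then L2, then OpenFGA. A Redis error is treated as a miss,
// so an unavailable L2 only costs extra OpenFGA queries, not failed checks.
func (c *permissionCache) Check(ctx context.Context, key string) (bool, error) {
    if allowed, found, err := c.l1.Get(ctx, key); err == nil && found {
        return allowed, nil
    }
    if allowed, found, err := c.l2.Get(ctx, key); err == nil && found {
        return allowed, nil
    }
    allowed, err := c.fga.Check(ctx, key)
    if err != nil {
        return false, err
    }
    _ = c.l1.Set(ctx, key, allowed) // repopulate L1 so repeated checks stay fast
    return allowed, nil
}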
Metrics & Observability
Exported Metrics
(Planned, not yet exported):
Appserver Metrics:
- appserver_http_requests_total
- appserver_http_request_duration_seconds
- appserver_grpc_requests_total
- appserver_grpc_request_duration_seconds
- appserver_cache_hits_total
- appserver_cache_misses_total
Health Metrics:
- appserver_deployment_health_checks_total
- appserver_deployment_restarts_total
- appserver_activity_handlers_active
- appserver_hook_listeners_active
- appserver_stale_handlers_cleaned_total
Health Events
Published via event bus:
Events:
- deployment.unhealthy
- deployment.recovered
- deployment.failed
- handler.stale
- handler.reconnected
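Operators can subscribe to these events for alerting. The sketch below assumes a hypothetical Bus.Subscribe API; the real pkg/v2 event bus interface may differ.
package alerts

import (
    "context"
    "log"
)

// Event and Bus are hypothetical simplifications of the platform event bus.
type Event struct {
    Name    string
    Payload map[string]interface{}
}

type Bus interface {
    Subscribe(ctx context.Context, name string, fn func(Event)) error
}

// WireHealthAlerts attaches alerting callbacks to the health events.
func WireHealthAlerts(ctx context.Context, bus Bus) error {
    // deployment.failed means the restart cap was exceeded; page an operator.
    if err := bus.Subscribe(ctx, "deployment.failed", func(e Event) {
        log.Printf("ALERT: deployment failed permanently: %v", e.Payload)
    }); err != nil {
        return err
    }
    // deployment.unhealthy and handler.stale are lower-severity signals worth tracking.
    return bus.Subscribe(ctx, "deployment.unhealthy", func(e Event) {
        log.Printf("warning: deployment unhealthy: %v", e.Payload)
    })
}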
Best Practices
For Platform Operators
Configure Appropriate Intervals:
docker:
  healthCheckInterval: 30s   # Balance responsiveness vs load
  maxRestarts: 3             # Prevent infinite restart loops
activity:
  cleanupInterval: 60s       # How often to check for stale handlers
  staleTimeout: 2m           # Inactivity period before cleanup
Monitor Health Dashboards:
- Deployment health status
- Restart frequency (alert if > 1/hour/app)
- Stale handler cleanup rate
- Dependency availability
Set Up Alerts:
- Deployment failed (max restarts exceeded)
- Multiple deployments unhealthy
- Database/RabbitMQ unavailable
- High stale handler rate
For App Developers
Implement Health Checks in Containers:
HEALTHCHECK --interval=30s --timeout=3s --start-period=40s --retries=3 \
CMD curl -f http://localhost:8080/health || exit 1
Maintain Stream Activity:
Activity and hook handlers stay alive through regular stream activity (sending responses). Handlers that remain idle for extended periods (exceeding the stale timeout) are automatically cleaned up. Ensure handlers process requests regularly or implement reconnection logic if needed.
Handle Reconnections:
function connectActivityHandler() {
    const stream = activityClient.RegisterActivity();
    stream.on('error', (err) => {
        logger.error('Stream error', err);
        setTimeout(connectActivityHandler, 5000); // Retry after 5s
    });
    stream.on('end', () => {
        logger.info('Stream ended');
        setTimeout(connectActivityHandler, 1000); // Reconnect after 1s
    });
}
Troubleshooting
Deployment Keeps Restarting
Problem: Container restarts repeatedly
Diagnosis:
1. Check container logs: docker logs <container-id>
2. Check health check definition in Dockerfile
3. Review application startup time
4. Check resource limits (CPU, memory)
Solutions:
- Increase health check start period
- Fix application startup issues
- Adjust resource limits
- Review health check endpoint
Handler Marked as Stale
Problem: Activity/Hook handler cleaned up as stale
Diagnosis:
1. Check handler is sending responses regularly
2. Review network connectivity
3. Check server stale timeout config
Solutions:
- Ensure handler processes requests and sends responses
- Fix network issues preventing stream activity
- Increase the stale timeout if legitimate processing delays exceed it
- Implement reconnection logic for long-idle periods
Deployment Marked as Failed
Problem: Deployment entered StateFailed after max restarts
Diagnosis:
1. Review deployment restart history
2. Check application logs
3. Identify root cause of failures
Solutions:
- Fix application issue
- Increase max restarts if the failures are transient
- Manually redeploy after fix
Related Concepts
- Application Lifecycle - Deployment states
- App State Machine - State transitions
- Activities & Background Workflows - Activity heartbeats
- Hooks Architecture - Hook stream health
- Platform Architecture - Dependency health
Further Reading
- Kubernetes Health Checks - Health check patterns
- Docker Health Checks - Container health