Observability Stack
Complete observability setup with OpenTelemetry, Prometheus, Loki, Tempo, and Grafana for metrics, logs, and distributed tracing.
Overview
The Easy AppServer observability stack follows the three pillars of observability:
Metrics (Prometheus):
- Time-series data for monitoring system health
- Aggregated statistics and trends
- Alerting based on thresholds
- RED metrics (Rate, Errors, Duration)
Logs (Loki):
- Event records with context and details
- Debugging and troubleshooting
- Audit trails
- Structured logging with labels
Traces (Tempo):
- End-to-end request flows across services
- Performance bottleneck identification
- Service dependency mapping
- Correlation with metrics and logs
Architecture
┌────────────────────────────────────────────────────────────┐
│                        Applications                        │
│     (AppServer, Shell, Apps, Infrastructure Services)      │
└─────────────────┬──────────────────────────────────────────┘
                  │ OTLP (OpenTelemetry Protocol)
                  │ gRPC:4317 / HTTP:4318
                  ↓
┌────────────────────────────────────────────────────────────┐
│                  OpenTelemetry Collector                   │
│             Receivers → Processors → Exporters             │
└─────┬────────────────┬──────────────────┬──────────────────┘
      │ Metrics        │ Logs             │ Traces
      ↓                ↓                  ↓
 ┌──────────┐     ┌──────────┐       ┌──────────┐
 │Prometheus│     │   Loki   │       │  Tempo   │
 │  :9090   │     │  :3100   │       │  :3200   │
 └──────────┘     └──────────┘       └──────────┘
      └────────────────┬──────────────────┘
                       ↓
                ┌──────────────┐
                │   Grafana    │
                │    :3000     │
                └──────────────┘
OpenTelemetry Collector
Based on docker/observability/otel-collector/otel-config.yml:
Purpose
The OpenTelemetry Collector is the central telemetry hub that:
- Receives telemetry data from applications via OTLP
- Processes and enriches telemetry with additional metadata
- Routes telemetry to appropriate backends (Prometheus, Loki, Tempo)
- Provides vendor-agnostic instrumentation
Configuration
Receivers:
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317   # gRPC receiver
      http:
        endpoint: 0.0.0.0:4318   # HTTP receiver
Processors:
processors:
  # Batch telemetry to reduce outgoing connections
  batch:
    timeout: 10s
    send_batch_size: 1024

  # Add deployment metadata
  resource:
    attributes:
      - key: service.namespace
        value: "appserver"
      - key: deployment.environment
        value: "${DEPLOYMENT_ENVIRONMENT:-development}"

  # Prevent OOM
  memory_limiter:
    check_interval: 1s
    limit_percentage: 80
Exporters:
exporters:
  prometheus:
    endpoint: "0.0.0.0:8889"
    namespace: appserver

  loki:
    endpoint: "http://loki:3100/loki/api/v1/push"

  otlp/tempo:
    endpoint: "tempo:4317"
Pipelines
Traces Pipeline:
OTLP Receiver → Memory Limiter → Batch → Resource → Tempo
Metrics Pipeline:
OTLP Receiver → Memory Limiter → Batch → Resource → Prometheus
Logs Pipeline:
OTLP Receiver → Memory Limiter → Batch → Resource → Loki
Prometheus (Metrics)
Based on docker/observability/prometheus/prometheus.yml:
Configuration
Scrape Targets:
scrape_configs:
  # OpenTelemetry Collector metrics
  - job_name: 'otel-collector'
    scrape_interval: 10s
    static_configs:
      - targets: ['otel-collector:8888']

  # OTel Prometheus exporter
  - job_name: 'otel-collector-prometheus-exporter'
    scrape_interval: 15s
    static_configs:
      - targets: ['otel-collector:8889']

  # Observability stack self-monitoring
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'grafana'
    static_configs:
      - targets: ['grafana:3000']

  - job_name: 'loki'
    static_configs:
      - targets: ['loki:3100']

  - job_name: 'tempo'
    static_configs:
      - targets: ['tempo:3200']

  # Infrastructure services
  - job_name: 'rabbitmq'
    static_configs:
      - targets: ['rabbitmq:15692']
Storage:
storage:
  tsdb:
    retention:
      time: 15d    # Keep metrics for 15 days
      size: 10GB   # Maximum storage size
    wal_compression: true
Key Metrics
AppServer Metrics (when instrumented):
- http_requests_total - Total HTTP requests
- http_request_duration_seconds - Request latency histogram
- http_requests_in_flight - Current concurrent requests
- grpc_server_handled_total - gRPC requests handled
- grpc_server_handling_seconds - gRPC latency
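The exact instrumentation depends on the AppServer's middleware; as an illustrative sketch only (not the AppServer's actual code), HTTP RED metrics with these names could be exposed via prometheus/client_golang roughly like this, where Metrics and statusRecorder are hypothetical names:

import (
    "net/http"
    "strconv"
    "time"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promauto"
)

var (
    httpRequestsTotal = promauto.NewCounterVec(prometheus.CounterOpts{
        Name: "http_requests_total",
        Help: "Total HTTP requests",
    }, []string{"method", "status"})

    httpRequestDuration = promauto.NewHistogramVec(prometheus.HistogramOpts{
        Name:    "http_request_duration_seconds",
        Help:    "Request latency histogram",
        Buckets: prometheus.DefBuckets,
    }, []string{"method"})
)

// statusRecorder captures the status code written by the wrapped handler.
type statusRecorder struct {
    http.ResponseWriter
    status int
}

func (r *statusRecorder) WriteHeader(code int) {
    r.status = code
    r.ResponseWriter.WriteHeader(code)
}

// Metrics wraps an http.Handler and records rate, errors, and duration per request.
func Metrics(next http.Handler) http.Handler {
    return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        rec := &statusRecorder{ResponseWriter: w, status: http.StatusOK}
        start := time.Now()
        next.ServeHTTP(rec, r)
        httpRequestsTotal.WithLabelValues(r.Method, strconv.Itoa(rec.status)).Inc()
        httpRequestDuration.WithLabelValues(r.Method).Observe(time.Since(start).Seconds())
    })
}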
Circuit Breaker Metrics (from pkg/v2/infrastructure/circuitbreaker/metrics.go):
circuit_breaker_state{name="upstream"} 0|1|2 # CLOSED|OPEN|HALF-OPEN
circuit_breaker_requests_total{name="upstream", result="success|failure|rejected"}
circuit_breaker_state_changes_total{name="upstream", from="closed", to="open"}
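The real gauge lives in pkg/v2/infrastructure/circuitbreaker/metrics.go; a simplified sketch of how such state metrics could be maintained with client_golang (illustrative only, recordStateChange is a hypothetical hook):

import (
    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promauto"
)

var (
    // 0 = CLOSED, 1 = OPEN, 2 = HALF-OPEN
    cbState = promauto.NewGaugeVec(prometheus.GaugeOpts{
        Name: "circuit_breaker_state",
        Help: "Current circuit breaker state (0=closed, 1=open, 2=half-open)",
    }, []string{"name"})

    cbStateChanges = promauto.NewCounterVec(prometheus.CounterOpts{
        Name: "circuit_breaker_state_changes_total",
        Help: "Circuit breaker state transitions",
    }, []string{"name", "from", "to"})
)

// recordStateChange would be called from the breaker's state-change callback.
func recordStateChange(name, from, to string, newState float64) {
    cbState.WithLabelValues(name).Set(newState)
    cbStateChanges.WithLabelValues(name, from, to).Inc()
}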
Event Bus Metrics:
- rabbitmq_queue_messages - Queue depth
- rabbitmq_queue_consumers - Active consumers
- eventbus_messages_published_total - Published events
- eventbus_messages_consumed_total - Consumed events
Database Metrics (via PostgreSQL exporter):
- pg_stat_database_numbackends - Active connections
- pg_stat_database_xact_commit - Committed transactions
- pg_locks_count - Lock count by type
Example Queries
Request Rate (last 5 minutes):
rate(http_requests_total[5m])
Error Rate:
rate(http_requests_total{status=~"5.."}[5m])
/
rate(http_requests_total[5m])
95th Percentile Latency:
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
Circuit Breaker Open Count:
sum(circuit_breaker_state == 1)
Loki (Logs)
Based on docker/observability/loki/loki-config.yml:
Configuration
Storage:
schema_config:
  configs:
    - from: 2024-01-01
      store: tsdb
      object_store: filesystem
      schema: v13

storage_config:
  tsdb_shipper:
    active_index_directory: /loki/tsdb-index
    cache_location: /loki/tsdb-cache
  filesystem:
    directory: /loki/chunks
Retention:
limits_config:
  retention_period: 720h   # 30 days

compactor:
  retention_enabled: true
  retention_delete_delay: 2h
  compaction_interval: 10m
Ingestion Limits:
limits_config:
  ingestion_rate_mb: 10
  ingestion_burst_size_mb: 20
  per_stream_rate_limit: 5MB
  per_stream_rate_limit_burst: 10MB
Log Labels
Automatically Added by OTel Collector:
- service.name - Service name (e.g., "appserver")
- service.namespace - Namespace (e.g., "appserver")
- level - Log level (debug, info, warn, error)
- deployment.environment - Environment (development, staging, production)
Custom Labels (application-defined):
- app_name - Application name
- user_id - User identifier
- request_id - Request correlation ID
- trace_id - Distributed trace ID
LogQL Queries
Filter by Service:
{service_name="appserver"}
Filter by Level:
{service_name="appserver"} |= "level=error"
Filter by Pattern:
{service_name="appserver"} |~ "circuit.*breaker.*open"
Extract and Count Errors:
sum(rate({service_name="appserver"} |= "level=error" [5m]))
Parse JSON and Filter:
{service_name="appserver"}
| json
| level="error"
| line_format "{{.timestamp}} {{.message}}"
Link Logs to Traces:
{service_name="appserver"} |= "trace_id=abc123"
Structured Logging Best Practices
Use Structured Fields:
logger.Info("User registered",
    telemetry.String("user_id", userID),
    telemetry.String("email", email),
    telemetry.String("trace_id", traceID),
)
Include Context:
- Request ID for correlation
- Trace ID for distributed tracing
- User ID for audit trails
- App name for multi-tenant logs
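One common way to get the trace and span IDs into every log line is to read them from the active span's context. A minimal sketch using the standard library's log/slog rather than the project's telemetry logger (logWithTrace is a hypothetical helper):

import (
    "context"
    "log/slog"

    "go.opentelemetry.io/otel/trace"
)

// logWithTrace adds trace_id and span_id fields when the context carries an
// active span, so Loki log lines can be correlated with Tempo traces.
func logWithTrace(ctx context.Context, logger *slog.Logger, msg string, args ...any) {
    if sc := trace.SpanContextFromContext(ctx); sc.IsValid() {
        args = append(args,
            slog.String("trace_id", sc.TraceID().String()),
            slog.String("span_id", sc.SpanID().String()),
        )
    }
    logger.InfoContext(ctx, msg, args...)
}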
Log Levels:
- DEBUG - Detailed information for development
- INFO - General operational events
- WARN - Warning conditions (recoverable errors)
- ERROR - Error conditions (failed operations)
Tempo (Distributed Tracing)
Based on docker/observability/tempo/tempo-config.yml:
Configuration
Receivers:
distributor:
  receivers:
    otlp:
      protocols:
        grpc:
          endpoint: 0.0.0.0:4317
        http:
          endpoint: 0.0.0.0:4318
Storage:
storage:
  trace:
    backend: local
    local:
      path: /tmp/tempo/blocks

compactor:
  compaction:
    block_retention: 720h   # 30 days
Metrics Generation:
metrics_generator:
  storage:
    remote_write:
      - url: http://prometheus:9090/api/v1/write
        send_exemplars: true
  processors:
    - service-graphs   # Service dependency graphs
    - span-metrics     # RED metrics from spans
Trace Structure
Span Attributes:
- http.method - HTTP method (GET, POST, etc.)
- http.url - Request URL
- http.status_code - Response status
- db.system - Database type (postgresql, redis)
- db.statement - SQL query
- messaging.system - Message broker (rabbitmq)
- error - Boolean error flag
- error.message - Error details
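As an example, a database span might carry these conventions. The following sketch uses plain attribute keys to stay version-agnostic and is not the AppServer's repository code; queryItems is a hypothetical helper:

import (
    "context"
    "database/sql"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/attribute"
    "go.opentelemetry.io/otel/codes"
)

func queryItems(ctx context.Context, db *sql.DB, appID string) (*sql.Rows, error) {
    const stmt = "SELECT * FROM items WHERE app_id = $1"

    ctx, span := otel.Tracer("my-service").Start(ctx, "db.query.items")
    defer span.End()

    // Semantic-convention attributes for database spans.
    span.SetAttributes(
        attribute.String("db.system", "postgresql"),
        attribute.String("db.statement", stmt),
    )

    rows, err := db.QueryContext(ctx, stmt, appID)
    if err != nil {
        span.RecordError(err)
        span.SetStatus(codes.Error, "query failed")
        return nil, err
    }
    return rows, nil
}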
Service Graph:
- Automatically generated from trace data
- Shows service-to-service dependencies
- Visualizes request flows
Example Traces
HTTP Request Trace:
Span: HTTP GET /api/apps/todos/items
├─ Span: Permission Check (OpenFGA)
├─ Span: Database Query (PostgreSQL)
│  └─ Span: SELECT * FROM items WHERE app_id = ?
└─ Span: HTTP Proxy to Backend
   └─ Span: Backend Processing
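The "HTTP Proxy to Backend" child span is typically produced by instrumenting the outgoing HTTP client. A sketch using the otelhttp contrib package (an assumption about the instrumentation approach, not a statement about the AppServer's proxy code; newTracedClient and callBackend are hypothetical):

import (
    "context"
    "net/http"

    "go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp"
)

// newTracedClient returns an http.Client whose requests emit child spans and
// propagate the trace context to the backend via W3C traceparent headers.
func newTracedClient() *http.Client {
    return &http.Client{
        Transport: otelhttp.NewTransport(http.DefaultTransport),
    }
}

func callBackend(ctx context.Context, url string) (*http.Response, error) {
    req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
    if err != nil {
        return nil, err
    }
    // The outgoing-request span becomes a child of the span already in ctx.
    return newTracedClient().Do(req)
}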
Event Publishing Trace:
Span: Publish Event (app.installed)
├─ Span: RabbitMQ Publish
├─ Span: Consumer: Orchestrator
│  ├─ Span: Resolve Dependencies
│  ├─ Span: Pull Docker Image
│  └─ Span: Start Container
└─ Span: Consumer: Permission Invalidation
   └─ Span: Redis DELETE
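For consumer spans to land in the publisher's trace, the trace context has to travel inside the message. A minimal sketch using the OpenTelemetry propagation API over plain message headers (the AppServer's event bus may do this differently; injectTraceContext and extractTraceContext are hypothetical helpers):

import (
    "context"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/propagation"
)

func init() {
    // Use the W3C trace-context propagator for cross-service correlation.
    otel.SetTextMapPropagator(propagation.TraceContext{})
}

// injectTraceContext copies the current trace context into message headers
// (e.g. AMQP headers) before publishing.
func injectTraceContext(ctx context.Context, headers map[string]string) {
    otel.GetTextMapPropagator().Inject(ctx, propagation.MapCarrier(headers))
}

// extractTraceContext restores the publisher's trace context on the consumer
// side so the consumer's spans join the same trace.
func extractTraceContext(ctx context.Context, headers map[string]string) context.Context {
    return otel.GetTextMapPropagator().Extract(ctx, propagation.MapCarrier(headers))
}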
Querying Traces
By Trace ID:
Query: <trace-id>
By Service Name:
{ resource.service.name = "appserver" }
By HTTP Status:
{ span.http.status_code = 500 }
By Duration:
{ duration > 1s }
Complex Query:
{ resource.service.name = "appserver" && span.http.method = "POST" && span.http.status_code = 200 && duration > 500ms }
Grafana (Visualization)
Based on docker/observability/grafana/provisioning/datasources/datasources.yml:
Auto-Provisioned Data Sources
Prometheus (Default):
- URL: http://prometheus:9090
- Query timeout: 60s
- Linked to Tempo for exemplars
Loki:
- URL: http://loki:3100
- Max lines: 1000
- Linked to Tempo for trace correlation
Tempo:
- URL: http://tempo:3200
- Linked to Prometheus for metrics
- Linked to Loki for logs
Access
URL: http://localhost:3000
Default Credentials: admin/admin
WARNING: Change default password in production!
Key Dashboards
AppServer Overview:
- Request rate (RPS)
- Error rate (%)
- Response time (p50, p95, p99)
- Active connections
- Circuit breaker states
Infrastructure Health:
- Database connections and query rate
- Redis cache hit ratio
- RabbitMQ queue depth and consumer count
- Container resource usage (CPU, memory)
Event Bus Monitoring:
- Messages published/consumed per second
- Queue depth by queue name
- Consumer lag
- Failed deliveries
Permission System:
- Permission check rate
- Cache hit ratio (Local, Redis, OpenFGA)
- OpenFGA query latency
- Cache invalidation events
Example Dashboard Panels
Request Rate:
sum(rate(http_requests_total[5m])) by (method, status)
Error Budget:
1 - (
sum(rate(http_requests_total{status=~"5.."}[30d]))
/
sum(rate(http_requests_total[30d]))
)
Cache Hit Ratio:
(
sum(rate(permission_cache_hits{level="local"}[5m]))
+
sum(rate(permission_cache_hits{level="redis"}[5m]))
)
/
sum(rate(permission_cache_requests_total[5m]))
Integration and Correlation
Metrics → Traces (Exemplars)
Prometheus can link metric samples to example traces via exemplars:
rate(http_request_duration_seconds_bucket[5m])
Click on a point → View exemplar trace → Opens trace in Tempo
Logs → Traces
Loki automatically detects trace IDs in logs:
{service_name="appserver"} |= "trace_id"
Click on trace ID → Opens trace in Tempo
Traces → Logs
From a trace span, click "View Logs" → Opens logs in Loki filtered by:
- Time range: ±1 hour from span
- Service name
- Trace ID
Traces → Metrics
From a trace, click "View Metrics" → Opens Prometheus filtered by:
- Time range: ±1 hour from span
- Service name
- HTTP method, status code
Instrumenting Applications
Go Applications
Install Dependencies:
go get go.opentelemetry.io/otel
go get go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc
go get go.opentelemetry.io/otel/exporters/otlp/otlpmetric/otlpmetricgrpc
Initialize Tracer:
import (
    "context"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
    "go.opentelemetry.io/otel/sdk/resource"
    "go.opentelemetry.io/otel/sdk/trace"
    semconv "go.opentelemetry.io/otel/semconv/v1.21.0"
)

func initTracer() (*trace.TracerProvider, error) {
    // Export spans to the OTel Collector over gRPC (plaintext inside the Docker network).
    exporter, err := otlptracegrpc.New(context.Background(),
        otlptracegrpc.WithEndpoint("otel-collector:4317"),
        otlptracegrpc.WithInsecure(),
    )
    if err != nil {
        return nil, err
    }

    // Identify this service in every exported span.
    tp := trace.NewTracerProvider(
        trace.WithBatcher(exporter),
        trace.WithResource(resource.NewWithAttributes(
            semconv.SchemaURL,
            semconv.ServiceNameKey.String("my-service"),
            semconv.ServiceNamespaceKey.String("appserver"),
        )),
    )
    otel.SetTracerProvider(tp)
    return tp, nil
}
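A typical way to wire this into a service entry point, sketched below; shutting the provider down flushes any spans still buffered by the batcher before the process exits:

import (
    "context"
    "log"
    "time"
)

func main() {
    tp, err := initTracer()
    if err != nil {
        log.Fatalf("init tracer: %v", err)
    }
    // Flush buffered spans on exit.
    defer func() {
        ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
        defer cancel()
        if err := tp.Shutdown(ctx); err != nil {
            log.Printf("tracer shutdown: %v", err)
        }
    }()

    // ... start HTTP/gRPC servers here ...
}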
Create Spans:
import "go.opentelemetry.io/otel"
func processRequest(ctx context.Context) error {
tracer := otel.Tracer("my-service")
ctx, span := tracer.Start(ctx, "process_request")
defer span.End()
// Add attributes
span.SetAttributes(
attribute.String("user.id", userID),
attribute.Int("item.count", count),
)
// Add event
span.AddEvent("processing_started")
// Record error
if err != nil {
span.RecordError(err)
span.SetStatus(codes.Error, "processing failed")
return err
}
span.SetStatus(codes.Ok, "success")
return nil
}
TypeScript/Node.js Applications
Install Dependencies:
npm install @opentelemetry/sdk-node
npm install @opentelemetry/auto-instrumentations-node
npm install @opentelemetry/exporter-trace-otlp-grpc
Initialize Instrumentation:
import { NodeSDK } from '@opentelemetry/sdk-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-grpc';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';

const sdk = new NodeSDK({
  traceExporter: new OTLPTraceExporter({
    url: 'http://otel-collector:4317',
  }),
  instrumentations: [getNodeAutoInstrumentations()],
  serviceName: 'my-service',
});

sdk.start();
sdk.start();
Environment Configuration
AppServer Configuration
# Enable telemetry
APPSERVER_METRICS_ENABLED=true
APPSERVER_TRACING_ENABLED=true
APPSERVER_LOG_LEVEL=info
# OTel endpoint
OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4317
OTEL_EXPORTER_OTLP_PROTOCOL=grpc
OTEL_SERVICE_NAME=appserver
OTEL_SERVICE_NAMESPACE=appserver
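When these OTEL_EXPORTER_OTLP_* variables are set, the Go OTLP exporters read them automatically, so the endpoint does not have to be hard-coded (with an http:// endpoint the connection is treated as insecure; verify the exact behavior against your SDK version). A minimal sketch with a hypothetical newTracerProviderFromEnv helper:

import (
    "context"

    "go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
    "go.opentelemetry.io/otel/sdk/trace"
)

func newTracerProviderFromEnv(ctx context.Context) (*trace.TracerProvider, error) {
    // No WithEndpoint: OTEL_EXPORTER_OTLP_ENDPOINT is picked up from the environment.
    exporter, err := otlptracegrpc.New(ctx)
    if err != nil {
        return nil, err
    }
    return trace.NewTracerProvider(trace.WithBatcher(exporter)), nil
}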
Production Configuration
# Reduced log verbosity
APPSERVER_LOG_LEVEL=warn
# Enable all telemetry
APPSERVER_METRICS_ENABLED=true
APPSERVER_TRACING_ENABLED=true
# Deployment environment
DEPLOYMENT_ENVIRONMENT=production
# OTel sampling (reduce overhead)
OTEL_TRACES_SAMPLER=parentbased_traceidratio
OTEL_TRACES_SAMPLER_ARG=0.1 # Sample 10% of traces
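Whether OTEL_TRACES_SAMPLER is honored automatically depends on the language SDK and version; in Go the same parent-based 10% sampling can also be configured explicitly. A sketch with a hypothetical newSampledProvider helper:

import (
    "go.opentelemetry.io/otel/sdk/trace"
)

// newSampledProvider keeps 10% of root traces while always respecting the
// sampling decision of an incoming parent span.
func newSampledProvider(exporter trace.SpanExporter) *trace.TracerProvider {
    return trace.NewTracerProvider(
        trace.WithBatcher(exporter),
        trace.WithSampler(trace.ParentBased(trace.TraceIDRatioBased(0.1))),
    )
}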
Alerting
Prometheus Alerting Rules
Create docker/observability/prometheus/rules/appserver.yml:
groups:
  - name: appserver
    interval: 30s
    rules:
      # High error rate
      - alert: HighErrorRate
        expr: |
          rate(http_requests_total{status=~"5.."}[5m])
          /
          rate(http_requests_total[5m])
          > 0.05
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High error rate on {{ $labels.service }}"
          description: "Error rate is {{ $value | humanizePercentage }}"

      # High latency
      - alert: HighLatency
        expr: |
          histogram_quantile(0.95,
            rate(http_request_duration_seconds_bucket[5m])
          ) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High latency on {{ $labels.service }}"
          description: "P95 latency is {{ $value }}s"
      # Circuit breaker open (aggregated by name so the alert fires per breaker)
      - alert: CircuitBreakerOpen
        expr: |
          sum by (name) (circuit_breaker_state == 1) > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Circuit breaker open for {{ $labels.name }}"
          description: "Circuit breaker has been open for 5 minutes"
      # Database connection pool exhausted
      - alert: DatabaseConnectionPoolExhausted
        expr: |
          pg_stat_database_numbackends / pg_settings_max_connections > 0.9
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Database connection pool nearly exhausted"
          description: "Using {{ $value | humanizePercentage }} of connections"
Loki Alerting Rules
Create docker/observability/loki/rules/appserver.yml:
groups:
  - name: appserver-logs
    interval: 1m
    rules:
      # High error log rate
      - alert: HighErrorLogRate
        expr: |
          sum(rate({service_name="appserver"} |= "level=error" [5m])) > 10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High error log rate"
          description: "More than 10 errors per second"

      # Critical errors
      - alert: CriticalError
        expr: |
          sum(count_over_time({service_name="appserver"} |= "panic" [5m])) > 0
        labels:
          severity: critical
        annotations:
          summary: "Critical error detected"
          description: "Panic detected in logs"
Best Practices
Metrics
✅ DO:
- Use RED metrics (Rate, Errors, Duration) for services
- Use USE metrics (Utilization, Saturation, Errors) for resources
- Add labels for grouping (service, method, status)
- Use histograms for latency (not gauges)
- Monitor circuit breaker states
❌ DON'T:
- Use high-cardinality labels (user IDs, trace IDs)
- Create too many metrics (causes performance issues)
- Skip units in metric names
Logs
✅ DO:
- Use structured logging (JSON)
- Include trace ID for correlation
- Add contextual fields (user_id, request_id)
- Use appropriate log levels
- Log errors with stack traces
❌ DON'T:
- Log sensitive data (passwords, tokens)
- Log at DEBUG level in production
- Use unstructured text logs
Traces
✅ DO:
- Trace external service calls
- Trace database queries
- Add meaningful span names
- Include error details in spans
- Use sampling in high-traffic systems
❌ DON'T:
- Trace every function (too granular)
- Include sensitive data in span attributes
- Skip error recording
Dashboards
✅ DO:
- Create service-level dashboards
- Include SLI/SLO metrics
- Link to related dashboards
- Use consistent time ranges
- Add annotations for deploys
❌ DON'T:
- Create dashboards without a purpose
- Use too many panels (information overload)
- Skip documentation
Troubleshooting
High Cardinality Issues
Symptom: Prometheus out of memory, slow queries
Solution:
# Check cardinality
curl http://localhost:9090/api/v1/status/tsdb
# Identify high-cardinality metrics
curl http://localhost:9090/api/v1/label/__name__/values
# Drop high-cardinality labels (metric_relabel_configs in prometheus.yml)
metric_relabel_configs:
  - action: labeldrop
    regex: "user_id|trace_id|request_id"
Missing Traces
Check OTel Collector:
# View collector logs
docker logs otel-collector
# Check receiver endpoint
curl http://localhost:4318/v1/traces
Check Tempo:
# View Tempo logs
docker logs tempo
# Check ingestion
curl http://localhost:3200/metrics | grep tempo_ingester
Log Ingestion Failures
Check Rate Limits:
# View Loki metrics
curl http://localhost:3100/metrics | grep loki_request
# Increase limits in loki-config.yml
limits_config:
  ingestion_rate_mb: 20
  ingestion_burst_size_mb: 40
Code References
| Component | File | Purpose |
|---|---|---|
| OTel Collector | docker/observability/otel-collector/otel-config.yml | Collector configuration |
| Prometheus | docker/observability/prometheus/prometheus.yml | Metrics scraping |
| Loki | docker/observability/loki/loki-config.yml | Log aggregation |
| Tempo | docker/observability/tempo/tempo-config.yml | Trace storage |
| Grafana | docker/observability/grafana/provisioning/ | Datasources and dashboards |
Related Topics
- Environment Variable Reference - Observability configuration
- Local Development Stack - Running observability stack locally
- Platform Architecture - Infrastructure layer