
Observability Stack

Complete observability setup with OpenTelemetry, Prometheus, Loki, Tempo, and Grafana for metrics, logs, and distributed tracing.

Overview

The Easy AppServer observability stack follows the three pillars of observability:

Metrics (Prometheus):

  • Time-series data for monitoring system health
  • Aggregated statistics and trends
  • Alerting based on thresholds
  • RED metrics (Rate, Errors, Duration)

Logs (Loki):

  • Event records with context and details
  • Debugging and troubleshooting
  • Audit trails
  • Structured logging with labels

Traces (Tempo):

  • End-to-end request flows across services
  • Performance bottleneck identification
  • Service dependency mapping
  • Correlation with metrics and logs

Architecture

┌──────────────────────────────────────────────────────────────┐
│                         Applications                          │
│      (AppServer, Shell, Apps, Infrastructure Services)        │
└───────────────────────────────┬──────────────────────────────┘
                                │ OTLP (OpenTelemetry Protocol)
                                │ gRPC:4317 / HTTP:4318
                                ↓
┌──────────────────────────────────────────────────────────────┐
│                    OpenTelemetry Collector                    │
│              Receivers → Processors → Exporters               │
└──────────┬─────────────────────┬─────────────────────┬───────┘
           │ Metrics             │ Logs                │ Traces
           ↓                     ↓                     ↓
      ┌──────────┐          ┌──────────┐          ┌──────────┐
      │Prometheus│          │   Loki   │          │  Tempo   │
      │  :9090   │          │  :3100   │          │  :3200   │
      └────┬─────┘          └────┬─────┘          └────┬─────┘
           └─────────────────────┼─────────────────────┘
                                 ↓
                         ┌──────────────┐
                         │   Grafana    │
                         │    :3000     │
                         └──────────────┘

OpenTelemetry Collector

Based on docker/observability/otel-collector/otel-config.yml:

Purpose

The OpenTelemetry Collector is the central telemetry hub that:

  • Receives telemetry data from applications via OTLP
  • Processes and enriches telemetry with additional metadata
  • Routes telemetry to appropriate backends (Prometheus, Loki, Tempo)
  • Provides vendor-agnostic instrumentation

Configuration

Receivers:

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317   # gRPC receiver
      http:
        endpoint: 0.0.0.0:4318   # HTTP receiver

Processors:

processors:
  # Batch telemetry to reduce connections
  batch:
    timeout: 10s
    send_batch_size: 1024

  # Add deployment metadata
  resource:
    attributes:
      - key: service.namespace
        value: "appserver"
        action: upsert
      - key: deployment.environment
        value: "${DEPLOYMENT_ENVIRONMENT:-development}"
        action: upsert

  # Prevent OOM
  memory_limiter:
    check_interval: 1s
    limit_percentage: 80

Exporters:

exporters:
  prometheus:
    endpoint: "0.0.0.0:8889"
    namespace: appserver

  loki:
    endpoint: "http://loki:3100/loki/api/v1/push"

  otlp/tempo:
    endpoint: "tempo:4317"

Pipelines

Traces Pipeline:

OTLP Receiver → Memory Limiter → Batch → Resource → Tempo

Metrics Pipeline:

OTLP Receiver → Memory Limiter → Batch → Resource → Prometheus

Logs Pipeline:

OTLP Receiver → Memory Limiter → Batch → Resource → Loki
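
In the collector configuration these pipelines are wired together in the service section. A sketch consistent with the receivers, processors, and exporters shown above (the actual file may order or name things differently):

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch, resource]
      exporters: [otlp/tempo]
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, batch, resource]
      exporters: [prometheus]
    logs:
      receivers: [otlp]
      processors: [memory_limiter, batch, resource]
      exporters: [loki]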

Prometheus (Metrics)

Based on docker/observability/prometheus/prometheus.yml:

Configuration

Scrape Targets:

scrape_configs:
  # OpenTelemetry Collector metrics
  - job_name: 'otel-collector'
    scrape_interval: 10s
    static_configs:
      - targets: ['otel-collector:8888']

  # OTel Prometheus exporter
  - job_name: 'otel-collector-prometheus-exporter'
    scrape_interval: 15s
    static_configs:
      - targets: ['otel-collector:8889']

  # Observability stack self-monitoring
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'grafana'
    static_configs:
      - targets: ['grafana:3000']

  - job_name: 'loki'
    static_configs:
      - targets: ['loki:3100']

  - job_name: 'tempo'
    static_configs:
      - targets: ['tempo:3200']

  # Infrastructure services
  - job_name: 'rabbitmq'
    static_configs:
      - targets: ['rabbitmq:15692']

Storage:

Retention is configured through Prometheus command-line flags (typically in the container's command) rather than in prometheus.yml:

--storage.tsdb.retention.time=15d    # Keep metrics for 15 days
--storage.tsdb.retention.size=10GB   # Maximum storage size
--storage.tsdb.wal-compression       # Compress the write-ahead log

Key Metrics

AppServer Metrics (when instrumented):

  • http_requests_total - Total HTTP requests
  • http_request_duration_seconds - Request latency histogram
  • http_requests_in_flight - Current concurrent requests
  • grpc_server_handled_total - gRPC requests handled
  • grpc_server_handling_seconds - gRPC latency

Circuit Breaker Metrics (from pkg/v2/infrastructure/circuitbreaker/metrics.go):

circuit_breaker_state{name="upstream"} 0|1|2  # CLOSED|OPEN|HALF-OPEN
circuit_breaker_requests_total{name="upstream", result="success|failure|rejected"}
circuit_breaker_state_changes_total{name="upstream", from="closed", to="open"}

Event Bus Metrics:

  • rabbitmq_queue_messages - Queue depth
  • rabbitmq_queue_consumers - Active consumers
  • eventbus_messages_published_total - Published events
  • eventbus_messages_consumed_total - Consumed events

Database Metrics (via PostgreSQL exporter):

  • pg_stat_database_numbackends - Active connections
  • pg_stat_database_xact_commit - Committed transactions
  • pg_locks_count - Lock count by type
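
These metrics come from a PostgreSQL exporter scraped by Prometheus. A sketch of the scrape job; the service name postgres-exporter and port 9187 follow the standard postgres_exporter defaults and are assumptions here:

  - job_name: 'postgres'
    static_configs:
      - targets: ['postgres-exporter:9187']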

Example Queries

Request Rate (last 5 minutes):

rate(http_requests_total[5m])

Error Rate:

rate(http_requests_total{status=~"5.."}[5m])
/
rate(http_requests_total[5m])

95th Percentile Latency:

histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))

Circuit Breaker Open Count:

sum(circuit_breaker_state == 1)

Loki (Logs)

Based on docker/observability/loki/loki-config.yml:

Configuration

Storage:

schema_config:
  configs:
    - from: 2024-01-01
      store: tsdb
      object_store: filesystem
      schema: v13

storage_config:
  tsdb_shipper:
    active_index_directory: /loki/tsdb-index
    cache_location: /loki/tsdb-cache
  filesystem:
    directory: /loki/chunks

Retention:

limits_config:
  retention_period: 720h   # 30 days

compactor:
  retention_enabled: true
  retention_delete_delay: 2h
  compaction_interval: 10m

Ingestion Limits:

limits_config:
  ingestion_rate_mb: 10
  ingestion_burst_size_mb: 20
  per_stream_rate_limit: 5MB
  per_stream_rate_limit_burst: 10MB

Log Labels

Automatically Added by OTel Collector:

  • service.name - Service name (e.g., "appserver")
  • service.namespace - Namespace (e.g., "appserver")
  • level - Log level (debug, info, warn, error)
  • deployment.environment - Environment (development, staging, production)

Custom Labels (application-defined):

  • app_name - Application name
  • user_id - User identifier
  • request_id - Request correlation ID
  • trace_id - Distributed trace ID

LogQL Queries

Filter by Service:

{service_name="appserver"}

Filter by Level:

{service_name="appserver"} |= "level=error"

Filter by Pattern:

{service_name="appserver"} |~ "circuit.*breaker.*open"

Extract and Count Errors:

sum(rate({service_name="appserver"} |= "level=error" [5m]))

Parse JSON and Filter:

{service_name="appserver"}
| json
| level="error"
| line_format "{{.timestamp}} {{.message}}"

Link Logs to Traces:

{service_name="appserver"} |= "trace_id=abc123"

Structured Logging Best Practices

Use Structured Fields:

logger.Info("User registered",
	telemetry.String("user_id", userID),
	telemetry.String("email", email),
	telemetry.String("trace_id", traceID),
)

Include Context:

  • Request ID for correlation
  • Trace ID for distributed tracing
  • User ID for audit trails
  • App name for multi-tenant logs

Log Levels:

  • DEBUG - Detailed information for development
  • INFO - General operational events
  • WARN - Warning conditions (recoverable errors)
  • ERROR - Error conditions (failed operations)

Tempo (Distributed Tracing)

Based on docker/observability/tempo/tempo-config.yml:

Configuration

Receivers:

distributor:
  receivers:
    otlp:
      protocols:
        grpc:
          endpoint: 0.0.0.0:4317
        http:
          endpoint: 0.0.0.0:4318

Storage:

storage:
  trace:
    backend: local
    local:
      path: /tmp/tempo/blocks

compactor:
  compaction:
    block_retention: 720h   # 30 days

Metrics Generation:

metrics_generator:
  storage:
    remote_write:
      - url: http://prometheus:9090/api/v1/write
        send_exemplars: true

  processors:
    - service-graphs   # Service dependency graphs
    - span-metrics     # RED metrics from spans
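
The remote_write target above requires Prometheus to accept remote writes, which it does not by default. A sketch of the flag to pass to the Prometheus container (shown docker-compose style; the config path is an assumption):

command:
  - '--config.file=/etc/prometheus/prometheus.yml'
  - '--web.enable-remote-write-receiver'   # accept remote_write from Tempo's metrics generator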

Trace Structure

Span Attributes:

  • http.method - HTTP method (GET, POST, etc.)
  • http.url - Request URL
  • http.status_code - Response status
  • db.system - Database type (postgresql, redis)
  • db.statement - SQL query
  • messaging.system - Message broker (rabbitmq)
  • error - Boolean error flag
  • error.message - Error details

Service Graph:

  • Automatically generated from trace data
  • Shows service-to-service dependencies
  • Visualizes request flows

Example Traces

HTTP Request Trace:

Span: HTTP GET /api/apps/todos/items
├─ Span: Permission Check (OpenFGA)
├─ Span: Database Query (PostgreSQL)
│  └─ Span: SELECT * FROM items WHERE app_id = ?
└─ Span: HTTP Proxy to Backend
   └─ Span: Backend Processing

Event Publishing Trace:

Span: Publish Event (app.installed)
├─ Span: RabbitMQ Publish
├─ Span: Consumer: Orchestrator
│  ├─ Span: Resolve Dependencies
│  ├─ Span: Pull Docker Image
│  └─ Span: Start Container
└─ Span: Consumer: Permission Invalidation
   └─ Span: Redis DELETE

Querying Traces

By Trace ID:

Query: <trace-id>

By Service Name:

{resource.service.name="appserver"}

By HTTP Status:

{span.http.status_code=500}

By Duration:

{duration > 1s}

Complex Query:

{resource.service.name="appserver" && span.http.method="POST" && span.http.status_code=200 && duration > 500ms}

Grafana (Visualization)

Based on docker/observability/grafana/provisioning/datasources/datasources.yml:

Auto-Provisioned Data Sources

Prometheus (Default):

  • URL: http://prometheus:9090
  • Query timeout: 60s
  • Linked to Tempo for exemplars

Loki:

  • URL: http://loki:3100
  • Max lines: 1000
  • Linked to Tempo for trace correlation

Tempo:

  • URL: http://tempo:3200
  • Linked to Prometheus for metrics
  • Linked to Loki for logs
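
The provisioning file wires these links together. The exact contents may differ from this sketch; derivedFields, tracesToLogsV2, and tracesToMetrics are Grafana's standard provisioning fields for such cross-links, while the uids and the trace_id regex are assumptions:

apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    uid: prometheus
    url: http://prometheus:9090
    isDefault: true

  - name: Loki
    type: loki
    uid: loki
    url: http://loki:3100
    jsonData:
      derivedFields:
        # Turn "trace_id=<id>" occurrences in log lines into links to Tempo
        - name: TraceID
          matcherRegex: "trace_id=(\\w+)"
          datasourceUid: tempo
          url: "$${__value.raw}"

  - name: Tempo
    type: tempo
    uid: tempo
    url: http://tempo:3200
    jsonData:
      tracesToLogsV2:
        datasourceUid: loki        # "View Logs" from a span opens Loki
      tracesToMetrics:
        datasourceUid: prometheus  # "View Metrics" from a trace opens Prometheus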

Access

URL: http://localhost:3000
Default Credentials: admin/admin

WARNING: Change default password in production!

Key Dashboards

AppServer Overview:

  • Request rate (RPS)
  • Error rate (%)
  • Response time (p50, p95, p99)
  • Active connections
  • Circuit breaker states

Infrastructure Health:

  • Database connections and query rate
  • Redis cache hit ratio
  • RabbitMQ queue depth and consumer count
  • Container resource usage (CPU, memory)

Event Bus Monitoring:

  • Messages published/consumed per second
  • Queue depth by queue name
  • Consumer lag
  • Failed deliveries

Permission System:

  • Permission check rate
  • Cache hit ratio (Local, Redis, OpenFGA)
  • OpenFGA query latency
  • Cache invalidation events

Example Dashboard Panels

Request Rate:

sum(rate(http_requests_total[5m])) by (method, status)

Error Budget:

1 - (
sum(rate(http_requests_total{status=~"5.."}[30d]))
/
sum(rate(http_requests_total[30d]))
)

Cache Hit Ratio:

(
sum(rate(permission_cache_hits{level="local"}[5m]))
+
sum(rate(permission_cache_hits{level="redis"}[5m]))
)
/
sum(rate(permission_cache_requests_total[5m]))

Integration and Correlation

Metrics → Traces (Exemplars)

Prometheus can link metrics to example traces:

rate(http_request_duration_seconds_bucket[5m])

Click on a point → View exemplar trace → Opens trace in Tempo
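
Exemplar storage is an opt-in Prometheus feature. A sketch of the feature flag, again docker-compose style (the config path is an assumption):

command:
  - '--config.file=/etc/prometheus/prometheus.yml'
  - '--enable-feature=exemplar-storage'   # keep exemplars alongside histogram samples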

Logs → Traces

Loki automatically detects trace IDs in logs:

{service_name="appserver"} |= "trace_id"

Click on trace ID → Opens trace in Tempo

Traces → Logs

From a trace span, click "View Logs" → Opens logs in Loki filtered by:

  • Time range: ±1 hour from span
  • Service name
  • Trace ID

Traces → Metrics

From a trace, click "View Metrics" → Opens Prometheus filtered by:

  • Time range: ±1 hour from span
  • Service name
  • HTTP method, status code

Instrumenting Applications

Go Applications

Install Dependencies:

go get go.opentelemetry.io/otel
go get go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc
go get go.opentelemetry.io/otel/exporters/otlp/otlpmetric/otlpmetricgrpc

Initialize Tracer:

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
	"go.opentelemetry.io/otel/sdk/resource"
	"go.opentelemetry.io/otel/sdk/trace"
	semconv "go.opentelemetry.io/otel/semconv/v1.21.0"
)

func initTracer() (*trace.TracerProvider, error) {
	// Export spans over gRPC to the OTel Collector
	exporter, err := otlptracegrpc.New(context.Background(),
		otlptracegrpc.WithEndpoint("otel-collector:4317"),
		otlptracegrpc.WithInsecure(),
	)
	if err != nil {
		return nil, err
	}

	// Batch spans and attach service name/namespace resource attributes
	tp := trace.NewTracerProvider(
		trace.WithBatcher(exporter),
		trace.WithResource(resource.NewWithAttributes(
			semconv.SchemaURL,
			semconv.ServiceNameKey.String("my-service"),
			semconv.ServiceNamespaceKey.String("appserver"),
		)),
	)

	otel.SetTracerProvider(tp)
	return tp, nil
}

Create Spans:

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/codes"
)

func processRequest(ctx context.Context, userID string, count int) error {
	tracer := otel.Tracer("my-service")
	ctx, span := tracer.Start(ctx, "process_request")
	defer span.End()

	// Add attributes
	span.SetAttributes(
		attribute.String("user.id", userID),
		attribute.Int("item.count", count),
	)

	// Add event
	span.AddEvent("processing_started")

	// Record error (doWork is a placeholder for the actual business logic)
	if err := doWork(ctx); err != nil {
		span.RecordError(err)
		span.SetStatus(codes.Error, "processing failed")
		return err
	}

	span.SetStatus(codes.Ok, "success")
	return nil
}
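
The go get commands above also pull in the OTLP metric exporter, but only tracing is initialized in these examples. Below is a minimal sketch of metric initialization against the same collector endpoint; the function name initMeter and the 15-second export interval are illustrative, not taken from the codebase.

Initialize Meter (sketch):

import (
	"context"
	"time"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/exporters/otlp/otlpmetric/otlpmetricgrpc"
	sdkmetric "go.opentelemetry.io/otel/sdk/metric"
)

func initMeter(ctx context.Context) (*sdkmetric.MeterProvider, error) {
	// Export metrics over gRPC to the OTel Collector (same OTLP endpoint as traces)
	exporter, err := otlpmetricgrpc.New(ctx,
		otlpmetricgrpc.WithEndpoint("otel-collector:4317"),
		otlpmetricgrpc.WithInsecure(),
	)
	if err != nil {
		return nil, err
	}

	// Push metrics periodically through the collector's metrics pipeline
	mp := sdkmetric.NewMeterProvider(
		sdkmetric.WithReader(
			sdkmetric.NewPeriodicReader(exporter, sdkmetric.WithInterval(15*time.Second)),
		),
	)
	otel.SetMeterProvider(mp)
	return mp, nil
}

Counters and histograms can then be created from otel.Meter("my-service") and will reach Prometheus via the collector.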

TypeScript/Node.js Applications

Install Dependencies:

npm install @opentelemetry/sdk-node
npm install @opentelemetry/auto-instrumentations-node
npm install @opentelemetry/exporter-trace-otlp-grpc

Initialize Instrumentation:

import { NodeSDK } from '@opentelemetry/sdk-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-grpc';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';

const sdk = new NodeSDK({
  traceExporter: new OTLPTraceExporter({
    url: 'http://otel-collector:4317',
  }),
  instrumentations: [getNodeAutoInstrumentations()],
  serviceName: 'my-service',
});

sdk.start();

Environment Configuration

AppServer Configuration

# Enable telemetry
APPSERVER_METRICS_ENABLED=true
APPSERVER_TRACING_ENABLED=true
APPSERVER_LOG_LEVEL=info

# OTel endpoint
OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4317
OTEL_EXPORTER_OTLP_PROTOCOL=grpc
OTEL_SERVICE_NAME=appserver
OTEL_SERVICE_NAMESPACE=appserver

Production Configuration

# Reduced log verbosity
APPSERVER_LOG_LEVEL=warn

# Enable all telemetry
APPSERVER_METRICS_ENABLED=true
APPSERVER_TRACING_ENABLED=true

# Deployment environment
DEPLOYMENT_ENVIRONMENT=production

# OTel sampling (reduce overhead)
OTEL_TRACES_SAMPLER=parentbased_traceidratio
OTEL_TRACES_SAMPLER_ARG=0.1 # Sample 10% of traces

Alerting

Prometheus Alerting Rules

Create docker/observability/prometheus/rules/appserver.yml:

groups:
  - name: appserver
    interval: 30s
    rules:
      # High error rate
      - alert: HighErrorRate
        expr: |
          rate(http_requests_total{status=~"5.."}[5m])
            /
          rate(http_requests_total[5m])
          > 0.05
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High error rate on {{ $labels.service }}"
          description: "Error rate is {{ $value | humanizePercentage }}"

      # High latency
      - alert: HighLatency
        expr: |
          histogram_quantile(0.95,
            rate(http_request_duration_seconds_bucket[5m])
          ) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High latency on {{ $labels.service }}"
          description: "P95 latency is {{ $value }}s"

      # Circuit breaker open
      - alert: CircuitBreakerOpen
        expr: |
          sum by (name) (circuit_breaker_state == 1) > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Circuit breaker open for {{ $labels.name }}"
          description: "Circuit breaker has been open for 5 minutes"

      # Database connection pool exhausted
      - alert: DatabaseConnectionPoolExhausted
        expr: |
          sum(pg_stat_database_numbackends) / max(pg_settings_max_connections) > 0.9
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Database connection pool nearly exhausted"
          description: "Using {{ $value | humanizePercentage }} of connections"

Loki Alerting Rules

Create docker/observability/loki/rules/appserver.yml:

groups:
  - name: appserver-logs
    interval: 1m
    rules:
      # High error log rate
      - alert: HighErrorLogRate
        expr: |
          sum(rate({service_name="appserver"} |= "level=error" [5m])) > 10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High error log rate"
          description: "More than 10 errors per second"

      # Critical errors
      - alert: CriticalError
        expr: |
          sum(count_over_time({service_name="appserver"} |= "panic" [5m])) > 0
        labels:
          severity: critical
        annotations:
          summary: "Critical error detected"
          description: "Panic detected in logs"

Best Practices

Metrics

DO:

  • Use RED metrics (Rate, Errors, Duration) for services
  • Use USE metrics (Utilization, Saturation, Errors) for resources
  • Add labels for grouping (service, method, status)
  • Use histograms for latency (not gauges)
  • Monitor circuit breaker states

DON'T:

  • Use high-cardinality labels (user IDs, trace IDs)
  • Create too many metrics (causes performance issues)
  • Skip units in metric names

Logs

DO:

  • Use structured logging (JSON)
  • Include trace ID for correlation
  • Add contextual fields (user_id, request_id)
  • Use appropriate log levels
  • Log errors with stack traces

DON'T:

  • Log sensitive data (passwords, tokens)
  • Log at DEBUG level in production
  • Use unstructured text logs

Traces

DO:

  • Trace external service calls
  • Trace database queries
  • Add meaningful span names
  • Include error details in spans
  • Use sampling in high-traffic systems

DON'T:

  • Trace every function (too granular)
  • Include sensitive data in span attributes
  • Skip error recording

Dashboards

DO:

  • Create service-level dashboards
  • Include SLI/SLO metrics
  • Link to related dashboards
  • Use consistent time ranges
  • Add annotations for deploys

DON'T:

  • Create dashboards without a purpose
  • Use too many panels (information overload)
  • Skip documentation

Troubleshooting

High Cardinality Issues

Symptom: Prometheus out of memory, slow queries

Solution:

# Check cardinality
curl http://localhost:9090/api/v1/status/tsdb

# Identify high-cardinality metrics
curl http://localhost:9090/api/v1/label/__name__/values

# Drop high-cardinality labels per scrape job in prometheus.yml
metric_relabel_configs:
  - action: labeldrop
    regex: "user_id|trace_id|request_id"

Missing Traces

Check OTel Collector:

# View collector logs
docker logs otel-collector

# Check receiver endpoint
curl http://localhost:4318/v1/traces

Check Tempo:

# View Tempo logs
docker logs tempo

# Check ingestion
curl http://localhost:3200/metrics | grep tempo_ingester

Log Ingestion Failures

Check Rate Limits:

# View Loki metrics
curl http://localhost:3100/metrics | grep loki_request

# Increase limits in loki-config.yml
limits_config:
  ingestion_rate_mb: 20
  ingestion_burst_size_mb: 40

Code References

Component        File                                                   Purpose
OTel Collector   docker/observability/otel-collector/otel-config.yml   Collector configuration
Prometheus       docker/observability/prometheus/prometheus.yml        Metrics scraping
Loki             docker/observability/loki/loki-config.yml             Log aggregation
Tempo            docker/observability/tempo/tempo-config.yml           Trace storage
Grafana          docker/observability/grafana/provisioning/            Datasources and dashboards