Observability Stack
Complete observability setup with OpenTelemetry, Prometheus, Loki, Tempo, and Grafana for metrics, logs, and distributed tracing.
Overview
The Easy AppServer observability stack follows the three pillars of observability:
Metrics (Prometheus):
- Time-series data for monitoring system health
- Aggregated statistics and trends
- Alerting based on thresholds
- RED metrics (Rate, Errors, Duration)
Logs (Loki):
- Event records with context and details
- Debugging and troubleshooting
- Audit trails
- Structured logging with labels
Traces (Tempo):
- End-to-end request flows across services
- Performance bottleneck identification
- Service dependency mapping
- Correlation with metrics and logs
Architecture
┌────────────────────────────────────────────────────────────┐
│                        Applications                        │
│     (AppServer, Shell, Apps, Infrastructure Services)      │
└─────────────────┬──────────────────────────────────────────┘
                  │ OTLP (OpenTelemetry Protocol)
                  │ gRPC:4317 / HTTP:4318
                  ↓
┌────────────────────────────────────────────────────────────┐
│                  OpenTelemetry Collector                   │
│             Receivers → Processors → Exporters             │
└─────┬────────────────┬──────────────────┬──────────────────┘
      │ Metrics        │ Logs             │ Traces
      ↓                ↓                  ↓
 ┌──────────┐     ┌──────────┐       ┌──────────┐
 │Prometheus│     │   Loki   │       │  Tempo   │
 │  :9090   │     │  :3100   │       │  :3200   │
 └──────────┘     └──────────┘       └──────────┘
      └────────────────┬──────────────────┘
                       ↓
                ┌──────────────┐
                │   Grafana    │
                │    :3000     │
                └──────────────┘
OpenTelemetry Collector
Based on docker/observability/otel-collector/otel-config.yml:
Purpose
The OpenTelemetry Collector is the central telemetry hub that:
- Receives telemetry data from applications via OTLP
- Processes and enriches telemetry with additional metadata
- Routes telemetry to appropriate backends (Prometheus, Loki, Tempo)
- Provides vendor-agnostic instrumentation
Configuration
Receivers:
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317   # gRPC receiver
      http:
        endpoint: 0.0.0.0:4318   # HTTP receiver
Processors:
processors:
  # Batch telemetry to reduce outgoing connections
  batch:
    timeout: 10s
    send_batch_size: 1024

  # Add deployment metadata
  resource:
    attributes:
      - key: service.namespace
        value: "appserver"
      - key: deployment.environment
        value: "${DEPLOYMENT_ENVIRONMENT:-development}"

  # Prevent OOM
  memory_limiter:
    check_interval: 1s
    limit_percentage: 80
Exporters:
exporters:
  prometheus:
    endpoint: "0.0.0.0:8889"
    namespace: appserver

  loki:
    endpoint: "http://loki:3100/loki/api/v1/push"

  otlp/tempo:
    endpoint: "tempo:4317"
Pipelines
Traces Pipeline:
OTLP Receiver → Memory Limiter → Batch → Resource → Tempo
Metrics Pipeline:
OTLP Receiver → Memory Limiter → Batch → Resource → Prometheus
Logs Pipeline:
OTLP Receiver → Memory Limiter → Batch → Resource → Loki
Prometheus (Metrics)
Based on docker/observability/prometheus/prometheus.yml:
Configuration
Scrape Targets:
scrape_configs:
  # OpenTelemetry Collector metrics
  - job_name: 'otel-collector'
    scrape_interval: 10s
    static_configs:
      - targets: ['otel-collector:8888']

  # OTel Prometheus exporter
  - job_name: 'otel-collector-prometheus-exporter'
    scrape_interval: 15s
    static_configs:
      - targets: ['otel-collector:8889']

  # Observability stack self-monitoring
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'grafana'
    static_configs:
      - targets: ['grafana:3000']

  - job_name: 'loki'
    static_configs:
      - targets: ['loki:3100']

  - job_name: 'tempo'
    static_configs:
      - targets: ['tempo:3200']

  # Infrastructure services
  - job_name: 'rabbitmq'
    static_configs:
      - targets: ['rabbitmq:15692']
Storage:
storage:
  tsdb:
    retention:
      time: 15d    # Keep metrics for 15 days
      size: 10GB   # Maximum storage size
    wal_compression: true
Key Metrics
AppServer Metrics (when instrumented):
- http_requests_total - Total HTTP requests
- http_request_duration_seconds - Request latency histogram
- http_requests_in_flight - Current concurrent requests
- grpc_server_handled_total - gRPC requests handled
- grpc_server_handling_seconds - gRPC latency
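The exact instrumentation depends on the AppServer's middleware; as an illustrative sketch only (not the AppServer's actual code), HTTP RED metrics with these names could be exposed via prometheus/client_golang roughly like this, where Metrics and statusRecorder are hypothetical names:

import (
    "net/http"
    "strconv"
    "time"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promauto"
)

var (
    httpRequestsTotal = promauto.NewCounterVec(prometheus.CounterOpts{
        Name: "http_requests_total",
        Help: "Total HTTP requests",
    }, []string{"method", "status"})

    httpRequestDuration = promauto.NewHistogramVec(prometheus.HistogramOpts{
        Name:    "http_request_duration_seconds",
        Help:    "Request latency histogram",
        Buckets: prometheus.DefBuckets,
    }, []string{"method"})
)

// statusRecorder captures the status code written by the wrapped handler.
type statusRecorder struct {
    http.ResponseWriter
    status int
}

func (r *statusRecorder) WriteHeader(code int) {
    r.status = code
    r.ResponseWriter.WriteHeader(code)
}

// Metrics wraps an http.Handler and records rate, errors, and duration per request.
func Metrics(next http.Handler) http.Handler {
    return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        rec := &statusRecorder{ResponseWriter: w, status: http.StatusOK}
        start := time.Now()
        next.ServeHTTP(rec, r)
        httpRequestsTotal.WithLabelValues(r.Method, strconv.Itoa(rec.status)).Inc()
        httpRequestDuration.WithLabelValues(r.Method).Observe(time.Since(start).Seconds())
    })
}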
Circuit Breaker Metrics (from pkg/v2/infrastructure/circuitbreaker/metrics.go):
circuit_breaker_state{name="upstream"} 0|1|2 # CLOSED|OPEN|HALF-OPEN
circuit_breaker_requests_total{name="upstream", result="success|failure|rejected"}
circuit_breaker_state_changes_total{name="upstream", from="closed", to="open"}
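The real gauge lives in pkg/v2/infrastructure/circuitbreaker/metrics.go; a simplified sketch of how such state metrics could be maintained with client_golang (illustrative only, recordStateChange is a hypothetical hook):

import (
    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promauto"
)

var (
    // 0 = CLOSED, 1 = OPEN, 2 = HALF-OPEN
    cbState = promauto.NewGaugeVec(prometheus.GaugeOpts{
        Name: "circuit_breaker_state",
        Help: "Current circuit breaker state (0=closed, 1=open, 2=half-open)",
    }, []string{"name"})

    cbStateChanges = promauto.NewCounterVec(prometheus.CounterOpts{
        Name: "circuit_breaker_state_changes_total",
        Help: "Circuit breaker state transitions",
    }, []string{"name", "from", "to"})
)

// recordStateChange would be called from the breaker's state-change callback.
func recordStateChange(name, from, to string, newState float64) {
    cbState.WithLabelValues(name).Set(newState)
    cbStateChanges.WithLabelValues(name, from, to).Inc()
}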
Event Bus Metrics:
- rabbitmq_queue_messages - Queue depth
- rabbitmq_queue_consumers - Active consumers
- eventbus_messages_published_total - Published events
- eventbus_messages_consumed_total - Consumed events
Database Metrics (via PostgreSQL exporter):
- pg_stat_database_numbackends - Active connections
- pg_stat_database_xact_commit - Committed transactions
- pg_locks_count - Lock count by type
Example Queries
Request Rate (last 5 minutes):
rate(http_requests_total[5m])
Error Rate:
rate(http_requests_total{status=~"5.."}[5m])
/
rate(http_requests_total[5m])
95th Percentile Latency:
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
Circuit Breaker Open Count:
sum(circuit_breaker_state == 1)
Loki (Logs)
Based on docker/observability/loki/loki-config.yml:
Configuration
Storage:
schema_config:
  configs:
    - from: 2024-01-01
      store: tsdb
      object_store: filesystem
      schema: v13

storage_config:
  tsdb_shipper:
    active_index_directory: /loki/tsdb-index
    cache_location: /loki/tsdb-cache
  filesystem:
    directory: /loki/chunks
Retention:
limits_config:
  retention_period: 720h   # 30 days

compactor:
  retention_enabled: true
  retention_delete_delay: 2h
  compaction_interval: 10m
Ingestion Limits:
limits_config:
  ingestion_rate_mb: 10
  ingestion_burst_size_mb: 20
  per_stream_rate_limit: 5MB
  per_stream_rate_limit_burst: 10MB
Log Labels
Automatically Added by OTel Collector:
- service.name - Service name (e.g., "appserver")
- service.namespace - Namespace (e.g., "appserver")
- level - Log level (debug, info, warn, error)
- deployment.environment - Environment (development, staging, production)
Custom Labels (application-defined):
- app_name - Application name
- user_id - User identifier
- request_id - Request correlation ID
- trace_id - Distributed trace ID
LogQL Queries
Filter by Service:
{service_name="appserver"}
Filter by Level:
{service_name="appserver"} |= "level=error"
Filter by Pattern:
{service_name="appserver"} |~ "circuit.*breaker.*open"
Extract and Count Errors:
sum(rate({service_name="appserver"} |= "level=error" [5m]))
Parse JSON and Filter:
{service_name="appserver"}
| json
| level="error"
| line_format "{{.timestamp}} {{.message}}"
Link Logs to Traces:
{service_name="appserver"} |= "trace_id=abc123"
Structured Logging Best Practices
Use Structured Fields:
logger.Info("User registered",
    telemetry.String("user_id", userID),
    telemetry.String("email", email),
    telemetry.String("trace_id", traceID),
)
Include Context:
- Request ID for correlation
- Trace ID for distributed tracing
- User ID for audit trails
- App name for multi-tenant logs
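One common way to get the trace and span IDs into every log line is to read them from the active span's context. A minimal sketch using the standard library's log/slog rather than the project's telemetry logger (logWithTrace is a hypothetical helper):

import (
    "context"
    "log/slog"

    "go.opentelemetry.io/otel/trace"
)

// logWithTrace adds trace_id and span_id fields when the context carries an
// active span, so Loki log lines can be correlated with Tempo traces.
func logWithTrace(ctx context.Context, logger *slog.Logger, msg string, args ...any) {
    if sc := trace.SpanContextFromContext(ctx); sc.IsValid() {
        args = append(args,
            slog.String("trace_id", sc.TraceID().String()),
            slog.String("span_id", sc.SpanID().String()),
        )
    }
    logger.InfoContext(ctx, msg, args...)
}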
Log Levels:
- DEBUG - Detailed information for development
- INFO - General operational events
- WARN - Warning conditions (recoverable errors)
- ERROR - Error conditions (failed operations)
Tempo (Distributed Tracing)
Based on docker/observability/tempo/tempo-config.yml:
Configuration
Receivers:
distributor:
  receivers:
    otlp:
      protocols:
        grpc:
          endpoint: 0.0.0.0:4317
        http:
          endpoint: 0.0.0.0:4318
Storage:
storage:
  trace:
    backend: local
    local:
      path: /tmp/tempo/blocks

compactor:
  compaction:
    block_retention: 720h   # 30 days
Metrics Generation:
metrics_generator:
  storage:
    remote_write:
      - url: http://prometheus:9090/api/v1/write
        send_exemplars: true
  processors:
    - service-graphs   # Service dependency graphs
    - span-metrics     # RED metrics from spans
Trace Structure
Span Attributes:
- http.method - HTTP method (GET, POST, etc.)
- http.url - Request URL
- http.status_code - Response status
- db.system - Database type (postgresql, redis)
- db.statement - SQL query
- messaging.system - Message broker (rabbitmq)
- error - Boolean error flag
- error.message - Error details
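As an example, a database span might carry these conventions. The following sketch uses plain attribute keys to stay version-agnostic and is not the AppServer's repository code; queryItems is a hypothetical helper:

import (
    "context"
    "database/sql"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/attribute"
    "go.opentelemetry.io/otel/codes"
)

func queryItems(ctx context.Context, db *sql.DB, appID string) (*sql.Rows, error) {
    const stmt = "SELECT * FROM items WHERE app_id = $1"

    ctx, span := otel.Tracer("my-service").Start(ctx, "db.query.items")
    defer span.End()

    // Semantic-convention attributes for database spans.
    span.SetAttributes(
        attribute.String("db.system", "postgresql"),
        attribute.String("db.statement", stmt),
    )

    rows, err := db.QueryContext(ctx, stmt, appID)
    if err != nil {
        span.RecordError(err)
        span.SetStatus(codes.Error, "query failed")
        return nil, err
    }
    return rows, nil
}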
Service Graph:
- Automatically generated from trace data
- Shows service-to-service dependencies
- Visualizes request flows
Example Traces
HTTP Request Trace:
Span: HTTP GET /api/apps/todos/items
├─ Span: Permission Check (OpenFGA)
├─ Span: Database Query (PostgreSQL)
│  └─ Span: SELECT * FROM items WHERE app_id = ?
└─ Span: HTTP Proxy to Backend
   └─ Span: Backend Processing
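The "HTTP Proxy to Backend" child span is typically produced by instrumenting the outgoing HTTP client. A sketch using the otelhttp contrib package (an assumption about the instrumentation approach, not a statement about the AppServer's proxy code; newTracedClient and callBackend are hypothetical):

import (
    "context"
    "net/http"

    "go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp"
)

// newTracedClient returns an http.Client whose requests emit child spans and
// propagate the trace context to the backend via W3C traceparent headers.
func newTracedClient() *http.Client {
    return &http.Client{
        Transport: otelhttp.NewTransport(http.DefaultTransport),
    }
}

func callBackend(ctx context.Context, url string) (*http.Response, error) {
    req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
    if err != nil {
        return nil, err
    }
    // The outgoing-request span becomes a child of the span already in ctx.
    return newTracedClient().Do(req)
}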
Event Publishing Trace:
Span: Publish Event (app.installed)
├─ Span: RabbitMQ Publish
├─ Span: Consumer: Orchestrator
│  ├─ Span: Resolve Dependencies
│  ├─ Span: Pull Docker Image
│  └─ Span: Start Container
└─ Span: Consumer: Permission Invalidation
   └─ Span: Redis DELETE
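For consumer spans to land in the publisher's trace, the trace context has to travel inside the message. A minimal sketch using the OpenTelemetry propagation API over plain message headers (the AppServer's event bus may do this differently; injectTraceContext and extractTraceContext are hypothetical helpers):

import (
    "context"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/propagation"
)

func init() {
    // Use the W3C trace-context propagator for cross-service correlation.
    otel.SetTextMapPropagator(propagation.TraceContext{})
}

// injectTraceContext copies the current trace context into message headers
// (e.g. AMQP headers) before publishing.
func injectTraceContext(ctx context.Context, headers map[string]string) {
    otel.GetTextMapPropagator().Inject(ctx, propagation.MapCarrier(headers))
}

// extractTraceContext restores the publisher's trace context on the consumer
// side so the consumer's spans join the same trace.
func extractTraceContext(ctx context.Context, headers map[string]string) context.Context {
    return otel.GetTextMapPropagator().Extract(ctx, propagation.MapCarrier(headers))
}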
Querying Traces
By Trace ID:
Query: <trace-id>
By Service Name:
{ resource.service.name = "appserver" }
By HTTP Status:
{ span.http.status_code = 500 }
By Duration:
{ duration > 1s }
Complex Query:
{ resource.service.name = "appserver" && span.http.method = "POST" && span.http.status_code = 200 && duration > 500ms }
Grafana (Visualization)
Based on docker/observability/grafana/provisioning/datasources/datasources.yml:
Auto-Provisioned Data Sources
Prometheus (Default):
- URL: http://prometheus:9090
- Query timeout: 60s
- Linked to Tempo for exemplars
Loki:
- URL: http://loki:3100
- Max lines: 1000
- Linked to Tempo for trace correlation
Tempo:
- URL: http://tempo:3200
- Linked to Prometheus for metrics
- Linked to Loki for logs
Access
URL: http://localhost:3000
Default Credentials: admin/admin
WARNING: Change default password in production!
Key Dashboards
AppServer Overview:
- Request rate (RPS)
- Error rate (%)
- Response time (p50, p95, p99)
- Active connections
- Circuit breaker states
Infrastructure Health:
- Database connections and query rate
- Redis cache hit ratio
- RabbitMQ queue depth and consumer count
- Container resource usage (CPU, memory)
Event Bus Monitoring:
- Messages published/consumed per second
- Queue depth by queue name
- Consumer lag
- Failed deliveries
Permission System:
- Permission check rate
- Cache hit ratio (Local, Redis, OpenFGA)
- OpenFGA query latency
- Cache invalidation events
Example Dashboard Panels
Request Rate:
sum(rate(http_requests_total[5m])) by (method, status)
Error Budget:
1 - (
sum(rate(http_requests_total{status=~"5.."}[30d]))
/
sum(rate(http_requests_total[30d]))
)
Cache Hit Ratio:
(
sum(rate(permission_cache_hits{level="local"}[5m]))
+
sum(rate(permission_cache_hits{level="redis"}[5m]))
)
/
sum(rate(permission_cache_requests_total[5m]))
Integration and Correlation
Metrics → Traces (Exemplars)
Prometheus can link metric samples to example traces via exemplars:
rate(http_request_duration_seconds_bucket[5m])
Click on a point → View exemplar trace → Opens trace in Tempo
Logs → Traces
Loki automatically detects trace IDs in logs:
{service_name="appserver"} |= "trace_id"
Click on trace ID → Opens trace in Tempo
Traces → Logs
From a trace span, click "View Logs" → Opens logs in Loki filtered by:
- Time range: ±1 hour from span
- Service name
- Trace ID
Traces → Metrics
From a trace, click "View Metrics" → Opens Prometheus filtered by:
- Time range: ±1 hour from span
- Service name
- HTTP method, status code
Instrumenting Applications
Go Applications
Install Dependencies:
go get go.opentelemetry.io/otel
go get go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc
go get go.opentelemetry.io/otel/exporters/otlp/otlpmetric/otlpmetricgrpc
Initialize Tracer:
import (
    "context"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
    "go.opentelemetry.io/otel/sdk/resource"
    "go.opentelemetry.io/otel/sdk/trace"
    semconv "go.opentelemetry.io/otel/semconv/v1.21.0"
)

func initTracer() (*trace.TracerProvider, error) {
    // Export spans to the OTel Collector over gRPC (plaintext inside the Docker network).
    exporter, err := otlptracegrpc.New(context.Background(),
        otlptracegrpc.WithEndpoint("otel-collector:4317"),
        otlptracegrpc.WithInsecure(),
    )
    if err != nil {
        return nil, err
    }

    // Identify this service in every exported span.
    tp := trace.NewTracerProvider(
        trace.WithBatcher(exporter),
        trace.WithResource(resource.NewWithAttributes(
            semconv.SchemaURL,
            semconv.ServiceNameKey.String("my-service"),
            semconv.ServiceNamespaceKey.String("appserver"),
        )),
    )
    otel.SetTracerProvider(tp)
    return tp, nil
}
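A typical way to wire this into a service entry point, sketched below; shutting the provider down flushes any spans still buffered by the batcher before the process exits:

import (
    "context"
    "log"
    "time"
)

func main() {
    tp, err := initTracer()
    if err != nil {
        log.Fatalf("init tracer: %v", err)
    }
    // Flush buffered spans on exit.
    defer func() {
        ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
        defer cancel()
        if err := tp.Shutdown(ctx); err != nil {
            log.Printf("tracer shutdown: %v", err)
        }
    }()

    // ... start HTTP/gRPC servers here ...
}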
Create Spans:
import "go.opentelemetry.io/otel"
func processRequest(ctx context.Context) error {
tracer := otel.Tracer("my-service")
ctx, span := tracer.Start(ctx, "process_request")
defer span.End()
// Add attributes
span.SetAttributes(
attribute.String("user.id", userID),
attribute.Int("item.count", count),
)
// Add event
span.AddEvent("processing_started")
// Record error
if err != nil {
span.RecordError(err)
span.SetStatus(codes.Error, "processing failed")
return err
}
span.SetStatus(codes.Ok, "success")
return nil
}
TypeScript/Node.js Applications
Install Dependencies:
npm install @opentelemetry/sdk-node
npm install @opentelemetry/auto-instrumentations-node
npm install @opentelemetry/exporter-trace-otlp-grpc
Initialize Instrumentation:
import { NodeSDK } from '@opentelemetry/sdk-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-grpc';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';

const sdk = new NodeSDK({
  traceExporter: new OTLPTraceExporter({
    url: 'http://otel-collector:4317',
  }),
  instrumentations: [getNodeAutoInstrumentations()],
  serviceName: 'my-service',
});

sdk.start();
sdk.start();
Environment Configuration
AppServer Configuration
# Enable telemetry
APPSERVER_METRICS_ENABLED=true
APPSERVER_TRACING_ENABLED=true
APPSERVER_LOG_LEVEL=info
# OTel endpoint
OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4317
OTEL_EXPORTER_OTLP_PROTOCOL=grpc
OTEL_SERVICE_NAME=appserver
OTEL_SERVICE_NAMESPACE=appserver
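When these OTEL_EXPORTER_OTLP_* variables are set, the Go OTLP exporters read them automatically, so the endpoint does not have to be hard-coded (with an http:// endpoint the connection is treated as insecure; verify the exact behavior against your SDK version). A minimal sketch with a hypothetical newTracerProviderFromEnv helper:

import (
    "context"

    "go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
    "go.opentelemetry.io/otel/sdk/trace"
)

func newTracerProviderFromEnv(ctx context.Context) (*trace.TracerProvider, error) {
    // No WithEndpoint: OTEL_EXPORTER_OTLP_ENDPOINT is picked up from the environment.
    exporter, err := otlptracegrpc.New(ctx)
    if err != nil {
        return nil, err
    }
    return trace.NewTracerProvider(trace.WithBatcher(exporter)), nil
}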
Production Configuration
# Reduced log verbosity
APPSERVER_LOG_LEVEL=warn
# Enable all telemetry
APPSERVER_METRICS_ENABLED=true
APPSERVER_TRACING_ENABLED=true
# Deployment environment
DEPLOYMENT_ENVIRONMENT=production
# OTel sampling (reduce overhead)
OTEL_TRACES_SAMPLER=parentbased_traceidratio
OTEL_TRACES_SAMPLER_ARG=0.1 # Sample 10% of traces
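Whether OTEL_TRACES_SAMPLER is honored automatically depends on the language SDK and version; in Go the same parent-based 10% sampling can also be configured explicitly. A sketch with a hypothetical newSampledProvider helper:

import (
    "go.opentelemetry.io/otel/sdk/trace"
)

// newSampledProvider keeps 10% of root traces while always respecting the
// sampling decision of an incoming parent span.
func newSampledProvider(exporter trace.SpanExporter) *trace.TracerProvider {
    return trace.NewTracerProvider(
        trace.WithBatcher(exporter),
        trace.WithSampler(trace.ParentBased(trace.TraceIDRatioBased(0.1))),
    )
}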
Alerting
Prometheus Alerting Rules
Create docker/observability/prometheus/rules/appserver.yml:
groups:
  - name: appserver
    interval: 30s
    rules:
      # High error rate
      - alert: HighErrorRate
        expr: |
          rate(http_requests_total{status=~"5.."}[5m])
          /
          rate(http_requests_total[5m])
          > 0.05
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High error rate on {{ $labels.service }}"
          description: "Error rate is {{ $value | humanizePercentage }}"

      # High latency
      - alert: HighLatency
        expr: |
          histogram_quantile(0.95,
            rate(http_request_duration_seconds_bucket[5m])
          ) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High latency on {{ $labels.service }}"
          description: "P95 latency is {{ $value }}s"
      # Circuit breaker open (aggregated by name so the alert fires per breaker)
      - alert: CircuitBreakerOpen
        expr: |
          sum by (name) (circuit_breaker_state == 1) > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Circuit breaker open for {{ $labels.name }}"
          description: "Circuit breaker has been open for 5 minutes"
      # Database connection pool exhausted
      - alert: DatabaseConnectionPoolExhausted
        expr: |
          pg_stat_database_numbackends / pg_settings_max_connections > 0.9
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Database connection pool nearly exhausted"
          description: "Using {{ $value | humanizePercentage }} of connections"
Loki Alerting Rules
Create docker/observability/loki/rules/appserver.yml:
groups:
  - name: appserver-logs
    interval: 1m
    rules:
      # High error log rate
      - alert: HighErrorLogRate
        expr: |
          sum(rate({service_name="appserver"} |= "level=error" [5m])) > 10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High error log rate"
          description: "More than 10 errors per second"

      # Critical errors
      - alert: CriticalError
        expr: |
          sum(count_over_time({service_name="appserver"} |= "panic" [5m])) > 0
        labels:
          severity: critical
        annotations:
          summary: "Critical error detected"
          description: "Panic detected in logs"
Best Practices
Metrics
✅ DO:
- Use RED metrics (Rate, Errors, Duration) for services
- Use USE metrics (Utilization, Saturation, Errors) for resources
- Add labels for grouping (service, method, status)
- Use histograms for latency (not gauges)
- Monitor circuit breaker states
❌ DON'T:
- Use high-cardinality labels (user IDs, trace IDs)
- Create too many metrics (causes performance issues)
- Skip units in metric names
Logs
✅ DO:
- Use structured logging (JSON)
- Include trace ID for correlation
- Add contextual fields (user_id, request_id)
- Use appropriate log levels
- Log errors with stack traces
❌ DON'T:
- Log sensitive data (passwords, tokens)
- Log at DEBUG level in production
- Use unstructured text logs
Traces
✅ DO:
- Trace external service calls
- Trace database queries
- Add meaningful span names
- Include error details in spans
- Use sampling in high-traffic systems
❌ DON'T:
- Trace every function (too granular)
- Include sensitive data in span attributes
- Skip error recording
Dashboards
✅ DO:
- Create service-level dashboards
- Include SLI/SLO metrics
- Link to related dashboards
- Use consistent time ranges
- Add annotations for deploys
❌ DON'T:
- Create dashboards without a purpose
- Use too many panels (information overload)
- Skip documentation
Troubleshooting
High Cardinality Issues
Symptom: Prometheus out of memory, slow queries
Solution:
# Check cardinality
curl http://localhost:9090/api/v1/status/tsdb
# Identify high-cardinality metrics
curl http://localhost:9090/api/v1/label/__name__/values
# Drop high-cardinality labels (metric_relabel_configs in prometheus.yml)
metric_relabel_configs:
  - action: labeldrop
    regex: "user_id|trace_id|request_id"
Missing Traces
Check OTel Collector:
# View collector logs
docker logs otel-collector
# Check receiver endpoint
curl http://localhost:4318/v1/traces
Check Tempo:
# View Tempo logs
docker logs tempo
# Check ingestion
curl http://localhost:3200/metrics | grep tempo_ingester
Log Ingestion Failures
Check Rate Limits:
# View Loki metrics
curl http://localhost:3100/metrics | grep loki_request
# Increase limits in loki-config.yml
limits_config:
  ingestion_rate_mb: 20
  ingestion_burst_size_mb: 40
Code References
| Component | File | Purpose |
|---|---|---|
| OTel Collector | docker/observability/otel-collector/otel-config.yml | Collector configuration |
| Prometheus | docker/observability/prometheus/prometheus.yml | Metrics scraping |
| Loki | docker/observability/loki/loki-config.yml | Log aggregation |
| Tempo | docker/observability/tempo/tempo-config.yml | Trace storage |
| Grafana | docker/observability/grafana/provisioning/ | Datasources and dashboards |
Related Topics
- Environment Variable Reference - Observability configuration
- Local Development Stack - Running observability stack locally
- Platform Architecture - Infrastructure layer