
Telemetry Stack

Observability infrastructure with metrics, logs, and traces using OpenTelemetry.

Architecture Overview

The telemetry stack follows the OpenTelemetry pattern where all services export to a central collector:

┌─────────────┐   ┌─────────────┐   ┌─────────────┐
│  AppServer  │   │Orchestrator │   │   Browser   │
│    (Go)     │   │  (Node.js)  │   │    (JS)     │
└──────┬──────┘   └──────┬──────┘   └──────┬──────┘
       │                 │                 │
       │    OTLP/HTTP    │                 │
       └────────┬────────┴─────────────────┘
                │
                ▼
      ┌──────────────────┐
      │  OTEL Collector  │
      │   (Receiver +    │
      │   Processor +    │
      │    Exporter)     │
      └────────┬─────────┘
               │
  ┌────────────┼────────────┐
  │            │            │
  ▼            ▼            ▼
┌─────────┐ ┌──────────┐ ┌─────────┐
│  Tempo  │ │Prometheus│ │  Loki   │
│(Traces) │ │(Metrics) │ │ (Logs)  │
└────┬────┘ └────┬─────┘ └────┬────┘
     │           │            │
     └───────────┼────────────┘
                 │
                 ▼
        ┌───────────────┐
        │    Grafana    │
        │(Visualization)│
        └───────────────┘

OpenTelemetry Collector

The OTEL Collector is the central hub for all telemetry data. It receives, processes, and exports telemetry to backend storage systems.

Configuration

Location: docker/observability/otel-collector/otel-config.yml

Receivers

The collector accepts telemetry data via the OTLP protocol:

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
        cors:
          allowed_origins:
            - "http://*"
            - "https://*"

Port   Protocol   Description
4317   gRPC       OTLP gRPC receiver (SDK default)
4318   HTTP       OTLP HTTP receiver (browser-friendly)

Processors

Processors transform and enrich telemetry data:

processors:
  # Batch for efficiency
  batch:
    timeout: 10s
    send_batch_size: 1024

  # Add common attributes
  resource:
    attributes:
      - key: service.namespace
        value: "appserver"
        action: insert
      - key: deployment.environment
        value: "${env:DEPLOYMENT_ENVIRONMENT}"
        action: insert

  # Memory protection
  memory_limiter:
    check_interval: 1s
    limit_percentage: 80

Exporters

Data is exported to specialized backends:

exporters:
  # Traces → Tempo
  otlp/tempo:
    endpoint: "tempo:4317"
    tls:
      insecure: true

  # Metrics → Prometheus
  prometheus:
    endpoint: "0.0.0.0:8889"
    namespace: appserver

  # Logs → Loki
  loki:
    endpoint: "http://loki:3100/loki/api/v1/push"

Pipelines

Pipelines connect receivers → processors → exporters:

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch, resource]
      exporters: [otlp/tempo, logging]

    metrics:
      receivers: [otlp, prometheus]
      processors: [memory_limiter, batch, resource]
      exporters: [prometheus, logging]

    logs:
      receivers: [otlp]
      processors: [memory_limiter, batch, resource]
      exporters: [loki, logging]
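
The pipelines reference a logging exporter and a prometheus receiver that do not appear in the snippets above. A minimal sketch of the assumed definitions follows (the self-scrape target is a guess based on the collector's 8888 metrics port; newer collector releases rename logging to debug):

receivers:
  # Assumed: the collector scrapes its own internal metrics on :8888
  prometheus:
    config:
      scrape_configs:
        - job_name: otel-collector
          static_configs:
            - targets: ["localhost:8888"]

exporters:
  # Prints a telemetry summary to the collector's stdout for debugging
  logging:
    verbosity: basic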

Tempo (Traces)

Tempo is the distributed tracing backend that stores and queries trace data.

Configuration

Location: docker/observability/tempo/tempo-config.yml

Key Settings

server:
  http_listen_port: 3200

distributor:
  receivers:
    otlp:
      protocols:
        grpc:
          endpoint: 0.0.0.0:4317

compactor:
  compaction:
    block_retention: 720h  # 30 days

# Generate metrics from traces
metrics_generator:
  registry:
    external_labels:
      source: tempo
  storage:
    remote_write:
      - url: http://prometheus:9090/api/v1/write
        send_exemplars: true
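
Note that metrics_generator can only push trace-derived metrics if Prometheus has its remote-write receiver enabled. A sketch of the assumed docker-compose service flags (service name and config path are illustrative):

prometheus:
  image: prom/prometheus
  command:
    - --config.file=/etc/prometheus/prometheus.yml
    - --web.enable-remote-write-receiver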

Features

  • TraceQL - Query language for traces, modeled on LogQL and PromQL
  • Service Graph - Automatic service dependency mapping
  • Span Metrics - Generate metrics from trace data
  • Exemplars - Link metrics to specific traces
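
For example, a TraceQL query for slow spans might look like the following (the service name and latency threshold are illustrative):

{ resource.service.name = "appserver" && duration > 500ms }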

Retention

Default retention: 30 days (720h)

Configurable via compactor.compaction.block_retention

Loki (Logs)

Loki aggregates logs from all services, indexing only labels (not log content) for efficiency.

Configuration

Location: docker/observability/loki/loki-config.yml

Key Settings

server:
  http_listen_port: 3100

# Storage configuration
storage_config:
  tsdb_shipper:
    active_index_directory: /loki/tsdb-index
  filesystem:
    directory: /loki/chunks

# Retention and limits
limits_config:
  retention_period: 720h  # 30 days
  ingestion_rate_mb: 10
  max_entries_limit_per_query: 10000

Features

  • LogQL - Query language for logs
  • Label indexing - Efficient filtering by service, level, etc.
  • Derived fields - Extract trace IDs from logs
  • Alerting - Rule-based log alerting
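
For example, a LogQL query for error logs might look like the following (the label name depends on how the collector's loki exporter maps resource attributes):

{service_name="appserver"} |= "error"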

Retention

Default retention: 30 days (720h)

Enforced by the compactor component.
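
Retention enforcement assumes a compactor block along these lines in loki-config.yml (a sketch; the working directory and delete store are illustrative):

compactor:
  working_directory: /loki/compactor
  retention_enabled: true
  delete_request_store: filesystem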

Prometheus (Metrics)

Prometheus stores time-series metrics data.

Configuration

Location: docker/observability/prometheus/prometheus.yml

Key Settings

# Scrape configuration
scrape_configs:
  - job_name: 'otel-collector'
    static_configs:
      - targets: ['otel-collector:8889']

# Retention is not set in prometheus.yml; it is passed as command-line flags:
#   --storage.tsdb.retention.time=15d
#   --storage.tsdb.retention.size=10GB

Metrics Sources

Source            Endpoint         Description
OTEL Collector    :8889            Application metrics from services
Tempo             :3200/metrics    Trace-derived metrics
Self-monitoring   :9090/metrics    Prometheus internal metrics
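
Because the collector's Prometheus exporter sets namespace: appserver, application metrics arrive with an appserver_ prefix. A hypothetical rate query (the metric name is illustrative, not confirmed from this stack):

rate(appserver_http_server_duration_seconds_count[5m])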

Retention

Default retention: 15 days or 10GB (whichever comes first)

Grafana (Visualization)

Grafana provides unified visualization for all telemetry data.

Configuration

Environment variables in docker-compose.yml:

environment:
  GF_SECURITY_ADMIN_USER: admin
  GF_SECURITY_ADMIN_PASSWORD: admin
  GF_FEATURE_TOGGLES_ENABLE: "traceqlEditor,correlations"

Pre-Provisioned Datasources

Location: docker/observability/grafana/provisioning/datasources/

Datasource   Type         URL                      Default
Prometheus   prometheus   http://prometheus:9090   Yes
Loki         loki         http://loki:3100         No
Tempo        tempo        http://tempo:3200        No

Cross-Signal Correlation

Grafana is configured for automatic linking between signals:

  • Logs → Traces: Derived fields extract trace_id from logs
  • Traces → Logs: Tempo links to Loki with trace ID filter
  • Metrics → Traces: Exemplars link metrics to specific traces
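
The Logs → Traces link is typically declared on the Loki datasource in the provisioning directory listed above. A sketch, assuming JSON log lines carrying a trace_id field and a Tempo datasource UID of tempo:

apiVersion: 1
datasources:
  - name: Loki
    type: loki
    url: http://loki:3100
    jsonData:
      derivedFields:
        # Extract trace_id from the log line and link it to Tempo;
        # $$ escapes Grafana's env-var interpolation in provisioning files
        - name: TraceID
          matcherRegex: '"trace_id":"(\w+)"'
          url: '$${__value.raw}'
          datasourceUid: tempo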

Docker Compose Services

Service Ports

Service          Internal    External    Description
otel-collector   4317/4318   4317/4318   OTLP receivers
otel-collector   8888        8888        Collector metrics
otel-collector   8889        8889        Prometheus exporter
prometheus       9090        9090        Prometheus UI/API
loki             3100        3100        Loki API
tempo            3200        3200        Tempo API
grafana          3000        3000        Grafana UI

Resource Requirements

Minimum recommended resources for development:

Service          CPU    Memory
otel-collector   0.5    512MB
prometheus       0.5    1GB
loki             0.5    512MB
tempo            0.5    512MB
grafana          0.25   256MB
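
These limits can be enforced per service with docker-compose's deploy.resources (honored by Compose v2 and Swarm); a sketch for one service, using the values above:

otel-collector:
  deploy:
    resources:
      limits:
        cpus: "0.5"
        memory: 512M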

Scaling Considerations

Development (Single Instance)

The default Docker Compose configuration runs all services as single instances, suitable for development and testing.

Production

For production deployments, consider:

  1. OTEL Collector

    • Run multiple collector instances behind a load balancer
    • Use the loadbalancing exporter for high availability (see the sketch after this list)
  2. Prometheus

    • Use Thanos or Cortex for long-term storage
    • Configure remote write to external storage
  3. Loki

    • Deploy in microservices mode for scalability
    • Use object storage (S3, GCS) for chunks
  4. Tempo

    • Deploy in distributed mode
    • Use object storage for trace blocks
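
For item 1, the loadbalancing exporter routes spans by trace ID so each trace lands on a single downstream collector. A minimal sketch with a static resolver (hostnames are illustrative):

exporters:
  loadbalancing:
    routing_key: traceID
    protocol:
      otlp:
        tls:
          insecure: true
    resolver:
      static:
        hostnames:
          - collector-1:4317
          - collector-2:4317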

Troubleshooting

No Data in Grafana

  1. Check OTEL Collector is running:

    docker logs appserver-otel-collector
  2. Verify services can reach the collector:

    curl http://localhost:4318/v1/traces
  3. Check pipeline configuration in otel-config.yml
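
The curl in step 2 issues a GET, so an HTTP 405 response is expected and still confirms reachability. To push a real test span over OTLP/HTTP (the trace/span IDs and timestamps are arbitrary):

curl -s http://localhost:4318/v1/traces \
  -H 'Content-Type: application/json' \
  -d '{
    "resourceSpans": [{
      "resource": {
        "attributes": [
          { "key": "service.name", "value": { "stringValue": "smoke-test" } }
        ]
      },
      "scopeSpans": [{
        "spans": [{
          "traceId": "5b8efff798038103d269b633813fc60c",
          "spanId": "eee19b7ec3c1b174",
          "name": "smoke-test-span",
          "kind": 1,
          "startTimeUnixNano": "1700000000000000000",
          "endTimeUnixNano": "1700000001000000000"
        }]
      }]
    }]
  }'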

High Memory Usage

  1. Reduce batch sizes in OTEL Collector
  2. Adjust the memory_limiter thresholds in the processor config (lowering limit_percentage makes the collector shed load sooner)
  3. Reduce retention periods

Missing Traces

  1. Verify tracesEnabled: true in service config
  2. Check sampling rate (1.0 = all traces)
  3. Confirm Tempo is receiving data:
    curl http://localhost:3200/ready

Missing Logs

  1. Check Loki is healthy:

    curl http://localhost:3100/ready
  2. Verify log labels in Grafana Explore

  3. Check OTEL Collector logs for export errors