# Telemetry Stack
Observability infrastructure with metrics, logs, and traces using OpenTelemetry.
## Architecture Overview
The telemetry stack follows the OpenTelemetry pattern where all services export to a central collector:
```
┌─────────────┐    ┌─────────────┐    ┌─────────────┐
│  AppServer  │    │ Orchestrator│    │   Browser   │
│    (Go)     │    │  (Node.js)  │    │    (JS)     │
└──────┬──────┘    └──────┬──────┘    └──────┬──────┘
       │                  │                  │
       │     OTLP/HTTP    │                  │
       └──────────────────┼──────────────────┘
                          │
                          ▼
                ┌──────────────────┐
                │  OTEL Collector  │
                │  (Receiver +     │
                │   Processor +    │
                │   Exporter)      │
                └────────┬─────────┘
                         │
            ┌────────────┼────────────┐
            │            │            │
            ▼            ▼            ▼
      ┌─────────┐  ┌──────────┐  ┌─────────┐
      │  Tempo  │  │Prometheus│  │  Loki   │
      │(Traces) │  │(Metrics) │  │ (Logs)  │
      └────┬────┘  └────┬─────┘  └────┬────┘
           │            │             │
           └────────────┼─────────────┘
                        │
                        ▼
               ┌───────────────┐
               │    Grafana    │
               │(Visualization)│
               └───────────────┘
```
## OpenTelemetry Collector

The OTEL Collector is the central hub for all telemetry data. It receives, processes, and exports telemetry to the backend storage systems.

### Configuration

**Location:** `docker/observability/otel-collector/otel-config.yml`

### Receivers

The collector accepts telemetry data via the OTLP protocol:
```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
        cors:
          allowed_origins:
            - "http://*"
            - "https://*"
```
| Port | Protocol | Description |
|---|---|---|
| 4317 | gRPC | OTLP gRPC receiver (SDK default) |
| 4318 | HTTP | OTLP HTTP receiver (browser-friendly) |
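Because the HTTP receiver speaks OTLP/JSON, it can be smoke-tested without any SDK. The sketch below posts a single hand-built span to port 4318; the trace/span IDs and timestamps are arbitrary placeholder values, and the default local port mapping is assumed:

```bash
# Send one synthetic span to the OTLP/HTTP receiver.
curl -s -X POST http://localhost:4318/v1/traces \
  -H "Content-Type: application/json" \
  -d '{
    "resourceSpans": [{
      "resource": {
        "attributes": [{
          "key": "service.name",
          "value": { "stringValue": "smoke-test" }
        }]
      },
      "scopeSpans": [{
        "spans": [{
          "traceId": "5b8efff798038103d269b633813fc60c",
          "spanId": "eee19b7ec3c1b174",
          "name": "smoke-test-span",
          "kind": 1,
          "startTimeUnixNano": "1700000000000000000",
          "endTimeUnixNano": "1700000001000000000"
        }]
      }]
    }]
  }'
```

An HTTP 200 with an empty or `partialSuccess` JSON body means the receiver accepted the span; it should then reach Tempo after the batch processor's 10s timeout at the latest.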
### Processors
Processors transform and enrich telemetry data:
```yaml
processors:
  # Batch for efficiency
  batch:
    timeout: 10s
    send_batch_size: 1024

  # Add common attributes
  resource:
    attributes:
      - key: service.namespace
        value: "appserver"
        action: insert
      - key: deployment.environment
        value: "${env:DEPLOYMENT_ENVIRONMENT}"
        action: insert

  # Memory protection
  memory_limiter:
    check_interval: 1s
    limit_percentage: 80
```
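The collector reports on its own processors through the internal metrics endpoint on port 8888 (see the service-ports table below), which is a quick way to watch the batch processor at work. Exact metric names vary slightly between collector versions:

```bash
# Internal collector telemetry: batch sizes and whether flushes were
# triggered by send_batch_size or by the 10s timeout.
curl -s http://localhost:8888/metrics | grep otelcol_processor_batch
```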
### Exporters
Data is exported to specialized backends:
```yaml
exporters:
  # Traces → Tempo
  otlp/tempo:
    endpoint: "tempo:4317"
    tls:
      insecure: true

  # Metrics → Prometheus
  prometheus:
    endpoint: "0.0.0.0:8889"
    namespace: appserver

  # Logs → Loki
  loki:
    endpoint: "http://loki:3100/loki/api/v1/push"

  # Debug output to the collector's own log (referenced by all pipelines below)
  logging:
```
### Pipelines

Pipelines connect receivers → processors → exporters (the `prometheus` receiver referenced by the metrics pipeline is defined elsewhere in the full `otel-config.yml`):
```yaml
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch, resource]
      exporters: [otlp/tempo, logging]
    metrics:
      receivers: [otlp, prometheus]
      processors: [memory_limiter, batch, resource]
      exporters: [prometheus, logging]
    logs:
      receivers: [otlp]
      processors: [memory_limiter, batch, resource]
      exporters: [loki, logging]
```
## Tempo (Traces)

Tempo is the distributed tracing backend that stores and queries trace data.

### Configuration

**Location:** `docker/observability/tempo/tempo-config.yml`

### Key Settings
```yaml
server:
  http_listen_port: 3200

distributor:
  receivers:
    otlp:
      protocols:
        grpc:
          endpoint: 0.0.0.0:4317

compactor:
  compaction:
    block_retention: 720h  # 30 days

# Generate metrics from traces
metrics_generator:
  registry:
    external_labels:
      source: tempo
  storage:
    remote_write:
      - url: http://prometheus:9090/api/v1/write
        send_exemplars: true
```
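One non-obvious dependency here: for the `metrics_generator` output to be usable, Prometheus must accept remote writes and store exemplars, both of which are opt-in command-line flags rather than config-file settings. A sketch, assuming the stock Prometheus image:

```bash
# Flags Prometheus needs to accept Tempo's metrics_generator output
# (typically set via the container's command in docker-compose.yml):
prometheus \
  --config.file=/etc/prometheus/prometheus.yml \
  --web.enable-remote-write-receiver \
  --enable-feature=exemplar-storage
```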
### Features

- TraceQL - Query language for traces, with PromQL/LogQL-style selectors (see the example after this list)
- Service Graph - Automatic service dependency mapping
- Span Metrics - Generate metrics from trace data
- Exemplars - Link metrics to specific traces
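TraceQL can also be run outside Grafana, against Tempo's search API. A hedged example, assuming Tempo 2.x and a service named `appserver` emitting spans:

```bash
# Find up to 5 recent appserver traces slower than 100ms.
curl -sG http://localhost:3200/api/search \
  --data-urlencode 'q={ resource.service.name = "appserver" && duration > 100ms }' \
  --data-urlencode 'limit=5'
```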
### Retention

Default retention: 30 days (`720h`), configurable via `compactor.compaction.block_retention`.
## Loki (Logs)

Loki aggregates logs from all services with efficient indexing.

### Configuration

**Location:** `docker/observability/loki/loki-config.yml`

### Key Settings
```yaml
server:
  http_listen_port: 3100

# Storage configuration
storage_config:
  tsdb_shipper:
    active_index_directory: /loki/tsdb-index
  filesystem:
    directory: /loki/chunks

# Retention and limits
limits_config:
  retention_period: 720h  # 30 days
  ingestion_rate_mb: 10
  max_entries_limit_per_query: 10000
```
### Features

- LogQL - Query language for logs (see the example after this list)
- Label indexing - Efficient filtering by service, level, etc.
- Derived fields - Extract trace IDs from logs
- Alerting - Rule-based log alerting
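The same LogQL you would type in Grafana Explore works against Loki's HTTP API directly. The `service_name` label below is an assumption; list the actual labels first:

```bash
# What labels did the collector actually attach?
curl -s http://localhost:3100/loki/api/v1/labels

# Last hour of appserver logs containing "error" (label name assumed).
curl -sG http://localhost:3100/loki/api/v1/query_range \
  --data-urlencode 'query={service_name="appserver"} |= "error"' \
  --data-urlencode 'since=1h'
```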
### Retention

Default retention: 30 days (`720h`), enforced by the compactor component.
## Prometheus (Metrics)

Prometheus stores time-series metrics data.

### Configuration

**Location:** `docker/observability/prometheus/prometheus.yml`

### Key Settings
```yaml
# Scrape configuration
scrape_configs:
  - job_name: 'otel-collector'
    static_configs:
      - targets: ['otel-collector:8889']
```

Retention is not set in `prometheus.yml`; Prometheus only accepts it as command-line flags, typically passed via the container's `command:` in `docker-compose.yml`:

```
--storage.tsdb.retention.time=15d
--storage.tsdb.retention.size=10GB
```
### Metrics Sources
| Source | Endpoint | Description |
|---|---|---|
| OTEL Collector | :8889 | Application metrics from services |
| Tempo | :3200/metrics | Trace-derived metrics |
| Self-monitoring | :9090/metrics | Prometheus internal metrics |
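A quick way to confirm these sources are being scraped is to query Prometheus directly; `up` is generated by Prometheus itself, one series per scrape target, with value 1 for a healthy scrape. The second query is a hedged example whose metric name is assumed; substitute a name you actually saw on the `:8889` endpoint:

```bash
# One "up" sample per scrape target (1 = scrape succeeded).
curl -sG http://localhost:9090/api/v1/query --data-urlencode 'query=up'

# Example application query; "appserver_http_requests_total" is a
# hypothetical metric name (the appserver_ prefix comes from the
# collector's prometheus exporter namespace).
curl -sG http://localhost:9090/api/v1/query \
  --data-urlencode 'query=sum by (job) (rate(appserver_http_requests_total[5m]))'
```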
### Retention

Default retention: 15 days or 10GB, whichever limit is reached first (set via the flags shown above).
## Grafana (Visualization)

Grafana provides unified visualization for all telemetry data.

### Configuration

Environment variables in `docker-compose.yml`:

```yaml
environment:
  GF_SECURITY_ADMIN_USER: admin        # development defaults;
  GF_SECURITY_ADMIN_PASSWORD: admin    # change outside local use
  GF_FEATURE_TOGGLES_ENABLE: "traceqlEditor,correlations"
```
### Pre-Provisioned Datasources

**Location:** `docker/observability/grafana/provisioning/datasources/`
| Datasource | Type | URL | Default |
|---|---|---|---|
| Prometheus | prometheus | http://prometheus:9090 | Yes |
| Loki | loki | http://loki:3100 | No |
| Tempo | tempo | http://tempo:3200 | No |
### Cross-Signal Correlation

Grafana is configured for automatic linking between signals:

- Logs → Traces: Derived fields extract `trace_id` from log lines (see the provisioning sketch after this list)
- Traces → Logs: Tempo links to Loki with a trace-ID filter
- Metrics → Traces: Exemplars link metrics to specific traces
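The Logs → Traces link lives on the Loki datasource itself. Below is a minimal sketch of such a provisioning file; the file name, the regex (which must match how your services print trace IDs), and the `tempo` datasource UID are all assumptions to adapt:

```bash
# Hypothetical provisioning file: a Loki datasource whose derived field
# turns trace_id values in log lines into links to the Tempo datasource.
cat > docker/observability/grafana/provisioning/datasources/loki.yml <<'EOF'
apiVersion: 1
datasources:
  - name: Loki
    type: loki
    access: proxy
    url: http://loki:3100
    jsonData:
      derivedFields:
        - name: TraceID
          matcherRegex: 'trace_id=(\w+)'   # must match the log line format
          datasourceUid: tempo             # UID of the provisioned Tempo datasource
          url: '$${__value.raw}'           # $$ escapes Grafana's env-var expansion
EOF
```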
## Docker Compose Services

### Service Ports
| Service | Internal | External | Description |
|---|---|---|---|
| otel-collector | 4317/4318 | 4317/4318 | OTLP receivers |
| otel-collector | 8888 | 8888 | Collector metrics |
| otel-collector | 8889 | 8889 | Prometheus exporter |
| prometheus | 9090 | 9090 | Prometheus UI/API |
| loki | 3100 | 3100 | Loki API |
| tempo | 3200 | 3200 | Tempo API |
| grafana | 3000 | 3000 | Grafana UI |
### Resource Requirements
Minimum recommended resources for development:
| Service | CPU (cores) | Memory |
|---|---|---|
| otel-collector | 0.5 | 512MB |
| prometheus | 0.5 | 1GB |
| loki | 0.5 | 512MB |
| tempo | 0.5 | 512MB |
| grafana | 0.25 | 256MB |
## Scaling Considerations

### Development (Single Instance)

The default Docker Compose configuration runs all services as single instances, suitable for development and testing.

### Production

For production deployments, consider:
- OTEL Collector
  - Run multiple collector instances behind a load balancer
  - Use the `loadbalancing` exporter for high availability
- Prometheus
  - Use Thanos or Cortex for long-term storage
  - Configure remote write to external storage
- Loki
  - Deploy in microservices mode for scalability
  - Use object storage (S3, GCS) for chunks
- Tempo
  - Deploy in distributed mode
  - Use object storage for trace blocks
## Troubleshooting

### No Data in Grafana

- Check the OTEL Collector is running: `docker logs appserver-otel-collector`
- Verify services can reach the collector: `curl http://localhost:4318/v1/traces`
- Check the pipeline configuration in `otel-config.yml`
### High Memory Usage

- Reduce batch sizes in the OTEL Collector config
- Tune the `memory_limiter` processor (e.g. lower `limit_percentage`) or give the container more memory
- Reduce retention periods
### Missing Traces

- Verify `tracesEnabled: true` in the service config
- Check the sampling rate (1.0 = all traces)
- Confirm Tempo is receiving data: `curl http://localhost:3200/ready`
### Missing Logs

- Check Loki is healthy: `curl http://localhost:3100/ready`
- Verify log labels in Grafana Explore
- Check the OTEL Collector logs for export errors
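When several of these symptoms appear at once, sweeping every health endpoint in one pass narrows things down quickly. A small sketch, assuming the default local port mapping from the service-ports table:

```bash
#!/usr/bin/env bash
# Print an HTTP status per backend (000 means unreachable).
set -u
check() {
  printf '%-16s %s\n' "$1" "$(curl -s -o /dev/null -w '%{http_code}' "$2")"
}
check grafana        http://localhost:3000/api/health
check prometheus     http://localhost:9090/-/ready
check loki           http://localhost:3100/ready
check tempo          http://localhost:3200/ready
check otel-collector http://localhost:8888/metrics
```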