Multi-Instance Telemetry
Architecture and configuration guide for aggregating telemetry from multiple AppServer deployments into a central monitoring system.
Overview
When running AppServer for multiple customers, each customer typically gets their own isolated deployment. This guide explains how to:
- Aggregate telemetry from all instances to a central location
- Identify each instance in the telemetry data
- Create dashboards that support both global and per-instance views
- Secure the telemetry data flow between instances
Architecture
Recommended: Hierarchical Collectors
Each customer instance keeps its own OTEL Collector, which forwards to a central collector for aggregation.
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│ Instance A │ │ Instance B │ │ Instance C │
│ ┌───────────┐ │ │ ┌───────────┐ │ │ ┌───────────┐ │
│ │ AppServer │ │ │ │ AppServer │ │ │ │ AppServer │ │
│ │ Node Apps │ │ │ │ Node Apps │ │ │ │ Node Apps │ │
│ │ Browser │ │ │ │ Browser │ │ │ │ Browser │ │
│ └─────┬─────┘ │ │ └─────┬─────┘ │ │ └─────┬─────┘ │
│ ▼ │ │ ▼ │ │ ▼ │
│ ┌───────────┐ │ │ ┌───────────┐ │ │ ┌───────────┐ │
│ │ Instance │ │ │ │ Instance │ │ │ │ Instance │ │
│ │ OTEL │ │ │ │ OTEL │ │ │ │ OTEL │ │
│ │ Collector │ │ │ │ Collector │ │ │ │ Collector │ │
│ │ │ │ │ │ │ │ │ │ │ │
│ │ +instance │ │ │ │ +instance │ │ │ │ +instance │ │
│ │ .id=A │ │ │ │ .id=B │ │ │ │ .id=C │ │
│ └─────┬─────┘ │ │ └─────┬─────┘ │ │ └─────┬─────┘ │
│ │ │ │ │ │ │ │ │
│ ▼ │ │ ▼ │ │ ▼ │
│ ┌───────────┐ │ │ ┌───────────┐ │ │ ┌───────────┐ │
│ │ Local │ │ │ │ Local │ │ │ │ Local │ │
│ │ Backends │ │ │ │ Backends │ │ │ │ Backends │ │
│ │ (optional)│ │ │ │ (optional)│ │ │ │ (optional)│ │
│ └───────────┘ │ │ └───────────┘ │ │ └───────────┘ │
└────────┼────────┘ └────────┼────────┘ └────────┼────────┘
│ │ │
└────────────────────┼────────────────────┘
▼
┌─────────────────────┐
│ Central OTEL │
│ Collector │
│ │
│ (aggregates all │
│ instances) │
└─────────┬───────────┘
│
┌───────────────────┼───────────────────┐
▼ ▼ ▼
┌──────────┐ ┌──────────┐ ┌──────────┐
│ Tempo │ │Prometheus│ │ Loki │
│ (traces) │ │ (metrics)│ │ (logs) │
└──────────┘ └──────────┘ └──────────┘
│
▼
┌──────────┐
│ Grafana │
│ (central)│
└──────────┘
Benefits:
| Benefit | Description |
|---|---|
| Local observability | Each instance can optionally have local backends for on-site debugging |
| Resilience | Instance collectors buffer data during central collector outages |
| Isolation | Instance issues don't affect other instances' telemetry |
| Flexibility | Easy to add/remove instances without affecting the central system |
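In configuration terms, the hierarchy amounts to each instance collector keeping its local exporters and adding one otlp exporter that points at the central collector. A minimal sketch (the full, annotated configuration appears under Configuration below):

exporters:
  otlp/tempo:        # local backend (optional)
    endpoint: "tempo:4317"
  otlp/central:      # forwards everything to the central collector
    endpoint: "${env:CENTRAL_OTEL_ENDPOINT}"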
Alternative: Direct Connection
For simpler deployments, instances can connect directly to the central collector:
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│ Instance A │ │ Instance B │ │ Instance C │
│ ┌───────────┐ │ │ ┌───────────┐ │ │ ┌───────────┐ │
│ │ Services │──┼──┼──│ Services │──┼──┼──│ Services │ │
│ │ +instance │ │ │ │ +instance │ │ │ │ +instance │ │
│ │ .id=A │ │ │ │ .id=B │ │ │ │ .id=C │ │
│ └───────────┘ │ │ └───────────┘ │ │ └───────────┘ │
└────────┼────────┘ └────────┼────────┘ └────────┼────────┘
│ │ │
└────────────────────┼────────────────────┘
▼
┌─────────────────────┐
│ Central OTEL │
│ Collector │
└─────────────────────┘
Use direct connection when:
- You don't need local observability
- Network latency to central collector is low
- Simplicity is preferred over resilience
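With no instance collector in the path, each service's SDK must carry the instance attributes itself and export straight to the central collector. A minimal docker-compose-style sketch using the standard OpenTelemetry SDK environment variables (the hostname and attribute values are the placeholders used elsewhere in this guide; authentication is covered under Security):

services:
  appserver:
    environment:
      # Standard OTel SDK variables: send OTLP directly to the central collector
      - OTEL_EXPORTER_OTLP_ENDPOINT=https://central-otel.company.com:4318
      # Tag everything emitted by this deployment
      - OTEL_RESOURCE_ATTRIBUTES=instance.id=customer-acme-prod,customer.id=acme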
Collector Communication
Protocol: OTLP (OpenTelemetry Protocol)
Instance collectors communicate with the central collector using OTLP, the native OpenTelemetry protocol. OTLP supports both gRPC and HTTP transports:
| Transport | Port | Use Case |
|---|---|---|
| gRPC | 4317 | Preferred for server-to-server communication (efficient, streaming) |
| HTTP | 4318 | When gRPC is blocked by firewalls or proxies |
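On the instance side, the transport choice maps directly to which exporter is configured. A minimal sketch of both variants (only one is needed; the hostname is the placeholder used elsewhere in this guide):

exporters:
  # gRPC transport on port 4317 (preferred)
  otlp/central:
    endpoint: "central-otel.company.com:4317"
  # HTTP transport on port 4318 (when gRPC is blocked by firewalls or proxies)
  otlphttp/central:
    endpoint: "https://central-otel.company.com:4318"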
Data Flow
┌─────────────────────────────────────────────────────────────────────────┐
│ Instance OTEL Collector │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ 1. RECEIVE 2. PROCESS 3. EXPORT │
│ ┌─────────────┐ ┌─────────────────┐ ┌─────────────────────┐ │
│ │ OTLP │ │ memory_limiter │ │ Local Backends │ │
│ │ Receiver │────▶│ batch │────▶│ (Tempo, Prometheus, │ │
│ │ (:4317/4318)│ │ resource │ │ Loki) │ │
│ └─────────────┘ │ (+instance.id) │ └─────────────────────┘ │
│ └────────┬────────┘ │ │
│ │ │ │
│ │ ┌──────────────┘ │
│ ▼ ▼ │
│ ┌─────────────────────┐ │
│ │ OTLP Exporter │ │
│ │ (otlp/central) │ │
│ │ │ │
│ │ - Sending Queue │ │
│ │ - Retry on Failure │ │
│ │ - TLS + Auth │ │
│ └──────────┬──────────┘ │
│ │ │
└─────────────────────────────────┼──────────────────────────────────────┘
│
│ OTLP/gRPC or OTLP/HTTP
│ (traces, metrics, logs)
▼
┌─────────────────────────────────────────────────────────────────────────┐
│ Central OTEL Collector │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ 1. RECEIVE 2. PROCESS 3. EXPORT │
│ ┌─────────────┐ ┌─────────────────┐ ┌─────────────────────┐ │
│ │ OTLP │ │ memory_limiter │ │ Central Backends │ │
│ │ Receiver │────▶│ batch │────▶│ (Tempo, Prometheus, │ │
│ │ (:4317/4318)│ │ │ │ Loki) │ │
│ └─────────────┘ └─────────────────┘ └─────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Processing Pipeline
The instance collector processes telemetry in this order:
- Receive: OTLP receiver accepts traces, metrics, and logs from services
- Memory Limit: Prevents out-of-memory by dropping data if limits are exceeded
- Batch: Groups telemetry data for efficient export (reduces connections)
- Resource: Adds instance.id and instance.name attributes to ALL data
- Export: Sends to local backends AND the central collector (in parallel)
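Expressed as a collector pipeline definition, that ordering looks roughly like this (a sketch only; exporter names follow the conventions used later in this guide):

service:
  pipelines:
    traces:
      receivers: [otlp]                              # receive
      processors: [memory_limiter, batch, resource]  # limit memory, batch, tag with instance.id
      exporters: [otlp/tempo, otlp/central]          # local backend and central collector in parallel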
Buffering and Reliability
The OTLP exporter to the central collector includes reliability features:
exporters:
  otlp/central:
    endpoint: "${env:CENTRAL_OTEL_ENDPOINT}"
    sending_queue:
      enabled: true
      num_consumers: 10      # Parallel export workers
      queue_size: 1000       # Buffer up to 1000 batches
    retry_on_failure:
      enabled: true
      initial_interval: 5s   # First retry after 5s
      max_interval: 30s      # Max backoff between retries
      max_elapsed_time: 300s # Give up after 5 minutes
What happens during central collector outage:
- Instance collector queues data locally (up to queue_size batches)
- Retries with exponential backoff
- If the queue fills, new data is dropped (local backends still receive data)
- When central collector recovers, queued data is sent
Telemetry Signals
All three telemetry signals use the same communication path:
| Signal | Content | Central Backend |
|---|---|---|
| Traces | Distributed request traces with spans | Tempo |
| Metrics | Counters, histograms, gauges | Prometheus |
| Logs | Structured JSON log entries | Loki |
Each signal carries the instance.id attribute, enabling filtering in the central Grafana.
Network Requirements
| Direction | Ports | Protocol | Notes |
|---|---|---|---|
| Instance → Central | 4317 (gRPC) or 4318 (HTTP) | OTLP | TLS recommended |
| Central → Backends | Various | Backend-specific | Internal network |
Firewall rules for instance collector:
# Outbound to central collector
ALLOW TCP instance:* → central:4317 (gRPC)
ALLOW TCP instance:* → central:4318 (HTTP)
Instance Identification
Resource Attributes
Each instance adds identifying attributes to all telemetry:
| Attribute | Description | Example |
|---|---|---|
| instance.id | Unique identifier | customer-acme-prod |
| instance.name | Human-readable name | ACME Corp Production |
| customer.id | Customer identifier | acme |
These attributes are added by the OTEL Collector's resource processor, ensuring ALL telemetry (traces, metrics, logs) from that instance is tagged.
Where Attributes Are Added
Instance OTEL Collector (recommended):
The resource processor adds attributes to all incoming telemetry:
processors:
  resource:
    attributes:
      - key: instance.id
        value: "${env:INSTANCE_ID}"
        action: insert
      - key: instance.name
        value: "${env:INSTANCE_NAME}"
        action: insert
Application Level (alternative):
Use ExtraAttributes in Go telemetry config:
cfg := &telemetry.Config{
    ServiceName: "appserver",
    ExtraAttributes: map[string]string{
        "instance.id":   os.Getenv("INSTANCE_ID"),
        "instance.name": os.Getenv("INSTANCE_NAME"),
    },
}
Configuration
Instance Collector Setup
- Set environment variables in your deployment:
# Instance identification
INSTANCE_ID=customer-acme-prod
INSTANCE_NAME="ACME Corp Production"
# Central collector connection (optional)
CENTRAL_OTEL_ENDPOINT=https://central-otel.company.com:4317
CENTRAL_OTEL_TOKEN=<secure-token>
- Update the OTEL collector config to add instance attributes:
The instance collector config at docker/observability/otel-collector/otel-config.yml already includes:
processors:
  resource:
    attributes:
      - key: service.namespace
        value: "appserver"
        action: insert
      - key: deployment.environment
        value: "${env:DEPLOYMENT_ENVIRONMENT}"
        action: insert
      # Instance identification
      - key: instance.id
        value: "${env:INSTANCE_ID}"
        action: insert
      - key: instance.name
        value: "${env:INSTANCE_NAME}"
        action: insert
- Enable central forwarding (for hierarchical setup):
Uncomment the otlp/central exporter and add it to pipelines:
exporters:
  otlp/central:
    endpoint: "${env:CENTRAL_OTEL_ENDPOINT}"
    tls:
      insecure: false
    headers:
      Authorization: "Bearer ${env:CENTRAL_OTEL_TOKEN}"

service:
  pipelines:
    traces:
      exporters: [otlp/tempo, otlp/central, logging]
    metrics:
      exporters: [prometheus, otlp/central, logging]
    logs:
      exporters: [loki, otlp/central, logging]
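Putting the steps together, the instance collector's container needs the identification variables at runtime. A docker-compose-style sketch (image, mount path, and port mappings mirror the central example below; adjust to your deployment):

services:
  otel-collector:
    image: otel/opentelemetry-collector-contrib:latest
    volumes:
      - ./otel-collector/otel-config.yml:/etc/otelcol/config.yaml
    environment:
      - INSTANCE_ID=customer-acme-prod
      - INSTANCE_NAME=ACME Corp Production
      - DEPLOYMENT_ENVIRONMENT=production
      - CENTRAL_OTEL_ENDPOINT=https://central-otel.company.com:4317
      - CENTRAL_OTEL_TOKEN=${CENTRAL_OTEL_TOKEN}
    ports:
      - "4317:4317"
      - "4318:4318"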
Central Collector Setup
Deploy the central collector using the config at docker/observability/central-collector/otel-config.yml:
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  memory_limiter:
    check_interval: 1s
    limit_percentage: 80
    spike_limit_percentage: 25
  batch:
    timeout: 10s
    send_batch_size: 1024
    send_batch_max_size: 2048

exporters:
  prometheus:
    endpoint: "0.0.0.0:8889"
    namespace: appserver
    resource_to_telemetry_conversion:
      enabled: true # Exposes instance.id as a metric label
  loki:
    endpoint: "http://loki:3100/loki/api/v1/push"
  otlp/tempo:
    endpoint: "tempo:4317"
    tls:
      insecure: true

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp/tempo]
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [prometheus]
    logs:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [loki]
Docker Compose Example
Central infrastructure (docker-compose.central.yml):
version: '3.8'

services:
  otel-collector:
    image: otel/opentelemetry-collector-contrib:latest
    volumes:
      - ./central-collector/otel-config.yml:/etc/otelcol/config.yaml
    ports:
      - "4317:4317"   # gRPC
      - "4318:4318"   # HTTP
      - "8889:8889"   # Prometheus metrics
    environment:
      - OTEL_LOG_LEVEL=info

  tempo:
    image: grafana/tempo:latest
    volumes:
      - ./tempo/tempo-config.yml:/etc/tempo/config.yaml
    ports:
      - "3200:3200"

  prometheus:
    image: prom/prometheus:latest
    volumes:
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml
    ports:
      - "9090:9090"

  loki:
    image: grafana/loki:latest
    volumes:
      - ./loki/loki-config.yml:/etc/loki/config.yaml
    ports:
      - "3100:3100"

  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    volumes:
      - ./grafana/provisioning:/etc/grafana/provisioning
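For metrics to reach Grafana, Prometheus must scrape the central collector's Prometheus exporter on port 8889. A minimal sketch of the mounted prometheus.yml (the job name is illustrative):

scrape_configs:
  - job_name: otel-collector
    scrape_interval: 15s
    static_configs:
      - targets: ['otel-collector:8889']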
Dashboard Configuration
Instance Variable
Add a dashboard variable to filter by instance:
| Setting | Value |
|---|---|
| Name | instance |
| Type | Query |
| Data source | Prometheus |
| Query | label_values(appserver_http_requests_total, instance_id) |
| Multi-value | Yes |
| Include All option | Yes |
| All value | .* |
Updated Queries
All panel queries should include the instance_id filter:
# Request rate (filterable by instance)
sum(rate(appserver_http_requests_total{instance_id=~"$instance"}[5m])) * 60
# Error rate (filterable by instance)
sum(rate(appserver_http_requests_total{instance_id=~"$instance", status_group="5xx"}[5m]))
/ sum(rate(appserver_http_requests_total{instance_id=~"$instance"}[5m])) * 100
Instance Overview Panel
Add a table showing health across all instances:
# Error rate per instance
sum by (instance_id, instance_name) (
rate(appserver_http_requests_total{status_group="5xx"}[5m])
)
/
sum by (instance_id, instance_name) (
rate(appserver_http_requests_total[5m])
) * 100
See Grafana Dashboard Guide for complete panel configurations.
Security
Instance to Central Communication
TLS Encryption:
Always use TLS for production deployments:
exporters:
  otlp/central:
    endpoint: "central-otel.company.com:4317"
    tls:
      insecure: false
      cert_file: /etc/certs/client.crt
      key_file: /etc/certs/client.key
      ca_file: /etc/certs/ca.crt
Bearer Token Authentication:
Use tokens to authenticate instances:
exporters:
  otlp/central:
    endpoint: "central-otel.company.com:4317"
    headers:
      Authorization: "Bearer ${env:CENTRAL_OTEL_TOKEN}"
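On the receiving side, the central collector can validate these tokens with the bearertokenauth extension from the collector-contrib distribution. A hedged sketch (confirm the extension is included in your collector build):

extensions:
  bearertokenauth:
    scheme: "Bearer"
    token: "${env:CENTRAL_OTEL_TOKEN}"

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
        auth:
          authenticator: bearertokenauth

service:
  extensions: [bearertokenauth]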
Network Isolation:
- Use VPN or private networking between instances and central collector
- Restrict central collector ports to known instance IPs
- Consider using service mesh for mTLS
Data Isolation
Telemetry data is isolated by the instance_id attribute:
- Metrics: Filter by the instance_id label
- Traces: Filter by the resource.instance.id attribute
- Logs: Filter by the instance_id stream label
Grafana RBAC:
Create separate folders/dashboards per customer team with appropriate permissions.
Querying Multi-Instance Data
Prometheus (Metrics)
# All instances
sum(rate(appserver_http_requests_total[5m])) by (instance_id)
# Specific instance
sum(rate(appserver_http_requests_total{instance_id="customer-acme-prod"}[5m]))
# Compare instances
sum(rate(appserver_http_requests_total{instance_id=~"customer-acme.*"}[5m])) by (instance_id)
Tempo (Traces)
# All traces from specific instance
{ resource.instance.id = "customer-acme-prod" }
# Error traces from any instance
{ resource.instance.id != "" && status = error }
# Slow traces across instances
{ resource.instance.id != "" && duration > 1s }
Loki (Logs)
# All logs from specific instance
{instance_id="customer-acme-prod"}
# Errors across all instances
{instance_id!=""} | json | level = `error`
# Search by customer
{instance_id=~"customer-acme.*"} |= `connection refused`
Troubleshooting
Instance Not Appearing in Central Grafana
- Check environment variables:
  docker exec otel-collector env | grep INSTANCE
- Verify collector config:
  - Ensure the resource processor has the instance.id attribute
  - Ensure the otlp/central exporter is enabled in the pipelines
- Check network connectivity:
  curl -v https://central-otel.company.com:4318/v1/traces
- Check collector logs:
  docker logs otel-collector 2>&1 | grep -i error
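If data still does not appear, a temporary debug exporter on the central collector shows whether telemetry (and its instance.id attribute) is arriving at all. A sketch (newer collector releases call this exporter debug; older ones use logging):

exporters:
  debug:
    verbosity: detailed   # prints received spans, metrics, and logs with resource attributes

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp/tempo, debug]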
Missing Instance Labels in Metrics
Ensure the central collector's Prometheus exporter has resource-to-telemetry conversion enabled:
exporters:
  prometheus:
    resource_to_telemetry_conversion:
      enabled: true
Authentication Failures
- Verify the token is set:
  echo $CENTRAL_OTEL_TOKEN
- Check that the central collector accepts the token
- Verify TLS certificates are valid and not expired
Related Topics
- Grafana Dashboard Guide - Dashboard with instance filtering
- Telemetry Configuration - Configuration reference
- Telemetry Infrastructure - Single-instance architecture
- Grafana Integration - Query reference