Multi-Instance Telemetry

Architecture and configuration guide for aggregating telemetry from multiple AppServer deployments into a central monitoring system.

Overview

When running AppServer for multiple customers, each customer typically gets their own isolated deployment. This guide explains how to:

  • Aggregate telemetry from all instances to a central location
  • Identify each instance in the telemetry data
  • Create dashboards that support both global and per-instance views
  • Secure the telemetry data flow between instances

Architecture

Each customer instance runs its own OTEL Collector, which forwards telemetry to a central collector for aggregation.

┌──────────────────┐   ┌──────────────────┐   ┌──────────────────┐
│    Instance A    │   │    Instance B    │   │    Instance C    │
│                  │   │                  │   │                  │
│  AppServer,      │   │  AppServer,      │   │  AppServer,      │
│  Node Apps,      │   │  Node Apps,      │   │  Node Apps,      │
│  Browser         │   │  Browser         │   │  Browser         │
│        │         │   │        │         │   │        │         │
│        ▼         │   │        ▼         │   │        ▼         │
│  Instance OTEL   │   │  Instance OTEL   │   │  Instance OTEL   │
│  Collector       │   │  Collector       │   │  Collector       │
│  (+instance.id=A)│   │  (+instance.id=B)│   │  (+instance.id=C)│
│        │         │   │        │         │   │        │         │
│        ▼         │   │        ▼         │   │        ▼         │
│  Local Backends  │   │  Local Backends  │   │  Local Backends  │
│  (optional)      │   │  (optional)      │   │  (optional)      │
└────────┬─────────┘   └────────┬─────────┘   └────────┬─────────┘
         │                      │                      │
         └──────────────────────┼──────────────────────┘
                                ▼
                    ┌───────────────────────┐
                    │  Central OTEL         │
                    │  Collector            │
                    │  (aggregates all      │
                    │   instances)          │
                    └───────────┬───────────┘
                                │
             ┌──────────────────┼──────────────────┐
             ▼                  ▼                  ▼
       ┌──────────┐       ┌──────────┐       ┌──────────┐
       │  Tempo   │       │Prometheus│       │   Loki   │
       │ (traces) │       │ (metrics)│       │  (logs)  │
       └──────────┘       └──────────┘       └──────────┘

                          ┌──────────┐
                          │ Grafana  │
                          │ (central)│
                          └──────────┘

Benefits:

Benefit               Description
Local observability   Each instance can optionally have local backends for on-site debugging
Resilience            Instance collectors buffer data during central collector outages
Isolation             Instance issues don't affect other instances' telemetry
Flexibility           Easy to add/remove instances without affecting the central system

Alternative: Direct Connection

For simpler deployments, instances can connect directly to the central collector:

┌──────────────────┐   ┌──────────────────┐   ┌──────────────────┐
│    Instance A    │   │    Instance B    │   │    Instance C    │
│                  │   │                  │   │                  │
│  Services        │   │  Services        │   │  Services        │
│  (+instance.id=A)│   │  (+instance.id=B)│   │  (+instance.id=C)│
│                  │   │                  │   │                  │
└────────┬─────────┘   └────────┬─────────┘   └────────┬─────────┘
         │                      │                      │
         └──────────────────────┼──────────────────────┘
                                ▼
                    ┌───────────────────────┐
                    │  Central OTEL         │
                    │  Collector            │
                    └───────────────────────┘

Use direct connection when:

  • You don't need local observability
  • Network latency to central collector is low
  • Simplicity is preferred over resilience
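
In the direct model there is no instance collector to stamp identifying attributes, so each service exports straight to the central collector and must carry its own identity. A minimal sketch using the standard OTel SDK environment variables in a compose service definition (service name and values are illustrative):

services:
  appserver:
    environment:
      # Standard OTel SDK variables: export target and resource attributes
      - OTEL_EXPORTER_OTLP_ENDPOINT=https://central-otel.company.com:4317
      - OTEL_RESOURCE_ATTRIBUTES=instance.id=customer-acme-prod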

Collector Communication

Protocol: OTLP (OpenTelemetry Protocol)

Instance collectors communicate with the central collector using OTLP, the native OpenTelemetry protocol. OTLP supports both gRPC and HTTP transports:

Transport   Port   Use Case
gRPC        4317   Preferred for server-to-server communication (efficient, streaming)
HTTP        4318   When gRPC is blocked by firewalls or proxies
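
As a sketch, the transport is selected by which exporter the instance collector uses: the otlp exporter speaks gRPC, while the otlphttp exporter speaks HTTP (the endpoint host shown is the example used later in this guide):

exporters:
  # gRPC transport (port 4317) - preferred
  otlp/central:
    endpoint: "central-otel.company.com:4317"
  # HTTP transport (port 4318) - for networks where gRPC is blocked
  otlphttp/central:
    endpoint: "https://central-otel.company.com:4318"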

Data Flow

┌──────────────────────────────────────────────────────────────────────┐
│                       Instance OTEL Collector                        │
├──────────────────────────────────────────────────────────────────────┤
│                                                                      │
│  1. RECEIVE         2. PROCESS             3. EXPORT                 │
│  ┌─────────────┐    ┌─────────────────┐    ┌─────────────────────┐   │
│  │ OTLP        │    │ memory_limiter  │    │ Local Backends      │   │
│  │ Receiver    │───▶│ batch           │───▶│ (Tempo, Prometheus, │   │
│  │ (:4317/4318)│    │ resource        │    │  Loki)              │   │
│  └─────────────┘    │ (+instance.id)  │    └─────────────────────┘   │
│                     └────────┬────────┘                              │
│                              │                                       │
│                              ▼                                       │
│                   ┌─────────────────────┐                            │
│                   │ OTLP Exporter       │                            │
│                   │ (otlp/central)      │                            │
│                   │                     │                            │
│                   │ - Sending Queue     │                            │
│                   │ - Retry on Failure  │                            │
│                   │ - TLS + Auth        │                            │
│                   └──────────┬──────────┘                            │
│                              │                                       │
└──────────────────────────────┼───────────────────────────────────────┘
                               │
                               │  OTLP/gRPC or OTLP/HTTP
                               │  (traces, metrics, logs)
                               ▼
┌──────────────────────────────────────────────────────────────────────┐
│                        Central OTEL Collector                        │
├──────────────────────────────────────────────────────────────────────┤
│                                                                      │
│  1. RECEIVE         2. PROCESS             3. EXPORT                 │
│  ┌─────────────┐    ┌─────────────────┐    ┌─────────────────────┐   │
│  │ OTLP        │    │ memory_limiter  │    │ Central Backends    │   │
│  │ Receiver    │───▶│ batch           │───▶│ (Tempo, Prometheus, │   │
│  │ (:4317/4318)│    │                 │    │  Loki)              │   │
│  └─────────────┘    └─────────────────┘    └─────────────────────┘   │
│                                                                      │
└──────────────────────────────────────────────────────────────────────┘

Processing Pipeline

The instance collector processes telemetry in this order:

  1. Receive: OTLP receiver accepts traces, metrics, and logs from services
  2. Memory Limit: Prevents out-of-memory by dropping data if limits are exceeded
  3. Batch: Groups telemetry data for efficient export (reduces connections)
  4. Resource: Adds instance.id and instance.name attributes to ALL data
  5. Export: Sends to local backends AND central collector (in parallel)
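
A minimal sketch of how this ordering maps onto the collector's pipeline definition (component names follow the configs shown later in this guide; only the traces pipeline is shown):

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch, resource]  # applied in the order listed
      exporters: [otlp/tempo, otlp/central]          # local backend and central collector in parallel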

Buffering and Reliability

The OTLP exporter to the central collector includes reliability features:

exporters:
  otlp/central:
    endpoint: "${env:CENTRAL_OTEL_ENDPOINT}"
    sending_queue:
      enabled: true
      num_consumers: 10        # Parallel export workers
      queue_size: 1000         # Buffer up to 1000 batches
    retry_on_failure:
      enabled: true
      initial_interval: 5s     # First retry after 5s
      max_interval: 30s        # Max backoff between retries
      max_elapsed_time: 300s   # Give up after 5 minutes

What happens during central collector outage:

  1. Instance collector queues data locally (up to queue_size batches)
  2. Retries with exponential backoff
  3. If queue fills, oldest data is dropped (local backends still receive data)
  4. When central collector recovers, queued data is sent
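
If queued data must also survive a collector restart, the sending queue can optionally be backed by persistent storage. A minimal sketch, assuming the contrib collector image and its file_storage extension (the directory path is illustrative):

extensions:
  file_storage:
    directory: /var/lib/otelcol/queue   # must be writable and persistent

exporters:
  otlp/central:
    endpoint: "${env:CENTRAL_OTEL_ENDPOINT}"
    sending_queue:
      enabled: true
      storage: file_storage             # persist queued batches on disk instead of in memory

service:
  extensions: [file_storage]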

Telemetry Signals

All three telemetry signals use the same communication path:

Signal    Content                                 Central Backend
Traces    Distributed request traces with spans   Tempo
Metrics   Counters, histograms, gauges            Prometheus
Logs      Structured JSON log entries             Loki

Each signal carries the instance.id attribute, enabling filtering in the central Grafana.

Network Requirements

Direction            Ports                        Protocol           Notes
Instance → Central   4317 (gRPC) or 4318 (HTTP)   OTLP               TLS recommended
Central → Backends   Various                      Backend-specific   Internal network

Firewall rules for instance collector:

# Outbound to central collector
ALLOW TCP instance:* → central:4317 (gRPC)
ALLOW TCP instance:* → central:4318 (HTTP)

Instance Identification

Resource Attributes

Each instance adds identifying attributes to all telemetry:

Attribute       Description           Example
instance.id     Unique identifier     customer-acme-prod
instance.name   Human-readable name   ACME Corp Production
customer.id     Customer identifier   acme

These attributes are added by the OTEL Collector's resource processor, ensuring ALL telemetry (traces, metrics, logs) from that instance is tagged.

Where Attributes Are Added

Instance OTEL Collector (recommended):

The resource processor adds attributes to all incoming telemetry:

processors:
  resource:
    attributes:
      - key: instance.id
        value: "${env:INSTANCE_ID}"
        action: insert
      - key: instance.name
        value: "${env:INSTANCE_NAME}"
        action: insert

Application Level (alternative):

Use ExtraAttributes in Go telemetry config:

cfg := &telemetry.Config{
    ServiceName: "appserver",
    ExtraAttributes: map[string]string{
        "instance.id":   os.Getenv("INSTANCE_ID"),
        "instance.name": os.Getenv("INSTANCE_NAME"),
    },
}

Configuration

Instance Collector Setup

  1. Set environment variables in your deployment:
# Instance identification
INSTANCE_ID=customer-acme-prod
INSTANCE_NAME="ACME Corp Production"

# Central collector connection (optional)
CENTRAL_OTEL_ENDPOINT=https://central-otel.company.com:4317
CENTRAL_OTEL_TOKEN=<secure-token>
  2. Update the OTEL collector config to add instance attributes:

The instance collector config at docker/observability/otel-collector/otel-config.yml already includes:

processors:
  resource:
    attributes:
      - key: service.namespace
        value: "appserver"
        action: insert
      - key: deployment.environment
        value: "${env:DEPLOYMENT_ENVIRONMENT}"
        action: insert
      # Instance identification
      - key: instance.id
        value: "${env:INSTANCE_ID}"
        action: insert
      - key: instance.name
        value: "${env:INSTANCE_NAME}"
        action: insert
  3. Enable central forwarding (for hierarchical setup):

Uncomment the otlp/central exporter and add it to pipelines:

exporters:
  otlp/central:
    endpoint: "${env:CENTRAL_OTEL_ENDPOINT}"
    tls:
      insecure: false
    headers:
      Authorization: "Bearer ${env:CENTRAL_OTEL_TOKEN}"

service:
  pipelines:
    traces:
      exporters: [otlp/tempo, otlp/central, logging]
    metrics:
      exporters: [prometheus, otlp/central, logging]
    logs:
      exporters: [loki, otlp/central, logging]

Central Collector Setup

Deploy the central collector using the config at docker/observability/central-collector/otel-config.yml:

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  memory_limiter:
    check_interval: 1s
    limit_percentage: 80
    spike_limit_percentage: 25
  batch:
    timeout: 10s
    send_batch_size: 1024
    send_batch_max_size: 2048

exporters:
  prometheus:
    endpoint: "0.0.0.0:8889"
    namespace: appserver
    resource_to_telemetry_conversion:
      enabled: true   # Exposes instance.id as a metric label
  loki:
    endpoint: "http://loki:3100/loki/api/v1/push"
  otlp/tempo:
    endpoint: "tempo:4317"
    tls:
      insecure: true

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp/tempo]
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [prometheus]
    logs:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [loki]

Docker Compose Example

Central infrastructure (docker-compose.central.yml):

version: '3.8'

services:
  otel-collector:
    image: otel/opentelemetry-collector-contrib:latest
    volumes:
      - ./central-collector/otel-config.yml:/etc/otelcol/config.yaml
    ports:
      - "4317:4317"   # gRPC
      - "4318:4318"   # HTTP
      - "8889:8889"   # Prometheus metrics
    environment:
      - OTEL_LOG_LEVEL=info

  tempo:
    image: grafana/tempo:latest
    volumes:
      - ./tempo/tempo-config.yml:/etc/tempo/config.yaml
    ports:
      - "3200:3200"

  prometheus:
    image: prom/prometheus:latest
    volumes:
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml
    ports:
      - "9090:9090"

  loki:
    image: grafana/loki:latest
    volumes:
      - ./loki/loki-config.yml:/etc/loki/config.yaml
    ports:
      - "3100:3100"

  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    volumes:
      - ./grafana/provisioning:/etc/grafana/provisioning
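
On the instance side, the existing collector service only needs the identification and forwarding variables. A hedged sketch of what that service entry might look like (volume path and values are illustrative):

services:
  otel-collector:
    image: otel/opentelemetry-collector-contrib:latest
    volumes:
      - ./otel-collector/otel-config.yml:/etc/otelcol/config.yaml
    environment:
      - INSTANCE_ID=customer-acme-prod
      - INSTANCE_NAME=ACME Corp Production
      - DEPLOYMENT_ENVIRONMENT=production
      - CENTRAL_OTEL_ENDPOINT=https://central-otel.company.com:4317
      - CENTRAL_OTEL_TOKEN=${CENTRAL_OTEL_TOKEN}
    ports:
      - "4317:4317"
      - "4318:4318"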

Dashboard Configuration

Instance Variable

Add a dashboard variable to filter by instance:

Setting              Value
Name                 instance
Type                 Query
Data source          Prometheus
Query                label_values(appserver_http_requests_total, instance_id)
Multi-value          Yes
Include All option   Yes
All value            .*

Updated Queries

All panel queries should include the instance_id filter:

# Request rate (filterable by instance)
sum(rate(appserver_http_requests_total{instance_id=~"$instance"}[5m])) * 60

# Error rate (filterable by instance)
sum(rate(appserver_http_requests_total{instance_id=~"$instance", status_group="5xx"}[5m]))
/ sum(rate(appserver_http_requests_total{instance_id=~"$instance"}[5m])) * 100

Instance Overview Panel

Add a table showing health across all instances:

# Error rate per instance
sum by (instance_id, instance_name) (
  rate(appserver_http_requests_total{status_group="5xx"}[5m])
)
/
sum by (instance_id, instance_name) (
  rate(appserver_http_requests_total[5m])
) * 100

See the Grafana Dashboard Guide for complete panel configurations.

Security

Instance to Central Communication

TLS Encryption:

Always use TLS for production deployments:

exporters:
  otlp/central:
    endpoint: "central-otel.company.com:4317"
    tls:
      insecure: false
      cert_file: /etc/certs/client.crt
      key_file: /etc/certs/client.key
      ca_file: /etc/certs/ca.crt

Bearer Token Authentication:

Use tokens to authenticate instances:

exporters:
  otlp/central:
    endpoint: "central-otel.company.com:4317"
    headers:
      Authorization: "Bearer ${env:CENTRAL_OTEL_TOKEN}"

Network Isolation:

  • Use VPN or private networking between instances and central collector
  • Restrict central collector ports to known instance IPs
  • Consider using service mesh for mTLS

Data Isolation

Telemetry data is isolated by the instance_id attribute:

  • Metrics: Filter by instance_id label
  • Traces: Filter by resource.instance.id attribute
  • Logs: Filter by instance_id stream label

Grafana RBAC:

Create separate folders/dashboards per customer team with appropriate permissions.
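
A hedged sketch of one way to back this with Grafana's file-based dashboard provisioning, using one provider per customer folder (folder names and paths are illustrative; the folder permissions themselves are granted to the customer team inside Grafana):

apiVersion: 1
providers:
  - name: acme-dashboards
    folder: ACME Corp                           # per-customer folder
    type: file
    options:
      path: /var/lib/grafana/dashboards/acme
  - name: global-dashboards
    folder: Operations                          # internal cross-instance views
    type: file
    options:
      path: /var/lib/grafana/dashboards/global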

Querying Multi-Instance Data

Prometheus (Metrics)

# All instances
sum(rate(appserver_http_requests_total[5m])) by (instance_id)

# Specific instance
sum(rate(appserver_http_requests_total{instance_id="customer-acme-prod"}[5m]))

# Compare instances
sum(rate(appserver_http_requests_total{instance_id=~"customer-acme.*"}[5m])) by (instance_id)

Tempo (Traces)

# All traces from specific instance
{ resource.instance.id = "customer-acme-prod" }

# Error traces from any instance
{ resource.instance.id != "" && status = error }

# Slow traces across instances
{ resource.instance.id != "" && duration > 1s }

Loki (Logs)

# All logs from specific instance
{instance_id="customer-acme-prod"}

# Errors across all instances
{instance_id!=""} | json | level = `error`

# Search by customer
{instance_id=~"customer-acme.*"} |= `connection refused`

Troubleshooting

Instance Not Appearing in Central Grafana

  1. Check environment variables:

    docker exec otel-collector env | grep INSTANCE
  2. Verify collector config:

    • Ensure resource processor has instance.id attribute
    • Ensure otlp/central exporter is enabled in pipelines
  3. Check network connectivity:

    curl -v https://central-otel.company.com:4318/v1/traces
  4. Check collector logs:

    docker logs otel-collector 2>&1 | grep -i error
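
If the logs are quiet, raising the collector's own log verbosity can help; a minimal sketch using the standard service telemetry settings:

service:
  telemetry:
    logs:
      level: debug   # more verbose collector logs, including exporter activity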

Missing Instance Labels in Metrics

Ensure the central collector's Prometheus exporter has resource-to-telemetry conversion enabled:

exporters:
  prometheus:
    resource_to_telemetry_conversion:
      enabled: true

Authentication Failures

  1. Verify token is set:

    echo $CENTRAL_OTEL_TOKEN
  2. Check central collector accepts the token

  3. Verify TLS certificates are valid and not expired