Multi-Instance Telemetry
Architecture and configuration guide for aggregating telemetry from multiple AppServer deployments into a central monitoring system.
Overview
When running AppServer for multiple customers, each customer typically gets their own isolated deployment. This guide explains how to:
- Aggregate telemetry from all instances to a central location
- Identify each instance in the telemetry data
- Create dashboards that support both global and per-instance views
- Secure the telemetry data flow between instances
Architecture
Recommended: Hierarchical Collectors
Each customer instance keeps its own OTEL Collector, which forwards to a central collector for aggregation.
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│ Instance A │ │ Instance B │ │ Instance C │
│ ┌───────────┐ │ │ ┌───────────┐ │ │ ┌───────────┐ │
│ │ AppServer │ │ │ │ AppServer │ │ │ │ AppServer │ │
│ │ Node Apps │ │ │ │ Node Apps │ │ │ │ Node Apps │ │
│ │ Browser │ │ │ │ Browser │ │ │ │ Browser │ │
│ └─────┬─────┘ │ │ └─────┬─────┘ │ │ └─────┬─────┘ │
│ ▼ │ │ ▼ │ │ ▼ │
│ ┌───────────┐ │ │ ┌───────────┐ │ │ ┌───────────┐ │
│ │ Instance │ │ │ │ Instance │ │ │ │ Instance │ │
│ │ OTEL │ │ │ │ OTEL │ │ │ │ OTEL │ │
│ │ Collector │ │ │ │ Collector │ │ │ │ Collector │ │
│ │ │ │ │ │ │ │ │ │ │ │
│ │ +instance │ │ │ │ +instance │ │ │ │ +instance │ │
│ │ .id=A │ │ │ │ .id=B │ │ │ │ .id=C │ │
│ └─────┬─────┘ │ │ └─────┬─────┘ │ │ └─────┬─────┘ │
│ │ │ │ │ │ │ │ │
│ ▼ │ │ ▼ │ │ ▼ │
│ ┌───────────┐ │ │ ┌───────────┐ │ │ ┌───────────┐ │
│ │ Local │ │ │ │ Local │ │ │ │ Local │ │
│ │ Backends │ │ │ │ Backends │ │ │ │ Backends │ │
│ │ (optional)│ │ │ │ (optional)│ │ │ │ (optional)│ │
│ └───────────┘ │ │ └───────────┘ │ │ └───────────┘ │
└────────┼────────┘ └────────┼────────┘ └────────┼────────┘
│ │ │
└────────────────────┼────────────────────┘
▼
┌─────────────────────┐
│ Central OTEL │
│ Collector │
│ │
│ (aggregates all │
│ instances) │
└─────────┬───────────┘
│
┌───────────────────┼───────────────────┐
▼ ▼ ▼
┌──────────┐ ┌──────────┐ ┌──────────┐
│ Tempo │ │Prometheus│ │ Loki │
│ (traces) │ │ (metrics)│ │ (logs) │
└──────────┘ └──────────┘ └──────────┘
│
▼
┌──────────┐
│ Grafana │
│ (central)│
└──────────┘
Benefits:
| Benefit | Description |
|---|---|
| Local observability | Each instance can optionally have local backends for on-site debugging |
| Resilience | Instance collectors buffer data during central collector outages |
| Isolation | Instance issues don't affect other instances' telemetry |
| Flexibility | Easy to add/remove instances without affecting the central system |
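In configuration terms, the hierarchy amounts to each instance collector keeping its local exporters and adding one otlp exporter that points at the central collector. A minimal sketch (the full, annotated configuration appears under Configuration below):

exporters:
  otlp/tempo:        # local backend (optional)
    endpoint: "tempo:4317"
  otlp/central:      # forwards everything to the central collector
    endpoint: "${env:CENTRAL_OTEL_ENDPOINT}"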
Alternative: Direct Connection
For simpler deployments, instances can connect directly to the central collector:
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│ Instance A │ │ Instance B │ │ Instance C │
│ ┌───────────┐ │ │ ┌───────────┐ │ │ ┌───────────┐ │
│ │ Services │──┼──┼──│ Services │──┼──┼──│ Services │ │
│ │ +instance │ │ │ │ +instance │ │ │ │ +instance │ │
│ │ .id=A │ │ │ │ .id=B │ │ │ │ .id=C │ │
│ └───────────┘ │ │ └───────────┘ │ │ └───────────┘ │
└────────┼────────┘ └────────┼────────┘ └────────┼────────┘
│ │ │
└────────────────────┼────────────────────┘
▼
┌─────────────────────┐
│ Central OTEL │
│ Collector │
└─────────────────────┘
Use direct connection when:
- You don't need local observability
- Network latency to central collector is low
- Simplicity is preferred over resilience
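With no instance collector in the path, each service's SDK must carry the instance attributes itself and export straight to the central collector. A minimal docker-compose-style sketch using the standard OpenTelemetry SDK environment variables (the hostname and attribute values are the placeholders used elsewhere in this guide; authentication is covered under Security):

services:
  appserver:
    environment:
      # Standard OTel SDK variables: send OTLP directly to the central collector
      - OTEL_EXPORTER_OTLP_ENDPOINT=https://central-otel.company.com:4318
      # Tag everything emitted by this deployment
      - OTEL_RESOURCE_ATTRIBUTES=instance.id=customer-acme-prod,customer.id=acme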
Collector Communication
Protocol: OTLP (OpenTelemetry Protocol)
Instance collectors communicate with the central collector using OTLP, the native OpenTelemetry protocol. OTLP supports both gRPC and HTTP transports:
| Transport | Port | Use Case |
|---|---|---|
| gRPC | 4317 | Preferred for server-to-server communication (efficient, streaming) |
| HTTP | 4318 | When gRPC is blocked by firewalls or proxies |
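On the instance side, the transport choice maps directly to which exporter is configured. A minimal sketch of both variants (only one is needed; the hostname is the placeholder used elsewhere in this guide):

exporters:
  # gRPC transport on port 4317 (preferred)
  otlp/central:
    endpoint: "central-otel.company.com:4317"
  # HTTP transport on port 4318 (when gRPC is blocked by firewalls or proxies)
  otlphttp/central:
    endpoint: "https://central-otel.company.com:4318"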
Data Flow
┌─────────────────────────────────────────────────────────────────────────┐
│ Instance OTEL Collector │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ 1. RECEIVE 2. PROCESS 3. EXPORT │
│ ┌─────────────┐ ┌─────────────────┐ ┌─────────────────────┐ │
│ │ OTLP │ │ memory_limiter │ │ Local Backends │ │
│ │ Receiver │────▶│ batch │────▶│ (Tempo, Prometheus, │ │
│ │ (:4317/4318)│ │ resource │ │ Loki) │ │
│ └─────────────┘ │ (+instance.id) │ └─────────────────────┘ │
│ └────────┬────────┘ │ │
│ │ │ │
│ │ ┌──────────────┘ │
│ ▼ ▼ │
│ ┌─────────────────────┐ │
│ │ OTLP Exporter │ │
│ │ (otlp/central) │ │
│ │ │ │
│ │ - Sending Queue │ │
│ │ - Retry on Failure │ │
│ │ - TLS + Auth │ │
│ └──────────┬──────────┘ │
│ │ │
└─────────────────────────────────┼──────────────────────────────────────┘
│
│ OTLP/gRPC or OTLP/HTTP
│ (traces, metrics, logs)
▼
┌─────────────────────────────────────────────────────────────────────────┐
│ Central OTEL Collector │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ 1. RECEIVE 2. PROCESS 3. EXPORT │
│ ┌─────────────┐ ┌─────────────────┐ ┌─────────────────────┐ │
│ │ OTLP │ │ memory_limiter │ │ Central Backends │ │
│ │ Receiver │────▶│ batch │────▶│ (Tempo, Prometheus, │ │
│ │ (:4317/4318)│ │ │ │ Loki) │ │
│ └─────────────┘ └─────────────────┘ └─────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Processing Pipeline
The instance collector processes telemetry in this order:
- Receive: OTLP receiver accepts traces, metrics, and logs from services
- Memory Limit: Prevents out-of-memory by dropping data if limits are exceeded
- Batch: Groups telemetry data for efficient export (reduces connections)
- Resource: Adds instance.id and instance.name attributes to ALL data
- Export: Sends to local backends AND the central collector (in parallel)
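Expressed as a collector pipeline definition, that ordering looks roughly like this (a sketch only; exporter names follow the conventions used later in this guide):

service:
  pipelines:
    traces:
      receivers: [otlp]                              # receive
      processors: [memory_limiter, batch, resource]  # limit memory, batch, tag with instance.id
      exporters: [otlp/tempo, otlp/central]          # local backend and central collector in parallel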
Buffering and Reliability
The OTLP exporter to the central collector includes reliability features:
exporters:
  otlp/central:
    endpoint: "${env:CENTRAL_OTEL_ENDPOINT}"
    sending_queue:
      enabled: true
      num_consumers: 10      # Parallel export workers
      queue_size: 1000       # Buffer up to 1000 batches
    retry_on_failure:
      enabled: true
      initial_interval: 5s   # First retry after 5s
      max_interval: 30s      # Max backoff between retries
      max_elapsed_time: 300s # Give up after 5 minutes
What happens during central collector outage:
- Instance collector queues data locally (up to queue_size batches)
- Retries with exponential backoff
- If the queue fills, new data is dropped (local backends still receive data)
- When central collector recovers, queued data is sent
Telemetry Signals
All three telemetry signals use the same communication path:
| Signal | Content | Central Backend |
|---|---|---|
| Traces | Distributed request traces with spans | Tempo |
| Metrics | Counters, histograms, gauges | Prometheus |
| Logs | Structured JSON log entries | Loki |
Each signal carries the instance.id attribute, enabling filtering in the central Grafana.
Network Requirements
| Direction | Ports | Protocol | Notes |
|---|---|---|---|
| Instance → Central | 4317 (gRPC) or 4318 (HTTP) | OTLP | TLS recommended |
| Central → Backends | Various | Backend-specific | Internal network |
Firewall rules for instance collector:
# Outbound to central collector
ALLOW TCP instance:* → central:4317 (gRPC)
ALLOW TCP instance:* → central:4318 (HTTP)
Instance Identification
Resource Attributes
Each instance adds identifying attributes to all telemetry:
| Attribute | Description | Example |
|---|---|---|
| instance.id | Unique identifier | customer-acme-prod |
| instance.name | Human-readable name | ACME Corp Production |
| customer.id | Customer identifier | acme |
These attributes are added by the OTEL Collector's resource processor, ensuring ALL telemetry (traces, metrics, logs) from that instance is tagged.
Where Attributes Are Added
Instance OTEL Collector (recommended):
The resource processor adds attributes to all incoming telemetry:
processors:
  resource:
    attributes:
      - key: instance.id
        value: "${env:INSTANCE_ID}"
        action: insert
      - key: instance.name
        value: "${env:INSTANCE_NAME}"
        action: insert
Application Level (alternative):
Use ExtraAttributes in Go telemetry config:
cfg := &telemetry.Config{
    ServiceName: "appserver",
    ExtraAttributes: map[string]string{
        "instance.id":   os.Getenv("INSTANCE_ID"),
        "instance.name": os.Getenv("INSTANCE_NAME"),
    },
}
Configuration
Instance Collector Setup
- Set environment variables in your deployment:
# Instance identification
INSTANCE_ID=customer-acme-prod
INSTANCE_NAME="ACME Corp Production"
# Central collector connection (optional)
CENTRAL_OTEL_ENDPOINT=https://central-otel.company.com:4317
CENTRAL_OTEL_TOKEN=<secure-token>
- Update the OTEL collector config to add instance attributes:
The instance collector config at docker/observability/otel-collector/otel-config.yml already includes:
processors:
  resource:
    attributes:
      - key: service.namespace
        value: "appserver"
        action: insert
      - key: deployment.environment
        value: "${env:DEPLOYMENT_ENVIRONMENT}"
        action: insert
      # Instance identification
      - key: instance.id
        value: "${env:INSTANCE_ID}"
        action: insert
      - key: instance.name
        value: "${env:INSTANCE_NAME}"
        action: insert
- Enable central forwarding (for hierarchical setup):
Uncomment the otlp/central exporter and add it to pipelines:
exporters:
  otlp/central:
    endpoint: "${env:CENTRAL_OTEL_ENDPOINT}"
    tls:
      insecure: false
    headers:
      Authorization: "Bearer ${env:CENTRAL_OTEL_TOKEN}"

service:
  pipelines:
    traces:
      exporters: [otlp/tempo, otlp/central, logging]
    metrics:
      exporters: [prometheus, otlp/central, logging]
    logs:
      exporters: [loki, otlp/central, logging]
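Putting the steps together, the instance collector's container needs the identification variables at runtime. A docker-compose-style sketch (image, mount path, and port mappings mirror the central example below; adjust to your deployment):

services:
  otel-collector:
    image: otel/opentelemetry-collector-contrib:latest
    volumes:
      - ./otel-collector/otel-config.yml:/etc/otelcol/config.yaml
    environment:
      - INSTANCE_ID=customer-acme-prod
      - INSTANCE_NAME=ACME Corp Production
      - DEPLOYMENT_ENVIRONMENT=production
      - CENTRAL_OTEL_ENDPOINT=https://central-otel.company.com:4317
      - CENTRAL_OTEL_TOKEN=${CENTRAL_OTEL_TOKEN}
    ports:
      - "4317:4317"
      - "4318:4318"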
Central Collector Setup
Deploy the central collector using the config at docker/observability/central-collector/otel-config.yml:
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  memory_limiter:
    check_interval: 1s
    limit_percentage: 80
    spike_limit_percentage: 25
  batch:
    timeout: 10s
    send_batch_size: 1024
    send_batch_max_size: 2048

exporters:
  prometheus:
    endpoint: "0.0.0.0:8889"
    namespace: appserver
    resource_to_telemetry_conversion:
      enabled: true # Exposes instance.id as a metric label
  loki:
    endpoint: "http://loki:3100/loki/api/v1/push"
  otlp/tempo:
    endpoint: "tempo:4317"
    tls:
      insecure: true

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp/tempo]
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [prometheus]
    logs:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [loki]
Docker Compose Example
Central infrastructure (docker-compose.central.yml):
version: '3.8'

services:
  otel-collector:
    image: otel/opentelemetry-collector-contrib:latest
    volumes:
      - ./central-collector/otel-config.yml:/etc/otelcol/config.yaml
    ports:
      - "4317:4317"   # gRPC
      - "4318:4318"   # HTTP
      - "8889:8889"   # Prometheus metrics
    environment:
      - OTEL_LOG_LEVEL=info

  tempo:
    image: grafana/tempo:latest
    volumes:
      - ./tempo/tempo-config.yml:/etc/tempo/config.yaml
    ports:
      - "3200:3200"

  prometheus:
    image: prom/prometheus:latest
    volumes:
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml
    ports:
      - "9090:9090"

  loki:
    image: grafana/loki:latest
    volumes:
      - ./loki/loki-config.yml:/etc/loki/config.yaml
    ports:
      - "3100:3100"

  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    volumes:
      - ./grafana/provisioning:/etc/grafana/provisioning
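For metrics to reach Grafana, Prometheus must scrape the central collector's Prometheus exporter on port 8889. A minimal sketch of the mounted prometheus.yml (the job name is illustrative):

scrape_configs:
  - job_name: otel-collector
    scrape_interval: 15s
    static_configs:
      - targets: ['otel-collector:8889']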
Dashboard Configuration
Instance Variable
Add a dashboard variable to filter by instance:
| Setting | Value |
|---|---|
| Name | instance |
| Type | Query |
| Data source | Prometheus |
| Query | label_values(appserver_http_requests_total, instance_id) |
| Multi-value | Yes |
| Include All option | Yes |
| All value | .* |
Updated Queries
All panel queries should include the instance_id filter:
# Request rate (filterable by instance)
sum(rate(appserver_http_requests_total{instance_id=~"$instance"}[5m])) * 60
# Error rate (filterable by instance)
sum(rate(appserver_http_requests_total{instance_id=~"$instance", status_group="5xx"}[5m]))
/ sum(rate(appserver_http_requests_total{instance_id=~"$instance"}[5m])) * 100
Instance Overview Panel
Add a table showing health across all instances:
# Error rate per instance
sum by (instance_id, instance_name) (
rate(appserver_http_requests_total{status_group="5xx"}[5m])
)
/
sum by (instance_id, instance_name) (
rate(appserver_http_requests_total[5m])
) * 100
See Grafana Dashboard Guide for complete panel configurations.
Security
Instance to Central Communication
TLS Encryption:
Always use TLS for production deployments:
exporters:
  otlp/central:
    endpoint: "central-otel.company.com:4317"
    tls:
      insecure: false
      cert_file: /etc/certs/client.crt
      key_file: /etc/certs/client.key
      ca_file: /etc/certs/ca.crt
Bearer Token Authentication:
Use tokens to authenticate instances:
exporters:
  otlp/central:
    endpoint: "central-otel.company.com:4317"
    headers:
      Authorization: "Bearer ${env:CENTRAL_OTEL_TOKEN}"
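On the receiving side, the central collector can validate these tokens with the bearertokenauth extension from the collector-contrib distribution. A hedged sketch (confirm the extension is included in your collector build):

extensions:
  bearertokenauth:
    scheme: "Bearer"
    token: "${env:CENTRAL_OTEL_TOKEN}"

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
        auth:
          authenticator: bearertokenauth

service:
  extensions: [bearertokenauth]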
Network Isolation:
- Use VPN or private networking between instances and central collector
- Restrict central collector ports to known instance IPs
- Consider using service mesh for mTLS
Data Isolation
Telemetry data is isolated by the instance_id attribute:
- Metrics: Filter by the instance_id label
- Traces: Filter by the resource.instance.id attribute
- Logs: Filter by the instance_id stream label
Grafana RBAC:
Create separate folders/dashboards per customer team with appropriate permissions.
Querying Multi-Instance Data
Prometheus (Metrics)
# All instances
sum(rate(appserver_http_requests_total[5m])) by (instance_id)
# Specific instance
sum(rate(appserver_http_requests_total{instance_id="customer-acme-prod"}[5m]))
# Compare instances
sum(rate(appserver_http_requests_total{instance_id=~"customer-acme.*"}[5m])) by (instance_id)
Tempo (Traces)
# All traces from specific instance
{ resource.instance.id = "customer-acme-prod" }
# Error traces from any instance
{ resource.instance.id != "" && status = error }
# Slow traces across instances
{ resource.instance.id != "" && duration > 1s }
Loki (Logs)
# All logs from specific instance
{instance_id="customer-acme-prod"}
# Errors across all instances
{instance_id!=""} | json | level = `error`
# Search by customer
{instance_id=~"customer-acme.*"} |= `connection refused`
Troubleshooting
Instance Not Appearing in Central Grafana
- Check environment variables:
  docker exec otel-collector env | grep INSTANCE
- Verify collector config:
  - Ensure the resource processor has the instance.id attribute
  - Ensure the otlp/central exporter is enabled in the pipelines
- Check network connectivity:
  curl -v https://central-otel.company.com:4318/v1/traces
- Check collector logs:
  docker logs otel-collector 2>&1 | grep -i error
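If data still does not appear, a temporary debug exporter on the central collector shows whether telemetry (and its instance.id attribute) is arriving at all. A sketch (newer collector releases call this exporter debug; older ones use logging):

exporters:
  debug:
    verbosity: detailed   # prints received spans, metrics, and logs with resource attributes

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp/tempo, debug]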
Missing Instance Labels in Metrics
Ensure the central collector's Prometheus exporter has resource-to-telemetry conversion enabled:
exporters:
  prometheus:
    resource_to_telemetry_conversion:
      enabled: true
Authentication Failures
- Verify the token is set:
  echo $CENTRAL_OTEL_TOKEN
- Check that the central collector accepts the token
- Verify TLS certificates are valid and not expired
Related Topics
- Grafana Dashboard Guide - Dashboard with instance filtering
- Telemetry Configuration - Configuration reference
- Telemetry Infrastructure - Single-instance architecture
- Grafana Integration - Query reference