Grafana Dashboard Guide
A complete guide to building a unified observability dashboard for monitoring all AppServer instances.
Overview
This guide provides a ready-to-use dashboard concept that combines metrics, logs, and traces into a single view. The dashboard enables you to:
- Monitor all services at a glance
- Filter by instance for multi-tenant deployments
- Identify slow and failing endpoints
- View recent errors and warnings
- Click through to detailed traces for investigation
Multi-Instance Support: When running multiple AppServer instances (e.g., for different customers), the dashboard supports filtering by instance.id to view global or per-instance metrics. See Multi-Instance Telemetry for architecture details.
Dashboard Layout
┌─────────────────────────────────────────────────────────────────────────────┐
│ AppServer Observability Dashboard │
│ [Instance: $instance ▼] [Service: $service ▼] [Time Range: Last 1h ▼] │
├─────────────────────────────────────────────────────────────────────────────┤
│ OVERVIEW ROW │
├───────────────┬───────────────┬───────────────┬───────────────┬─────────────┤
│ Requests │ Error Rate │ Avg Latency │ P95 Latency │ Active │
│ /min │ % │ ms │ ms │ Requests │
│ 1,234 │ 0.12% │ 45ms │ 120ms │ 42 │
│ ↑ 5.2% │ ↓ 0.03% │ ↑ 2ms │ ↓ 5ms │ → 0 │
├───────────────┴───────────────┴───────────────┴───────────────┴─────────────┤
│ REQUEST METRICS ROW │
├─────────────────────────────────────┬───────────────────────────────────────┤
│ Request Rate by Service │ Response Time Distribution │
│ ┌─────────────────────────────┐ │ ┌─────────────────────────────┐ │
│ │ ████████████████████ │ │ │ ▄▄▄ │ │
│ │ ██████████████ │ │ │ █████▄ │ │
│ │ ████████ │ │ │ ████████▄▄ │ │
│ │ ████ │ │ │ ████████████▄▄▄___ │ │
│ └─────────────────────────────┘ │ └─────────────────────────────┘ │
│ appserver ─── orchestrator ─── │ 0 50 100 200 500 1000 ms │
├─────────────────────────────────────┴───────────────────────────────────────┤
│ ENDPOINTS ROW │
├─────────────────────────────────────┬───────────────────────────────────────┤
│ Slowest Endpoints (p95) │ Failed Endpoints (5xx) │
│ ┌─────────────────────────────────┐│ ┌─────────────────────────────────┐ │
│ │ /api/reports/generate 850ms ││ │ /api/webhooks/process 12 err │ │
│ │ /api/documents/export 620ms ││ │ /api/sync/external 8 err │ │
│ │ /api/analytics/compute 480ms ││ │ /api/payments/validate 3 err │ │
│ │ /graphql 320ms ││ │ /api/users/bulk 2 err │ │
│ │ /api/search 280ms ││ │ /api/notifications 1 err │ │
│ └─────────────────────────────────┘│ └─────────────────────────────────┘ │
│ [Click row → View traces] │ [Click row → View traces] │
├─────────────────────────────────────┴───────────────────────────────────────┤
│ ERRORS ROW │
├─────────────────────────────────────┬───────────────────────────────────────┤
│ Error Rate Over Time │ Errors by Status Code │
│ ┌─────────────────────────────┐ │ ┌─────────────────────────────┐ │
│ │ ╱╲ │ │ │ 500 ████████████████ 156 │ │
│ │ ╱╲ ╱ ╲ ╱╲ │ │ │ 502 ████████ 52 │ │
│ │ __╱ ╲_╱ ╲__╱ ╲___ │ │ │ 503 █████ 31 │ │
│ └─────────────────────────────┘ │ │ 504 ███ 18 │ │
│ -1h -30m now │ │ 400 ██ 12 │ │
│ │ └─────────────────────────────┘ │
├─────────────────────────────────────┴───────────────────────────────────────┤
│ LOGS ROW │
├─────────────────────────────────────────────────────────────────────────────┤
│ Recent Errors (Last 100) │
│ ┌─────────────────────────────────────────────────────────────────────────┐│
│ │ 14:32:15 [ERROR] appserver Connection refused to database ││
│ │ trace_id=abc123... [→ View Trace] ││
│ │ 14:31:58 [ERROR] orchestrator Container health check failed ││
│ │ trace_id=def456... [→ View Trace] ││
│ │ 14:31:42 [WARN] appserver Slow query detected (2.3s) ││
│ │ trace_id=ghi789... [→ View Trace] ││
│ └─────────────────────────────────────────────────────────────────────────┘│
├─────────────────────────────────────────────────────────────────────────────┤
│ Recent Warnings (Last 100) │
│ ┌─────────────────────────────────────────────────────────────────────────┐│
│ │ 14:32:10 [WARN] appserver Rate limit approaching for tenant xyz ││
│ │ 14:31:55 [WARN] orchestrator Memory usage above 80% threshold ││
│ │ 14:31:30 [WARN] appserver Deprecated API endpoint called ││
│ └─────────────────────────────────────────────────────────────────────────┘│
└─────────────────────────────────────────────────────────────────────────────┘
Setup Instructions
Step 1: Create a New Dashboard
- Open Grafana at http://localhost:3000
- Click Dashboards → New → New Dashboard
- Click the gear icon (Dashboard settings)
- Set Name: AppServer Observability
- Set Auto-refresh: 30s
- Click Save
Step 2: Add Variables
Go to Dashboard settings → Variables → Add variable
Instance Variable (Multi-Instance Deployments)
Use this variable to filter by customer instance when running multiple AppServer deployments.
| Setting | Value |
|---|---|
| Name | instance |
| Type | Query |
| Data source | Prometheus |
| Query | label_values(appserver_http_requests_total, instance_id) |
| Multi-value | Yes |
| Include All option | Yes |
| All value | .* |
Note: The instance_id label is added by the OTEL Collector's resource processor. If you're running a single instance, this variable will have only one value. See Multi-Instance Telemetry for setup instructions.
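For orientation, that resource-processor entry in the Collector config typically looks something like the fragment below; the attribute key, environment variable, and pipeline wiring here are assumptions, so treat it as a sketch and follow Multi-Instance Telemetry for the actual configuration.

```yaml
# Assumed fragment of the OTEL Collector config: the resource processor
# stamps every metric, log, and trace with the instance identity.
processors:
  resource:
    attributes:
      - key: instance.id            # surfaces in Prometheus as the instance_id label
        value: ${env:INSTANCE_ID}   # hypothetical env var carrying the deployment id
        action: upsert
# The processor must also be listed under service.pipelines for each
# pipeline (metrics, logs, traces) to take effect.
```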
Service Variable
| Setting | Value |
|---|---|
| Name | service |
| Type | Query |
| Data source | Prometheus |
| Query | label_values(appserver_http_requests_total{instance_id=~"$instance"}, job) |
| Multi-value | Yes |
| Include All option | Yes |
Path Variable (for filtering)
| Setting | Value |
|---|---|
| Name | path |
| Type | Query |
| Data source | Prometheus |
| Query | label_values(appserver_http_requests_total{instance_id=~"$instance", job=~"$service"}, path) |
| Multi-value | Yes |
| Include All option | Yes |
Panel Configurations
Row 1: Overview Stats
Create a row named "Overview" with 5 stat panels.
Panel 1: Requests/min
| Setting | Value |
|---|---|
| Title | Requests/min |
| Type | Stat |
| Data source | Prometheus |
Query:
sum(rate(appserver_http_requests_total{instance_id=~"$instance", job=~"$service"}[5m])) * 60
Options:
- Unit: short
- Color mode: Background
- Graph mode: Area
- Thresholds: 0 (green)
Panel 2: Error Rate
| Setting | Value |
|---|---|
| Title | Error Rate |
| Type | Stat |
| Data source | Prometheus |
Query:
sum(rate(appserver_http_requests_total{instance_id=~"$instance", job=~"$service", status_group="5xx"}[5m]))
/
sum(rate(appserver_http_requests_total{instance_id=~"$instance", job=~"$service"}[5m]))
* 100
Options:
- Unit: percent (0-100)
- Color mode: Background
- Thresholds: 0 (green), 1 (yellow), 5 (red)
Panel 3: Avg Latency
| Setting | Value |
|---|---|
| Title | Avg Latency |
| Type | Stat |
| Data source | Prometheus |
Query:
sum(rate(appserver_http_request_duration_ms_sum{instance_id=~"$instance", job=~"$service"}[5m]))
/
sum(rate(appserver_http_request_duration_ms_count{instance_id=~"$instance", job=~"$service"}[5m]))
Options:
- Unit: ms
- Color mode: Background
- Thresholds: 0 (green), 100 (yellow), 500 (red)
Panel 4: P95 Latency
| Setting | Value |
|---|---|
| Title | P95 Latency |
| Type | Stat |
| Data source | Prometheus |
Query:
histogram_quantile(0.95,
sum(rate(appserver_http_request_duration_ms_bucket{instance_id=~"$instance", job=~"$service"}[5m])) by (le)
)
Options:
- Unit: ms
- Color mode: Background
- Thresholds: 0 (green), 200 (yellow), 1000 (red)
Panel 5: Active Requests
| Setting | Value |
|---|---|
| Title | Active Requests |
| Type | Stat |
| Data source | Prometheus |
Query:
sum(appserver_http_requests_active{instance_id=~"$instance", job=~"$service"})
Options:
- Unit: short
- Color mode: Background
- Thresholds: 0 (green)
Row 1b: Instance Health (Multi-Instance Only)
When monitoring multiple instances, add this row to see health status across all deployments.
Panel: Instance Health Overview
| Setting | Value |
|---|---|
| Title | Instance Health Overview |
| Type | Table |
| Data source | Prometheus |
Query (Error Rate per Instance):
sum by (instance_id, instance_name) (
rate(appserver_http_requests_total{status_group="5xx"}[5m])
)
/
sum by (instance_id, instance_name) (
rate(appserver_http_requests_total[5m])
) * 100
Transformations:
- Labels to fields
- Organize fields: Show instance_id, instance_name, Value
- Rename Value → Error Rate %
Options:
- Cell display mode for Error Rate %: Color background (gradient)
- Thresholds: 0 (green), 1 (yellow), 5 (red)
Data Links:
- Title: View Instance
- URL: /d/current?var-instance=${__data.fields.instance_id}
Panel: Requests by Instance
| Setting | Value |
|---|---|
| Title | Requests by Instance |
| Type | Time series |
| Data source | Prometheus |
Query:
sum(rate(appserver_http_requests_total[5m])) by (instance_id)
Legend: {{instance_id}}
Options:
- Line interpolation: Smooth
- Fill opacity: 10
Row 2: Request Metrics
Create a row named "Request Metrics" with 2 panels.
Panel 6: Request Rate by Service
| Setting | Value |
|---|---|
| Title | Request Rate by Service |
| Type | Time series |
| Data source | Prometheus |
Query:
sum(rate(appserver_http_requests_total{instance_id=~"$instance", job=~"$service"}[5m])) by (job)
Legend: {{job}}
Options:
- Line interpolation: Smooth
- Fill opacity: 10
- Show points: Never
Panel 7: Response Time Distribution
| Setting | Value |
|---|---|
| Title | Response Time Distribution |
| Type | Histogram |
| Data source | Prometheus |
Query:
sum(increase(appserver_http_request_duration_ms_bucket{instance_id=~"$instance", job=~"$service"}[$__rate_interval])) by (le)
Options:
- Bucket size: Auto
- Combine histogram: Yes
Row 3: Endpoints
Create a row named "Endpoints" with 2 table panels.
Panel 8: Slowest Endpoints (p95)
| Setting | Value |
|---|---|
| Title | Slowest Endpoints (p95) |
| Type | Table |
| Data source | Prometheus |
Query:
topk(10,
histogram_quantile(0.95,
sum(rate(appserver_http_request_duration_ms_bucket{instance_id=~"$instance", job=~"$service"}[5m])) by (le, path, method)
)
)
Transformations:
- Labels to fields
- Organize fields: Show path, method, Value
- Rename Value → P95 (ms)
Options:
- Column width: path = 300px
- Unit for Value: ms
Data Links (for drill-down to traces):
- Title: View Traces
- URL: /explore?left={"datasource":"tempo","queries":[{"refId":"A","queryType":"traceql","query":"{resource.service.name=\"$service\" && name=~\".*${__data.fields.path}.*\"}"}]}
Panel 9: Failed Endpoints (5xx)
| Setting | Value |
|---|---|
| Title | Failed Endpoints (5xx) |
| Type | Table |
| Data source | Prometheus |
Query:
topk(10,
sum(increase(appserver_http_requests_total{instance_id=~"$instance", job=~"$service", status_group="5xx"}[1h])) by (path, method)
)
Transformations:
- Labels to fields
- Organize fields: Show path, method, Value
- Rename Value → Errors
Options:
- Column width: path = 300px
- Cell display mode for Errors: Color background
- Thresholds: 1 (yellow), 10 (red)
Data Links:
- Title: View Error Traces
- URL: /explore?left={"datasource":"tempo","queries":[{"refId":"A","queryType":"traceql","query":"{resource.service.name=\"$service\" && name=~\".*${__data.fields.path}.*\" && status=error}"}]}
Row 4: Errors
Create a row named "Errors" with 2 panels.
Panel 10: Error Rate Over Time
| Setting | Value |
|---|---|
| Title | Error Rate Over Time |
| Type | Time series |
| Data source | Prometheus |
Query A (5xx errors):
sum(rate(appserver_http_requests_total{instance_id=~"$instance", job=~"$service", status_group="5xx"}[5m])) by (job)
Legend: {{job}} - 5xx
Query B (4xx errors):
sum(rate(appserver_http_requests_total{instance_id=~"$instance", job=~"$service", status_group="4xx"}[5m])) by (job)
Legend: {{job}} - 4xx
Options:
- Color: Query A = red, Query B = orange
- Fill opacity: 20
Panel 11: Errors by Status Code
| Setting | Value |
|---|---|
| Title | Errors by Status Code |
| Type | Bar gauge |
| Data source | Prometheus |
Query:
topk(10,
sum(increase(appserver_http_requests_total{instance_id=~"$instance", job=~"$service", status_group=~"4xx|5xx"}[1h])) by (status)
)
Options:
- Orientation: Horizontal
- Display mode: Gradient
- Color: Red
- Text mode: Value and name
Row 5: Logs
Create a row named "Logs" with 2 log panels.
Panel 12: Recent Errors
| Setting | Value |
|---|---|
| Title | Recent Errors |
| Type | Logs |
| Data source | Loki |
Query:
{instance_id=~"$instance", service_name=~"$service"} | json | level = `error`
Options:
- Order: Newest first
- Enable log details: Yes
- Show time: Yes
- Wrap lines: Yes
The trace_id field automatically becomes a clickable link to Tempo (configured via Loki derived fields).
Panel 13: Recent Warnings
| Setting | Value |
|---|---|
| Title | Recent Warnings |
| Type | Logs |
| Data source | Loki |
Query:
{instance_id=~"$instance", service_name=~"$service"} | json | level = `warn`
Options:
- Same as Recent Errors panel
Drill-Down Configuration
Metrics to Traces
To link from a metric panel to traces, add a Data Link:
- Edit the panel
- Go to Field tab → Data links → Add link
- Configure:
| Setting | Value |
|---|---|
| Title | View Traces |
| URL | See below |
URL Template for Tempo:
/explore?left={"datasource":"tempo","queries":[{"refId":"A","queryType":"traceql","query":"{resource.service.name=\"${__field.labels.job}\" && duration > 100ms}"}],"range":{"from":"${__from}","to":"${__to}"}}
Logs to Traces
Logs automatically link to traces via Loki's derived fields configuration. When you click on a log entry containing trace_id, it opens the trace in Tempo.
This is pre-configured in docker/observability/grafana/provisioning/datasources/datasources.yml:
derivedFields:
- datasourceUid: tempo
matcherRegex: '"trace_id":"(\\w+)"'
name: TraceID
url: "$${__value.raw}"
Table Row to Traces
For table panels (Slowest Endpoints, Failed Endpoints), use field-based data links:
/explore?left={"datasource":"tempo","queries":[{"refId":"A","queryType":"traceql","query":"{resource.service.name=\"$service\" && name=~\".*${__data.fields.path}.*\"}"}]}
Available field variables:
- ${__data.fields.path} - The path column value
- ${__data.fields.method} - The HTTP method
- ${__data.fields.status} - The status code
Quick Reference: All Queries
Stat Panels
# Requests per minute
sum(rate(appserver_http_requests_total{instance_id=~"$instance", job=~"$service"}[5m])) * 60
# Error rate percentage
sum(rate(appserver_http_requests_total{instance_id=~"$instance", job=~"$service", status_group="5xx"}[5m]))
/ sum(rate(appserver_http_requests_total{instance_id=~"$instance", job=~"$service"}[5m])) * 100
# Average latency
sum(rate(appserver_http_request_duration_ms_sum{instance_id=~"$instance", job=~"$service"}[5m]))
/ sum(rate(appserver_http_request_duration_ms_count{instance_id=~"$instance", job=~"$service"}[5m]))
# P95 latency
histogram_quantile(0.95,
sum(rate(appserver_http_request_duration_ms_bucket{instance_id=~"$instance", job=~"$service"}[5m])) by (le))
# Active requests
sum(appserver_http_requests_active{instance_id=~"$instance", job=~"$service"})
Time Series
# Request rate by service
sum(rate(appserver_http_requests_total{instance_id=~"$instance", job=~"$service"}[5m])) by (job)
# Error rate over time
sum(rate(appserver_http_requests_total{instance_id=~"$instance", job=~"$service", status_group="5xx"}[5m])) by (job)
Tables
# Slowest endpoints
topk(10, histogram_quantile(0.95,
sum(rate(appserver_http_request_duration_ms_bucket{instance_id=~"$instance", job=~"$service"}[5m])) by (le, path, method)))
# Failed endpoints
topk(10, sum(increase(appserver_http_requests_total{instance_id=~"$instance", job=~"$service", status_group="5xx"}[1h])) by (path, method))
# Errors by status
topk(10, sum(increase(appserver_http_requests_total{instance_id=~"$instance", job=~"$service", status_group=~"4xx|5xx"}[1h])) by (status))
Logs
# Recent errors
{instance_id=~"$instance", service_name=~"$service"} | json | level = `error`
# Recent warnings
{instance_id=~"$instance", service_name=~"$service"} | json | level = `warn`
# Errors with specific message
{instance_id=~"$instance", service_name=~"$service"} | json | level = `error` |= `connection refused`
# Slow requests (from logs)
{instance_id=~"$instance", service_name=~"$service"} | json | duration_ms > 1000
Alerting
Add alert rules to key panels for proactive monitoring.
High Error Rate Alert
- Edit the Error Rate stat panel
- Go to Alert tab → Create alert rule
| Setting | Value |
|---|---|
| Condition | WHEN last() OF A IS ABOVE 5 |
| Evaluate every | 1m |
| For | 5m |
| Alert name | High Error Rate |
| Summary | Error rate is $value% |
High Latency Alert
| Setting | Value |
|---|---|
| Condition | WHEN last() OF A IS ABOVE 1000 |
| Evaluate every | 1m |
| For | 5m |
| Alert name | High P95 Latency |
| Summary | P95 latency is $value ms |
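Instead of clicking through the UI, alert rules can also be provisioned from a file. Below is a rough sketch of the High Error Rate rule in Grafana's alert-provisioning format; the UIDs, folder name, and datasource UID are assumptions, and the exact model fields vary between Grafana versions, so export a UI-created rule and use that as the authoritative template. Note that dashboard variables such as $service are not available inside alert rules, so the query is written without them.

```yaml
# Sketch only -- verify against a rule exported from your Grafana version.
apiVersion: 1
groups:
  - orgId: 1
    name: appserver-alerts          # assumed evaluation group name
    folder: AppServer               # assumed folder
    interval: 1m
    rules:
      - uid: high-error-rate        # any stable uid
        title: High Error Rate
        condition: B                # refId of the threshold expression below
        for: 5m
        noDataState: OK
        execErrState: Error
        annotations:
          summary: AppServer 5xx error rate has been above 5% for 5 minutes
        data:
          - refId: A                # the error-rate query from the stat panel
            relativeTimeRange: { from: 300, to: 0 }
            datasourceUid: prometheus            # uid of the Prometheus datasource
            model:
              refId: A
              instant: true
              expr: >-
                sum(rate(appserver_http_requests_total{status_group="5xx"}[5m]))
                / sum(rate(appserver_http_requests_total[5m])) * 100
          - refId: B                # threshold expression: fire when A > 5
            relativeTimeRange: { from: 300, to: 0 }
            datasourceUid: __expr__
            model:
              refId: B
              type: threshold
              expression: A
              conditions:
                - type: query
                  evaluator: { type: gt, params: [5] }
                  operator: { type: and }
                  query: { params: [B] }
                  reducer: { type: last, params: [] }
```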
Dashboard JSON Export
After creating the dashboard, export it for version control:
- Go to Dashboard settings → JSON Model
- Copy the JSON
- Save to docker/observability/grafana/provisioning/dashboards/observability/appserver-dashboard.json
The dashboard will auto-load on next Grafana restart.
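If a dashboards provider is not already configured, the provisioning file that makes this auto-loading work looks roughly like the sketch below; the file name, provider name, folder, and container path are assumptions, so align them with the existing observability stack.

```yaml
# Assumed path: docker/observability/grafana/provisioning/dashboards/dashboards.yml
apiVersion: 1
providers:
  - name: appserver-dashboards     # arbitrary provider name
    folder: AppServer              # Grafana folder the dashboards appear in
    type: file
    disableDeletion: false
    updateIntervalSeconds: 30      # how often Grafana rescans the directory
    options:
      # directory as seen from inside the Grafana container
      path: /etc/grafana/provisioning/dashboards/observability
```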
Related Topics
- Grafana Integration - Query reference and basics
- Telemetry Configuration - Configure telemetry export
- Telemetry Infrastructure - Backend architecture
- Multi-Instance Telemetry - Architecture for monitoring multiple customer instances