Grafana Dashboard Guide

A complete guide to building a unified observability dashboard for monitoring all AppServer instances.

Overview

This guide provides a ready-to-use dashboard concept that combines metrics, logs, and traces into a single view. The dashboard enables you to:

  • Monitor all services at a glance
  • Filter by instance for multi-tenant deployments
  • Identify slow and failing endpoints
  • View recent errors and warnings
  • Click through to detailed traces for investigation

Multi-Instance Support: When running multiple AppServer instances (e.g., for different customers), the dashboard supports filtering by instance.id to view global or per-instance metrics. See Multi-Instance Telemetry for architecture details.

Dashboard Layout

┌─────────────────────────────────────────────────────────────────────────────┐
│ AppServer Observability Dashboard                                           │
│ [Instance: $instance ▼]  [Service: $service ▼]  [Time Range: Last 1h ▼]     │
├─────────────────────────────────────────────────────────────────────────────┤
│ OVERVIEW ROW                                                                │
├───────────────┬───────────────┬───────────────┬───────────────┬─────────────┤
│ Requests      │ Error Rate    │ Avg Latency   │ P95 Latency   │ Active      │
│ /min          │ %             │ ms            │ ms            │ Requests    │
│ 1,234         │ 0.12%         │ 45ms          │ 120ms         │ 42          │
│ ↑ 5.2%        │ ↓ 0.03%       │ ↑ 2ms         │ ↓ 5ms         │ → 0         │
├───────────────┴───────────────┴───────────────┴───────────────┴─────────────┤
│ REQUEST METRICS ROW                                                         │
├─────────────────────────────────────┬───────────────────────────────────────┤
│ Request Rate by Service             │ Response Time Distribution            │
│ ┌─────────────────────────────┐     │ ┌─────────────────────────────┐       │
│ │ ████████████████████        │     │ │ ▄▄▄                         │       │
│ │ ██████████████              │     │ │ █████▄                      │       │
│ │ ████████                    │     │ │ ████████▄▄                  │       │
│ │ ████                        │     │ │ ████████████▄▄▄___          │       │
│ └─────────────────────────────┘     │ └─────────────────────────────┘       │
│ appserver ─── orchestrator ───      │   0   50  100  200  500  1000 ms      │
├─────────────────────────────────────┴───────────────────────────────────────┤
│ ENDPOINTS ROW                                                               │
├─────────────────────────────────────┬───────────────────────────────────────┤
│ Slowest Endpoints (p95)             │ Failed Endpoints (5xx)                │
│ ┌─────────────────────────────────┐ │ ┌─────────────────────────────────┐   │
│ │ /api/reports/generate     850ms │ │ │ /api/webhooks/process    12 err │   │
│ │ /api/documents/export     620ms │ │ │ /api/sync/external        8 err │   │
│ │ /api/analytics/compute    480ms │ │ │ /api/payments/validate    3 err │   │
│ │ /graphql                  320ms │ │ │ /api/users/bulk           2 err │   │
│ │ /api/search               280ms │ │ │ /api/notifications        1 err │   │
│ └─────────────────────────────────┘ │ └─────────────────────────────────┘   │
│ [Click row → View traces]           │ [Click row → View traces]             │
├─────────────────────────────────────┴───────────────────────────────────────┤
│ ERRORS ROW                                                                  │
├─────────────────────────────────────┬───────────────────────────────────────┤
│ Error Rate Over Time                │ Errors by Status Code                 │
│ ┌─────────────────────────────┐     │ ┌─────────────────────────────┐       │
│ │            ╱╲               │     │ │ 500 ████████████████  156   │       │
│ │     ╱╲    ╱  ╲    ╱╲        │     │ │ 502 ████████           52   │       │
│ │ ___╱  ╲__╱    ╲__╱  ╲___    │     │ │ 503 █████              31   │       │
│ └─────────────────────────────┘     │ │ 504 ███                18   │       │
│ -1h          -30m          now      │ │ 400 ██                 12   │       │
│                                     │ └─────────────────────────────┘       │
├─────────────────────────────────────┴───────────────────────────────────────┤
│ LOGS ROW                                                                    │
├─────────────────────────────────────────────────────────────────────────────┤
│ Recent Errors (Last 100)                                                    │
│ ┌─────────────────────────────────────────────────────────────────────────┐ │
│ │ 14:32:15 [ERROR] appserver    Connection refused to database            │ │
│ │          trace_id=abc123...   [→ View Trace]                            │ │
│ │ 14:31:58 [ERROR] orchestrator Container health check failed             │ │
│ │          trace_id=def456...   [→ View Trace]                            │ │
│ │ 14:31:42 [WARN]  appserver    Slow query detected (2.3s)                │ │
│ │          trace_id=ghi789...   [→ View Trace]                            │ │
│ └─────────────────────────────────────────────────────────────────────────┘ │
├─────────────────────────────────────────────────────────────────────────────┤
│ Recent Warnings (Last 100)                                                  │
│ ┌─────────────────────────────────────────────────────────────────────────┐ │
│ │ 14:32:10 [WARN]  appserver    Rate limit approaching for tenant xyz     │ │
│ │ 14:31:55 [WARN]  orchestrator Memory usage above 80% threshold          │ │
│ │ 14:31:30 [WARN]  appserver    Deprecated API endpoint called            │ │
│ └─────────────────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────────────┘

Setup Instructions

Step 1: Create a New Dashboard

  1. Open Grafana at http://localhost:3000
  2. Click Dashboards → New → New Dashboard
  3. Click the gear icon (Dashboard settings)
  4. Set Name: AppServer Observability
  5. Set Auto-refresh: 30s
  6. Click Save

Step 2: Add Variables

Go to Dashboard settings → Variables → Add variable

Instance Variable (Multi-Instance Deployments)

Use this variable to filter by customer instance when running multiple AppServer deployments.

| Setting | Value |
| --- | --- |
| Name | instance |
| Type | Query |
| Data source | Prometheus |
| Query | label_values(appserver_http_requests_total, instance_id) |
| Multi-value | Yes |
| Include All option | Yes |
| All value | .* |

Note: The instance_id label is added by the OTEL Collector's resource processor. If you're running a single instance, this variable will have only one value. See Multi-Instance Telemetry for setup instructions.
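For reference, the label is typically attached with a resource processor in the Collector pipeline and then promoted to a metric label on export. The snippet below is a minimal sketch, not the project's actual config: the INSTANCE_ID environment variable and the Prometheus exporter's resource_to_telemetry_conversion option are assumptions here (the real setup is documented in Multi-Instance Telemetry).

processors:
  resource:
    attributes:
      - key: instance.id
        value: ${env:INSTANCE_ID}    # e.g. "customer-a", set per deployment
        action: upsert

exporters:
  prometheus:
    endpoint: "0.0.0.0:8889"
    resource_to_telemetry_conversion:
      enabled: true                  # surfaces instance.id as the instance_id metric label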

Service Variable

| Setting | Value |
| --- | --- |
| Name | service |
| Type | Query |
| Data source | Prometheus |
| Query | label_values(appserver_http_requests_total{instance_id=~"$instance"}, job) |
| Multi-value | Yes |
| Include All option | Yes |

Path Variable (for filtering)

| Setting | Value |
| --- | --- |
| Name | path |
| Type | Query |
| Data source | Prometheus |
| Query | label_values(appserver_http_requests_total{instance_id=~"$instance", job=~"$service"}, path) |
| Multi-value | Yes |
| Include All option | Yes |

Panel Configurations

Row 1: Overview Stats

Create a row named "Overview" with 5 stat panels.

Panel 1: Requests/min

| Setting | Value |
| --- | --- |
| Title | Requests/min |
| Type | Stat |
| Data source | Prometheus |

Query:

sum(rate(appserver_http_requests_total{instance_id=~"$instance", job=~"$service"}[5m])) * 60

Options:

  • Unit: short
  • Color mode: Background
  • Graph mode: Area
  • Thresholds: 0 (green)

Panel 2: Error Rate

| Setting | Value |
| --- | --- |
| Title | Error Rate |
| Type | Stat |
| Data source | Prometheus |

Query:

sum(rate(appserver_http_requests_total{instance_id=~"$instance", job=~"$service", status_group="5xx"}[5m]))
/
sum(rate(appserver_http_requests_total{instance_id=~"$instance", job=~"$service"}[5m]))
* 100

Options:

  • Unit: percent (0-100)
  • Color mode: Background
  • Thresholds: 0 (green), 1 (yellow), 5 (red)

Panel 3: Avg Latency

| Setting | Value |
| --- | --- |
| Title | Avg Latency |
| Type | Stat |
| Data source | Prometheus |

Query:

sum(rate(appserver_http_request_duration_ms_sum{instance_id=~"$instance", job=~"$service"}[5m]))
/
sum(rate(appserver_http_request_duration_ms_count{instance_id=~"$instance", job=~"$service"}[5m]))

Options:

  • Unit: ms
  • Color mode: Background
  • Thresholds: 0 (green), 100 (yellow), 500 (red)

Panel 4: P95 Latency

| Setting | Value |
| --- | --- |
| Title | P95 Latency |
| Type | Stat |
| Data source | Prometheus |

Query:

histogram_quantile(0.95,
  sum(rate(appserver_http_request_duration_ms_bucket{instance_id=~"$instance", job=~"$service"}[5m])) by (le)
)

Options:

  • Unit: ms
  • Color mode: Background
  • Thresholds: 0 (green), 200 (yellow), 1000 (red)

Panel 5: Active Requests

| Setting | Value |
| --- | --- |
| Title | Active Requests |
| Type | Stat |
| Data source | Prometheus |

Query:

sum(appserver_http_requests_active{instance_id=~"$instance", job=~"$service"})

Options:

  • Unit: short
  • Color mode: Background
  • Thresholds: 0 (green)

Row 1b: Instance Health (Multi-Instance Only)

When monitoring multiple instances, add this row to see health status across all deployments.

Panel: Instance Health Overview

| Setting | Value |
| --- | --- |
| Title | Instance Health Overview |
| Type | Table |
| Data source | Prometheus |

Query (Error Rate per Instance):

sum by (instance_id, instance_name) (
  rate(appserver_http_requests_total{status_group="5xx"}[5m])
)
/
sum by (instance_id, instance_name) (
  rate(appserver_http_requests_total[5m])
) * 100

Transformations:

  1. Labels to fields
  2. Organize fields: Show instance_id, instance_name, Value
  3. Rename Value → Error Rate %

Options:

  • Cell display mode for Error Rate %: Color background (gradient)
  • Thresholds: 0 (green), 1 (yellow), 5 (red)

Data Links:

  • Title: View Instance
  • URL: /d/current?var-instance=${__data.fields.instance_id}

Panel: Requests by Instance

| Setting | Value |
| --- | --- |
| Title | Requests by Instance |
| Type | Time series |
| Data source | Prometheus |

Query:

sum(rate(appserver_http_requests_total[5m])) by (instance_id)

Legend: {{instance_id}}

Options:

  • Line interpolation: Smooth
  • Fill opacity: 10

Row 2: Request Metrics

Create a row named "Request Metrics" with 2 panels.

Panel 6: Request Rate by Service

| Setting | Value |
| --- | --- |
| Title | Request Rate by Service |
| Type | Time series |
| Data source | Prometheus |

Query:

sum(rate(appserver_http_requests_total{instance_id=~"$instance", job=~"$service"}[5m])) by (job)

Legend: {{job}}

Options:

  • Line interpolation: Smooth
  • Fill opacity: 10
  • Show points: Never

Panel 7: Response Time Distribution

| Setting | Value |
| --- | --- |
| Title | Response Time Distribution |
| Type | Histogram |
| Data source | Prometheus |

Query:

sum(increase(appserver_http_request_duration_ms_bucket{instance_id=~"$instance", job=~"$service"}[$__rate_interval])) by (le)

Options:

  • Bucket size: Auto
  • Combine histogram: Yes

Row 3: Endpoints

Create a row named "Endpoints" with 2 table panels.

Panel 8: Slowest Endpoints (p95)

| Setting | Value |
| --- | --- |
| Title | Slowest Endpoints (p95) |
| Type | Table |
| Data source | Prometheus |

Query:

topk(10,
  histogram_quantile(0.95,
    sum(rate(appserver_http_request_duration_ms_bucket{instance_id=~"$instance", job=~"$service"}[5m])) by (le, path, method)
  )
)

Transformations:

  1. Labels to fields
  2. Organize fields: Show path, method, Value
  3. Rename Value → P95 (ms)

Options:

  • Column width: path = 300px
  • Unit for Value: ms

Data Links (for drill-down to traces):

  • Title: View Traces
  • URL: /explore?left={"datasource":"tempo","queries":[{"refId":"A","queryType":"traceql","query":"{resource.service.name=\"$service\" && name=~\".*${__data.fields.path}.*\"}"}]}

Panel 9: Failed Endpoints (5xx)

| Setting | Value |
| --- | --- |
| Title | Failed Endpoints (5xx) |
| Type | Table |
| Data source | Prometheus |

Query:

topk(10,
  sum(increase(appserver_http_requests_total{instance_id=~"$instance", job=~"$service", status_group="5xx"}[1h])) by (path, method)
)

Transformations:

  1. Labels to fields
  2. Organize fields: Show path, method, Value
  3. Rename Value → Errors

Options:

  • Column width: path = 300px
  • Cell display mode for Errors: Color background
  • Thresholds: 1 (yellow), 10 (red)

Data Links:

  • Title: View Error Traces
  • URL: /explore?left={"datasource":"tempo","queries":[{"refId":"A","queryType":"traceql","query":"{resource.service.name=\"$service\" && name=~\".*${__data.fields.path}.*\" && status=error}"}]}

Row 4: Errors

Create a row named "Errors" with 2 panels.

Panel 10: Error Rate Over Time

| Setting | Value |
| --- | --- |
| Title | Error Rate Over Time |
| Type | Time series |
| Data source | Prometheus |

Query A (5xx errors):

sum(rate(appserver_http_requests_total{instance_id=~"$instance", job=~"$service", status_group="5xx"}[5m])) by (job)

Legend: {{job}} - 5xx

Query B (4xx errors):

sum(rate(appserver_http_requests_total{instance_id=~"$instance", job=~"$service", status_group="4xx"}[5m])) by (job)

Legend: {{job}} - 4xx

Options:

  • Color: Query A = red, Query B = orange
  • Fill opacity: 20

Panel 11: Errors by Status Code

| Setting | Value |
| --- | --- |
| Title | Errors by Status Code |
| Type | Bar gauge |
| Data source | Prometheus |

Query:

topk(10,
  sum(increase(appserver_http_requests_total{instance_id=~"$instance", job=~"$service", status_group=~"4xx|5xx"}[1h])) by (status)
)

Options:

  • Orientation: Horizontal
  • Display mode: Gradient
  • Color: Red
  • Text mode: Value and name

Row 5: Logs

Create a row named "Logs" with 2 log panels.

Panel 12: Recent Errors

| Setting | Value |
| --- | --- |
| Title | Recent Errors |
| Type | Logs |
| Data source | Loki |

Query:

{instance_id=~"$instance", service_name=~"$service"} | json | level = `error`

Options:

  • Order: Newest first
  • Enable log details: Yes
  • Show time: Yes
  • Wrap lines: Yes

The trace_id field automatically becomes a clickable link to Tempo (configured via Loki derived fields).

Panel 13: Recent Warnings

| Setting | Value |
| --- | --- |
| Title | Recent Warnings |
| Type | Logs |
| Data source | Loki |

Query:

{instance_id=~"$instance", service_name=~"$service"} | json | level = `warn`

Options:

  • Same as Recent Errors panel

Drill-Down Configuration

Metrics to Traces

To link from a metric panel to traces, add a Data Link:

  1. Edit the panel
  2. Go to the Field tab → Data links → Add link
  3. Configure:

| Setting | Value |
| --- | --- |
| Title | View Traces |
| URL | See below |

URL Template for Tempo:

/explore?left={"datasource":"tempo","queries":[{"refId":"A","queryType":"traceql","query":"{resource.service.name=\"${__field.labels.job}\" && duration > 100ms}"}],"range":{"from":"${__from}","to":"${__to}"}}

Logs to Traces

Logs automatically link to traces via Loki's derived fields configuration. When you click on a log entry containing trace_id, it opens the trace in Tempo.

This is pre-configured in docker/observability/grafana/provisioning/datasources/datasources.yml:

derivedFields:
  - datasourceUid: tempo
    matcherRegex: '"trace_id":"(\w+)"'
    name: TraceID
    url: "$${__value.raw}"

Table Row to Traces

For table panels (Slowest Endpoints, Failed Endpoints), use field-based data links:

/explore?left={"datasource":"tempo","queries":[{"refId":"A","queryType":"traceql","query":"{resource.service.name=\"$service\" && name=~\".*${__data.fields.path}.*\"}"}]}

Available field variables:

  • ${__data.fields.path} - The path column value
  • ${__data.fields.method} - The HTTP method
  • ${__data.fields.status} - The status code

Quick Reference: All Queries

Stat Panels

# Requests per minute
sum(rate(appserver_http_requests_total{instance_id=~"$instance", job=~"$service"}[5m])) * 60

# Error rate percentage
sum(rate(appserver_http_requests_total{instance_id=~"$instance", job=~"$service", status_group="5xx"}[5m]))
/ sum(rate(appserver_http_requests_total{instance_id=~"$instance", job=~"$service"}[5m])) * 100

# Average latency
sum(rate(appserver_http_request_duration_ms_sum{instance_id=~"$instance", job=~"$service"}[5m]))
/ sum(rate(appserver_http_request_duration_ms_count{instance_id=~"$instance", job=~"$service"}[5m]))

# P95 latency
histogram_quantile(0.95,
  sum(rate(appserver_http_request_duration_ms_bucket{instance_id=~"$instance", job=~"$service"}[5m])) by (le))

# Active requests
sum(appserver_http_requests_active{instance_id=~"$instance", job=~"$service"})

Time Series

# Request rate by service
sum(rate(appserver_http_requests_total{instance_id=~"$instance", job=~"$service"}[5m])) by (job)

# Error rate over time
sum(rate(appserver_http_requests_total{instance_id=~"$instance", job=~"$service", status_group="5xx"}[5m])) by (job)

Tables

# Slowest endpoints
topk(10, histogram_quantile(0.95,
  sum(rate(appserver_http_request_duration_ms_bucket{instance_id=~"$instance", job=~"$service"}[5m])) by (le, path, method)))

# Failed endpoints
topk(10, sum(increase(appserver_http_requests_total{instance_id=~"$instance", job=~"$service", status_group="5xx"}[1h])) by (path, method))

# Errors by status
topk(10, sum(increase(appserver_http_requests_total{instance_id=~"$instance", job=~"$service", status_group=~"4xx|5xx"}[1h])) by (status))

Logs

# Recent errors
{instance_id=~"$instance", service_name=~"$service"} | json | level = `error`

# Recent warnings
{instance_id=~"$instance", service_name=~"$service"} | json | level = `warn`

# Errors with specific message
{instance_id=~"$instance", service_name=~"$service"} | json | level = `error` |= `connection refused`

# Slow requests (from logs)
{instance_id=~"$instance", service_name=~"$service"} | json | duration_ms > 1000

Alerting

Add alert rules to key panels for proactive monitoring.

High Error Rate Alert

  1. Edit the Error Rate stat panel
  2. Go to Alert tab → Create alert rule

| Setting | Value |
| --- | --- |
| Condition | WHEN last() OF A IS ABOVE 5 |
| Evaluate every | 1m |
| For | 5m |
| Alert name | High Error Rate |
| Summary | Error rate is $value% |

High Latency Alert

| Setting | Value |
| --- | --- |
| Condition | WHEN last() OF A IS ABOVE 1000 |
| Evaluate every | 1m |
| For | 5m |
| Alert name | High P95 Latency |
| Summary | P95 latency is $value ms |
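
If you prefer to keep alert rules in version control rather than clicking through the UI, Grafana's alerting file provisioning can express the same High Error Rate rule declaratively. The following is an untested sketch: the Prometheus datasource UID prometheus and a file location under docker/observability/grafana/provisioning/alerting/ are assumptions, and the field names should be verified against your Grafana version.

apiVersion: 1
groups:
  - orgId: 1
    name: appserver-alerts
    folder: AppServer
    interval: 1m
    rules:
      - uid: high-error-rate
        title: High Error Rate
        condition: C                    # fires on the threshold expression below
        for: 5m
        annotations:
          summary: Error rate is above 5%
        data:
          - refId: A                    # raw error-rate percentage from Prometheus
            relativeTimeRange:
              from: 600
              to: 0
            datasourceUid: prometheus   # assumed datasource UID
            model:
              refId: A
              expr: sum(rate(appserver_http_requests_total{status_group="5xx"}[5m])) / sum(rate(appserver_http_requests_total[5m])) * 100
          - refId: B                    # reduce the range query to its latest value
            datasourceUid: __expr__
            model:
              refId: B
              type: reduce
              reducer: last
              expression: A
          - refId: C                    # alert when the reduced value exceeds 5
            datasourceUid: __expr__
            model:
              refId: C
              type: threshold
              expression: B
              conditions:
                - evaluator:
                    type: gt
                    params: [5]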

Dashboard JSON Export

After creating the dashboard, export it for version control:

  1. Go to Dashboard settings → JSON Model
  2. Copy the JSON
  3. Save to docker/observability/grafana/provisioning/dashboards/observability/appserver-dashboard.json

The dashboard will auto-load on next Grafana restart.
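
Auto-loading works because a dashboard provider scans that directory on startup. If you need to add or adjust the provider, the entry looks roughly like this (a sketch; the provider name and container path are assumptions based on the repository layout above):

apiVersion: 1
providers:
  - name: observability
    orgId: 1
    folder: AppServer
    type: file
    updateIntervalSeconds: 30
    options:
      path: /etc/grafana/provisioning/dashboards/observability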