Grafana Dashboard Guide
A complete guide to building a unified observability dashboard for monitoring all AppServer instances.
Overview
This guide provides a ready-to-use dashboard concept that combines metrics, logs, and traces into a single view. The dashboard enables you to:
- Monitor all services at a glance
- Filter by instance for multi-tenant deployments
- Identify slow and failing endpoints
- View recent errors and warnings
- Click through to detailed traces for investigation
Multi-Instance Support: When running multiple AppServer instances (e.g., for different customers), the dashboard supports filtering by instance.id to view global or per-instance metrics. See Multi-Instance Telemetry for architecture details.
Dashboard Layout
┌─────────────────────────────────────────────────────────────────────────────┐
│ AppServer Observability Dashboard │
│ [Instance: $instance ▼] [Service: $service ▼] [Time Range: Last 1h ▼] │
├─────────────────────────────────────────────────────────────────────────────┤
│ OVERVIEW ROW │
├───────────────┬───────────────┬───────────────┬───────────────┬─────────────┤
│ Requests │ Error Rate │ Avg Latency │ P95 Latency │ Active │
│ /min │ % │ ms │ ms │ Requests │
│ 1,234 │ 0.12% │ 45ms │ 120ms │ 42 │
│ ↑ 5.2% │ ↓ 0.03% │ ↑ 2ms │ ↓ 5ms │ → 0 │
├───────────────┴───────────────┴───────────────┴───────────────┴─────────────┤
│ REQUEST METRICS ROW │
├─────────────────────────────────────┬───────────────────────────────────────┤
│ Request Rate by Service │ Response Time Distribution │
│ ┌─────────────────────────────┐ │ ┌─────────────────────────────┐ │
│ │ ████████████████████ │ │ │ ▄▄▄ │ │
│ │ ██████████████ │ │ │ █████▄ │ │
│ │ ████████ │ │ │ ████████▄▄ │ │
│ │ ████ │ │ │ ████████████▄▄▄___ │ │
│ └─────────────────────────────┘ │ └─────────────────────────────┘ │
│ appserver ─── orchestrator ─── │ 0 50 100 200 500 1000 ms │
├─────────────────────────────────────┴───────────────────────────────────────┤
│ ENDPOINTS ROW │
├─────────────────────────────────────┬───────────────────────────────────────┤
│ Slowest Endpoints (p95) │ Failed Endpoints (5xx) │
│ ┌─────────────────────────────────┐│ ┌─────────────────────────────────┐ │
│ │ /api/reports/generate 850ms ││ │ /api/webhooks/process 12 err │ │
│ │ /api/documents/export 620ms ││ │ /api/sync/external 8 err │ │
│ │ /api/analytics/compute 480ms ││ │ /api/payments/validate 3 err │ │
│ │ /graphql 320ms ││ │ /api/users/bulk 2 err │ │
│ │ /api/search 280ms ││ │ /api/notifications 1 err │ │
│ └─────────────────────────────────┘│ └─────────────────────────────────┘ │
│ [Click row → View traces] │ [Click row → View traces] │
├─────────────────────────────────────┴───────────────────────────────────────┤
│ ERRORS ROW │
├─────────────────────────────────────┬───────────────────────────────────────┤
│ Error Rate Over Time │ Errors by Status Code │
│ ┌─────────────────────────────┐ │ ┌─────────────────────────────┐ │
│ │ ╱╲ │ │ │ 500 ████████████████ 156 │ │
│ │ ╱╲ ╱ ╲ ╱╲ │ │ │ 502 ████████ 52 │ │
│ │ __╱ ╲_╱ ╲__╱ ╲___ │ │ │ 503 █████ 31 │ │
│ └─────────────────────────────┘ │ │ 504 ███ 18 │ │
│ -1h -30m now │ │ 400 ██ 12 │ │
│ │ └─────────────────────────────┘ │
├─────────────────────────────────────┴───────────────────────────────────────┤
│ LOGS ROW │
├─────────────────────────────────────────────────────────────────────────────┤
│ Recent Errors (Last 100) │
│ ┌─────────────────────────────────────────────────────────────────────────┐│
│ │ 14:32:15 [ERROR] appserver Connection refused to database ││
│ │ trace_id=abc123... [→ View Trace] ││
│ │ 14:31:58 [ERROR] orchestrator Container health check failed ││
│ │ trace_id=def456... [→ View Trace] ││
│ │ 14:31:42 [WARN] appserver Slow query detected (2.3s) ││
│ │ trace_id=ghi789... [→ View Trace] ││
│ └─────────────────────────────────────────────────────────────────────────┘│
├─────────────────────────────────────────────────────────────────────────────┤
│ Recent Warnings (Last 100) │
│ ┌─────────────────────────────────────────────────────────────────────────┐│
│ │ 14:32:10 [WARN] appserver Rate limit approaching for tenant xyz ││
│ │ 14:31:55 [WARN] orchestrator Memory usage above 80% threshold ││
│ │ 14:31:30 [WARN] appserver Deprecated API endpoint called ││
│ └─────────────────────────────────────────────────────────────────────────┘│
└─────────────────────────────────────────────────────────────────────────────┘
Setup Instructions
Step 1: Create a New Dashboard
- Open Grafana at http://localhost:3000
- Click Dashboards → New → New Dashboard
- Click the gear icon (Dashboard settings)
- Set Name: AppServer Observability
- Set Auto-refresh: 30s
- Click Save
Step 2: Add Variables
Go to Dashboard settings → Variables → Add variable
Instance Variable (Multi-Instance Deployments)
Use this variable to filter by customer instance when running multiple AppServer deployments.
| Setting | Value |
|---|---|
| Name | instance |
| Type | Query |
| Data source | Prometheus |
| Query | label_values(appserver_http_requests_total, instance_id) |
| Multi-value | Yes |
| Include All option | Yes |
| All value | .* |
Note: The instance_id label is added by the OTEL Collector's resource processor. If you're running a single instance, this variable will have only one value. See Multi-Instance Telemetry for setup instructions.
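For orientation, that resource-processor entry in the Collector config typically looks something like the fragment below; the attribute key, environment variable, and pipeline wiring here are assumptions, so treat it as a sketch and follow Multi-Instance Telemetry for the actual configuration.

```yaml
# Assumed fragment of the OTEL Collector config: the resource processor
# stamps every metric, log, and trace with the instance identity.
processors:
  resource:
    attributes:
      - key: instance.id            # surfaces in Prometheus as the instance_id label
        value: ${env:INSTANCE_ID}   # hypothetical env var carrying the deployment id
        action: upsert
# The processor must also be listed under service.pipelines for each
# pipeline (metrics, logs, traces) to take effect.
```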
Service Variable
| Setting | Value |
|---|---|
| Name | service |
| Type | Query |
| Data source | Prometheus |
| Query | label_values(appserver_http_requests_total{instance_id=~"$instance"}, job) |
| Multi-value | Yes |
| Include All option | Yes |
Path Variable (for filtering)
| Setting | Value |
|---|---|
| Name | path |
| Type | Query |
| Data source | Prometheus |
| Query | label_values(appserver_http_requests_total{instance_id=~"$instance", job=~"$service"}, path) |
| Multi-value | Yes |
| Include All option | Yes |
Panel Configurations
Row 1: Overview Stats
Create a row named "Overview" with 5 stat panels.
Panel 1: Requests/min
| Setting | Value |
|---|---|
| Title | Requests/min |
| Type | Stat |
| Data source | Prometheus |
Query:
sum(rate(appserver_http_requests_total{instance_id=~"$instance", job=~"$service"}[5m])) * 60
Options:
- Unit: short
- Color mode: Background
- Graph mode: Area
- Thresholds: 0 (green)
Panel 2: Error Rate
| Setting | Value |
|---|---|
| Title | Error Rate |
| Type | Stat |
| Data source | Prometheus |
Query:
sum(rate(appserver_http_requests_total{instance_id=~"$instance", job=~"$service", status_group="5xx"}[5m]))
/
sum(rate(appserver_http_requests_total{instance_id=~"$instance", job=~"$service"}[5m]))
* 100
Options:
- Unit: percent (0-100)
- Color mode: Background
- Thresholds: 0 (green), 1 (yellow), 5 (red)
Panel 3: Avg Latency
| Setting | Value |
|---|---|
| Title | Avg Latency |
| Type | Stat |
| Data source | Prometheus |
Query:
sum(rate(appserver_http_request_duration_ms_sum{instance_id=~"$instance", job=~"$service"}[5m]))
/
sum(rate(appserver_http_request_duration_ms_count{instance_id=~"$instance", job=~"$service"}[5m]))
Options:
- Unit: ms
- Color mode: Background
- Thresholds: 0 (green), 100 (yellow), 500 (red)
Panel 4: P95 Latency
| Setting | Value |
|---|---|
| Title | P95 Latency |
| Type | Stat |
| Data source | Prometheus |
Query:
histogram_quantile(0.95,
sum(rate(appserver_http_request_duration_ms_bucket{instance_id=~"$instance", job=~"$service"}[5m])) by (le)
)
Options:
- Unit: ms
- Color mode: Background
- Thresholds: 0 (green), 200 (yellow), 1000 (red)
Panel 5: Active Requests
| Setting | Value |
|---|---|
| Title | Active Requests |
| Type | Stat |
| Data source | Prometheus |
Query:
sum(appserver_http_requests_active{instance_id=~"$instance", job=~"$service"})
Options:
- Unit: short
- Color mode: Background
- Thresholds: 0 (green)
Row 1b: Instance Health (Multi-Instance Only)
When monitoring multiple instances, add this row to see health status across all deployments.
Panel: Instance Health Overview
| Setting | Value |
|---|---|
| Title | Instance Health Overview |
| Type | Table |
| Data source | Prometheus |
Query (Error Rate per Instance):
sum by (instance_id, instance_name) (
rate(appserver_http_requests_total{status_group="5xx"}[5m])
)
/
sum by (instance_id, instance_name) (
rate(appserver_http_requests_total[5m])
) * 100
Transformations:
- Labels to fields
- Organize fields: Show instance_id, instance_name, Value
- Rename Value → Error Rate %
Options:
- Cell display mode for Error Rate %: Color background (gradient)
- Thresholds: 0 (green), 1 (yellow), 5 (red)
Data Links:
- Title: View Instance
- URL: /d/current?var-instance=${__data.fields.instance_id}
Panel: Requests by Instance
| Setting | Value |
|---|---|
| Title | Requests by Instance |
| Type | Time series |
| Data source | Prometheus |
Query:
sum(rate(appserver_http_requests_total[5m])) by (instance_id)
Legend: {{instance_id}}
Options:
- Line interpolation: Smooth
- Fill opacity: 10
Row 2: Request Metrics
Create a row named "Request Metrics" with 2 panels.
Panel 6: Request Rate by Service
| Setting | Value |
|---|---|
| Title | Request Rate by Service |
| Type | Time series |
| Data source | Prometheus |
Query:
sum(rate(appserver_http_requests_total{instance_id=~"$instance", job=~"$service"}[5m])) by (job)
Legend: {{job}}
Options:
- Line interpolation: Smooth
- Fill opacity: 10
- Show points: Never
Panel 7: Response Time Distribution
| Setting | Value |
|---|---|
| Title | Response Time Distribution |
| Type | Histogram |
| Data source | Prometheus |
Query:
sum(increase(appserver_http_request_duration_ms_bucket{instance_id=~"$instance", job=~"$service"}[$__rate_interval])) by (le)
Options:
- Bucket size: Auto
- Combine histogram: Yes
Row 3: Endpoints
Create a row named "Endpoints" with 2 table panels.
Panel 8: Slowest Endpoints (p95)
| Setting | Value |
|---|---|
| Title | Slowest Endpoints (p95) |
| Type | Table |
| Data source | Prometheus |
Query:
topk(10,
histogram_quantile(0.95,
sum(rate(appserver_http_request_duration_ms_bucket{instance_id=~"$instance", job=~"$service"}[5m])) by (le, path, method)
)
)
Transformations:
- Labels to fields
- Organize fields: Show path, method, Value
- Rename Value → P95 (ms)
Options:
- Column width: path = 300px
- Unit for Value: ms
Data Links (for drill-down to traces):
- Title: View Traces
- URL: /explore?left={"datasource":"tempo","queries":[{"refId":"A","queryType":"traceql","query":"{resource.service.name=\"$service\" && name=~\".*${__data.fields.path}.*\"}"}]}
Panel 9: Failed Endpoints (5xx)
| Setting | Value |
|---|---|
| Title | Failed Endpoints (5xx) |
| Type | Table |
| Data source | Prometheus |
Query:
topk(10,
sum(increase(appserver_http_requests_total{instance_id=~"$instance", job=~"$service", status_group="5xx"}[1h])) by (path, method)
)
Transformations:
- Labels to fields
- Organize fields: Show path, method, Value
- Rename Value → Errors
Options:
- Column width: path = 300px
- Cell display mode for Errors: Color background
- Thresholds: 1 (yellow), 10 (red)
Data Links:
- Title: View Error Traces
- URL: /explore?left={"datasource":"tempo","queries":[{"refId":"A","queryType":"traceql","query":"{resource.service.name=\"$service\" && name=~\".*${__data.fields.path}.*\" && status=error}"}]}
Row 4: Errors
Create a row named "Errors" with 2 panels.
Panel 10: Error Rate Over Time
| Setting | Value |
|---|---|
| Title | Error Rate Over Time |
| Type | Time series |
| Data source | Prometheus |
Query A (5xx errors):
sum(rate(appserver_http_requests_total{instance_id=~"$instance", job=~"$service", status_group="5xx"}[5m])) by (job)
Legend: {{job}} - 5xx
Query B (4xx errors):
sum(rate(appserver_http_requests_total{instance_id=~"$instance", job=~"$service", status_group="4xx"}[5m])) by (job)
Legend: {{job}} - 4xx
Options:
- Color: Query A = red, Query B = orange
- Fill opacity: 20
Panel 11: Errors by Status Code
| Setting | Value |
|---|---|
| Title | Errors by Status Code |
| Type | Bar gauge |
| Data source | Prometheus |
Query:
topk(10,
sum(increase(appserver_http_requests_total{instance_id=~"$instance", job=~"$service", status_group=~"4xx|5xx"}[1h])) by (status)
)
Options:
- Orientation: Horizontal
- Display mode: Gradient
- Color: Red
- Text mode: Value and name
Row 5: Logs
Create a row named "Logs" with 2 log panels.
Panel 12: Recent Errors
| Setting | Value |
|---|---|
| Title | Recent Errors |
| Type | Logs |
| Data source | Loki |
Query:
{instance_id=~"$instance", service_name=~"$service"} | json | level = `error`
Options:
- Order: Newest first
- Enable log details: Yes
- Show time: Yes
- Wrap lines: Yes
The trace_id field automatically becomes a clickable link to Tempo (configured via Loki derived fields).
Panel 13: Recent Warnings
| Setting | Value |
|---|---|
| Title | Recent Warnings |
| Type | Logs |
| Data source | Loki |
Query:
{instance_id=~"$instance", service_name=~"$service"} | json | level = `warn`
Options:
- Same as Recent Errors panel
Drill-Down Configuration
Metrics to Traces
To link from a metric panel to traces, add a Data Link:
- Edit the panel
- Go to Field tab → Data links → Add link
- Configure:
| Setting | Value |
|---|---|
| Title | View Traces |
| URL | See below |
URL Template for Tempo:
/explore?left={"datasource":"tempo","queries":[{"refId":"A","queryType":"traceql","query":"{resource.service.name=\"${__field.labels.job}\" && duration > 100ms}"}],"range":{"from":"${__from}","to":"${__to}"}}
Logs to Traces
Logs automatically link to traces via Loki's derived fields configuration. When you click on a log entry containing trace_id, it opens the trace in Tempo.
This is pre-configured in docker/observability/grafana/provisioning/datasources/datasources.yml:
derivedFields:
- datasourceUid: tempo
matcherRegex: '"trace_id":"(\\w+)"'
name: TraceID
url: "$${__value.raw}"
Table Row to Traces
For table panels (Slowest Endpoints, Failed Endpoints), use field-based data links:
/explore?left={"datasource":"tempo","queries":[{"refId":"A","queryType":"traceql","query":"{resource.service.name=\"$service\" && name=~\".*${__data.fields.path}.*\"}"}]}
Available field variables:
- ${__data.fields.path} - The path column value
- ${__data.fields.method} - The HTTP method
- ${__data.fields.status} - The status code
Quick Reference: All Queries
Stat Panels
# Requests per minute
sum(rate(appserver_http_requests_total{instance_id=~"$instance", job=~"$service"}[5m])) * 60
# Error rate percentage
sum(rate(appserver_http_requests_total{instance_id=~"$instance", job=~"$service", status_group="5xx"}[5m]))
/ sum(rate(appserver_http_requests_total{instance_id=~"$instance", job=~"$service"}[5m])) * 100
# Average latency
sum(rate(appserver_http_request_duration_ms_sum{instance_id=~"$instance", job=~"$service"}[5m]))
/ sum(rate(appserver_http_request_duration_ms_count{instance_id=~"$instance", job=~"$service"}[5m]))
# P95 latency
histogram_quantile(0.95,
sum(rate(appserver_http_request_duration_ms_bucket{instance_id=~"$instance", job=~"$service"}[5m])) by (le))
# Active requests
sum(appserver_http_requests_active{instance_id=~"$instance", job=~"$service"})
Time Series
# Request rate by service
sum(rate(appserver_http_requests_total{instance_id=~"$instance", job=~"$service"}[5m])) by (job)
# Error rate over time
sum(rate(appserver_http_requests_total{instance_id=~"$instance", job=~"$service", status_group="5xx"}[5m])) by (job)
Tables
# Slowest endpoints
topk(10, histogram_quantile(0.95,
sum(rate(appserver_http_request_duration_ms_bucket{instance_id=~"$instance", job=~"$service"}[5m])) by (le, path, method)))
# Failed endpoints
topk(10, sum(increase(appserver_http_requests_total{instance_id=~"$instance", job=~"$service", status_group="5xx"}[1h])) by (path, method))
# Errors by status
topk(10, sum(increase(appserver_http_requests_total{instance_id=~"$instance", job=~"$service", status_group=~"4xx|5xx"}[1h])) by (status))
Logs
# Recent errors
{instance_id=~"$instance", service_name=~"$service"} | json | level = `error`
# Recent warnings
{instance_id=~"$instance", service_name=~"$service"} | json | level = `warn`
# Errors with specific message
{instance_id=~"$instance", service_name=~"$service"} | json | level = `error` |= `connection refused`
# Slow requests (from logs)
{instance_id=~"$instance", service_name=~"$service"} | json | duration_ms > 1000
Alerting
Add alert rules to key panels for proactive monitoring.
High Error Rate Alert
- Edit the Error Rate stat panel
- Go to Alert tab → Create alert rule
| Setting | Value |
|---|---|
| Condition | WHEN last() OF A IS ABOVE 5 |
| Evaluate every | 1m |
| For | 5m |
| Alert name | High Error Rate |
| Summary | Error rate is $value% |
High Latency Alert
| Setting | Value |
|---|---|
| Condition | WHEN last() OF A IS ABOVE 1000 |
| Evaluate every | 1m |
| For | 5m |
| Alert name | High P95 Latency |
| Summary | P95 latency is $value ms |
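Instead of clicking through the UI, alert rules can also be provisioned from a file. Below is a rough sketch of the High Error Rate rule in Grafana's alert-provisioning format; the UIDs, folder name, and datasource UID are assumptions, and the exact model fields vary between Grafana versions, so export a UI-created rule and use that as the authoritative template. Note that dashboard variables such as $service are not available inside alert rules, so the query is written without them.

```yaml
# Sketch only -- verify against a rule exported from your Grafana version.
apiVersion: 1
groups:
  - orgId: 1
    name: appserver-alerts          # assumed evaluation group name
    folder: AppServer               # assumed folder
    interval: 1m
    rules:
      - uid: high-error-rate        # any stable uid
        title: High Error Rate
        condition: B                # refId of the threshold expression below
        for: 5m
        noDataState: OK
        execErrState: Error
        annotations:
          summary: AppServer 5xx error rate has been above 5% for 5 minutes
        data:
          - refId: A                # the error-rate query from the stat panel
            relativeTimeRange: { from: 300, to: 0 }
            datasourceUid: prometheus            # uid of the Prometheus datasource
            model:
              refId: A
              instant: true
              expr: >-
                sum(rate(appserver_http_requests_total{status_group="5xx"}[5m]))
                / sum(rate(appserver_http_requests_total[5m])) * 100
          - refId: B                # threshold expression: fire when A > 5
            relativeTimeRange: { from: 300, to: 0 }
            datasourceUid: __expr__
            model:
              refId: B
              type: threshold
              expression: A
              conditions:
                - type: query
                  evaluator: { type: gt, params: [5] }
                  operator: { type: and }
                  query: { params: [B] }
                  reducer: { type: last, params: [] }
```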
Dashboard JSON Export
After creating the dashboard, export it for version control:
- Go to Dashboard settings → JSON Model
- Copy the JSON
- Save to docker/observability/grafana/provisioning/dashboards/observability/appserver-dashboard.json
The dashboard will auto-load on next Grafana restart.
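If a dashboards provider is not already configured, the provisioning file that makes this auto-loading work looks roughly like the sketch below; the file name, provider name, folder, and container path are assumptions, so align them with the existing observability stack.

```yaml
# Assumed path: docker/observability/grafana/provisioning/dashboards/dashboards.yml
apiVersion: 1
providers:
  - name: appserver-dashboards     # arbitrary provider name
    folder: AppServer              # Grafana folder the dashboards appear in
    type: file
    disableDeletion: false
    updateIntervalSeconds: 30      # how often Grafana rescans the directory
    options:
      # directory as seen from inside the Grafana container
      path: /etc/grafana/provisioning/dashboards/observability
```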
Related Topics
- Grafana Integration - Query reference and basics
- Telemetry Configuration - Configure telemetry export
- Telemetry Infrastructure - Backend architecture
- Multi-Instance Telemetry - Architecture for monitoring multiple customer instances