Grafana Integration
Guide for using Grafana to explore and correlate traces, logs, and metrics.
Overview
Grafana provides a unified interface for exploring all telemetry data:
- Tempo for distributed traces
- Loki for logs
- Prometheus for metrics
The datasources are pre-configured with cross-signal correlation, allowing you to navigate seamlessly between traces, logs, and metrics.
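The pre-configured datasources correspond to a standard Grafana provisioning file. The sketch below is illustrative only; the names, UIDs, and container hostnames (prometheus, tempo, loki) are assumptions and may differ from the files actually shipped with this stack.

```yaml
# Illustrative datasource provisioning; UIDs and hostnames are assumptions
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    uid: prometheus
    url: http://prometheus:9090
  - name: Tempo
    type: tempo
    uid: tempo
    url: http://tempo:3200
  - name: Loki
    type: loki
    uid: loki
    url: http://loki:3100
```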
Accessing Grafana
URL: http://localhost:3000
Default Credentials:
- Username: admin
- Password: admin
Data Flow
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ Browser │ │ AppServer │ │ Node.js App │
│ │ │ (Go) │ │ │
│ trace_id: │────▶│ trace_id: │────▶│ trace_id: │
│ abc123... │ │ abc123... │ │ abc123... │
└──────────────┘ └──────────────┘ └──────────────┘
│ │ │
│ OTLP/HTTP to OTEL_EXPORTER_OTLP_ENDPOINT
▼ ▼ ▼
┌─────────────────────────────────────────────────────────┐
│ OTEL Collector (localhost:4318) │
│ Receives → Processes → Exports to backends │
└─────────────────────────────────────────────────────────┘
│ │ │
▼ ▼ ▼
┌─────────┐ ┌──────────┐ ┌─────────┐
│ Tempo │ │Prometheus│ │ Loki │
│ (:3200) │ │ (:9090) │ │ (:3100) │
└─────────┘ └──────────┘ └─────────┘
│ │ │
└───────────────────┴────────────────────┘
│
▼
┌──────────────┐
│ Grafana │
│ (:3000) │
└──────────────┘
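The collector's role in this fan-out can be pictured with a configuration sketch like the one below. It is illustrative only: the actual receivers, processors, exporters, and hostnames used by this stack live in the collector's own config file and may differ.

```yaml
# Sketch of an OTel Collector pipeline matching the diagram above.
# Hostnames (tempo, loki) and exporter choices are assumptions.
receivers:
  otlp:
    protocols:
      http:
        endpoint: 0.0.0.0:4318   # matches OTEL_EXPORTER_OTLP_ENDPOINT

processors:
  batch: {}

exporters:
  otlp/tempo:
    endpoint: tempo:4317          # traces on to Tempo
    tls:
      insecure: true
  prometheus:
    endpoint: 0.0.0.0:8889        # scrape endpoint for Prometheus
  loki:
    endpoint: http://loki:3100/loki/api/v1/push

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/tempo]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheus]
    logs:
      receivers: [otlp]
      processors: [batch]
      exporters: [loki]
```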
Available Metrics
Metrics are exported by the HTTP middleware. Both Go and Node.js runtimes use the same metric names and labels.
Common HTTP Metrics
| Metric | Type | Description |
|---|---|---|
| http_requests_total | Counter | Total HTTP requests |
| http_request_duration_ms | Histogram | Request duration in milliseconds |
Common Labels
| Label | Description |
|---|---|
| method | HTTP method (GET, POST, etc.) |
| path | Request path |
| status | HTTP status code as string |
| status_group | Status group (2xx, 3xx, 4xx, 5xx) |
Node.js Additional Metrics
| Metric | Type | Labels | Description |
|---|---|---|---|
| http_requests_errors_total | Counter | method, path, status, status_group, error_type | Total HTTP errors |
| http_requests_active | UpDownCounter | method, path | Currently active requests |
Note: The OTEL Collector adds the appserver_ prefix to metrics via its Prometheus exporter namespace configuration.
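In collector terms, that prefix comes from the Prometheus exporter's namespace option, roughly like this (a sketch; the exact exporter block in this repo may differ):

```yaml
exporters:
  prometheus:
    endpoint: 0.0.0.0:8889
    namespace: appserver   # http_requests_total becomes appserver_http_requests_total
```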
Searching Traces
By Trace ID
- Navigate to Explore in the left sidebar
- Select Tempo datasource
- Choose TraceQL tab
- Enter the trace ID directly in the search box, or use a query:
{ trace:id = "abc123def456..." }
Or simply paste the trace ID into Tempo's "Search" tab.
By Service Name
Find all traces from a specific service:
{ resource.service.name = "appserver" }
By Span Name
Find traces containing specific operations:
{ name = "HTTP GET" }
By Duration
Find slow traces (> 1 second):
{ duration > 1s }
By Status
Find failed requests:
{ status = error }
Combined Queries
{ resource.service.name = "appserver" && duration > 500ms && status = error }
Searching Logs
Logs are sent to Loki via the OTEL Collector. The log entries contain structured JSON with fields like trace_id, span_id, method, path, status, etc.
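A single log entry therefore looks roughly like this (field names and values are illustrative):

```json
{
  "timestamp": "2024-01-01T12:00:00.000Z",
  "level": "info",
  "message": "request completed",
  "service": "appserver",
  "method": "GET",
  "path": "/api/items",
  "status": 200,
  "duration_ms": 42,
  "request_id": "req-456...",
  "trace_id": "abc123def456...",
  "span_id": "789abcdef..."
}
```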
By Service
- Navigate to Explore
- Select Loki datasource
- Use LogQL:
{service_name="appserver"}
Or by job label (if configured):
{job="appserver"}
By Log Level
Using JSON parsing to filter by level:
{service_name="appserver"} | json | level = `error`
By Trace ID
Find all logs for a specific request:
{service_name="appserver"} | json | trace_id = `abc123def456...`
By Request ID
{service_name="appserver"} | json | request_id = `req-456...`
Text Search
{service_name="appserver"} |= `connection refused`
Filtering by Duration
{service_name="appserver"} | json | duration_ms > 1000
Cross-Signal Correlation
Logs to Traces
When viewing logs in Explore, click on the TraceID link in log entries to jump directly to the corresponding trace in Tempo.
This works because Loki is configured with derived fields that extract trace_id from JSON logs:
derivedFields:
- datasourceUid: tempo
matcherRegex: '"trace_id":"(\\w+)"'
name: TraceID
url: "$${__value.raw}"
Prerequisites for log-to-trace correlation:
- Logs must include a trace_id field (automatically added by the logging middleware when a span is active)
- The tracing middleware must run before the logging middleware to ensure span context is available
Traces to Logs
When viewing a trace in Tempo:
- Click on any span
- Look for the Logs button in the span details panel
- Click to see all logs emitted during that span's execution
The Tempo datasource is configured to link to Loki filtering by trace_id.
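In Grafana provisioning terms this is the datasource's traces-to-logs setting; a minimal sketch follows (option names follow Grafana's tracesToLogsV2 jsonData, values are assumptions):

```yaml
# Sketch: Tempo datasource linking spans to Loki logs by trace_id
datasources:
  - name: Tempo
    type: tempo
    uid: tempo
    url: http://tempo:3200
    jsonData:
      tracesToLogsV2:
        datasourceUid: loki        # target Loki datasource
        spanStartTimeShift: "-5m"  # widen the log search window around the span
        spanEndTimeShift: "5m"
        filterByTraceID: true      # add a trace_id filter to the generated query
```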
Metrics to Traces (Exemplars)
Exemplars link specific metric data points to traces. This requires:
- Tempo's metrics generator to be enabled (configured in tempo-config.yml; see the sketch below)
- Tempo to write exemplars to Prometheus via remote write
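The relevant tempo-config.yml pieces look roughly like this sketch (assuming Prometheus has its remote-write receiver enabled; the repository's actual file may differ):

```yaml
# Sketch: Tempo metrics generator writing span metrics and exemplars to Prometheus
metrics_generator:
  storage:
    path: /var/tempo/generator/wal
    remote_write:
      - url: http://prometheus:9090/api/v1/write
        send_exemplars: true

overrides:
  defaults:
    metrics_generator:
      processors: [service-graphs, span-metrics]
```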
To use exemplars:
- In a Prometheus graph, enable the Exemplars toggle
- Hover over exemplar points (diamond markers)
- Click to view the associated trace
TraceQL Reference
Basic Syntax
{ <spanset filter> }
Resource Attributes
{ resource.service.name = "appserver" }
{ resource.deployment.environment = "production" }
Span Attributes
Use semantic convention attribute names:
{ span.http.request.method = "POST" }
{ span.http.response.status_code >= 400 }
{ span.db.statement =~ "SELECT.*users" }
Intrinsic Attributes
| Attribute | Description |
|---|---|
| name | Span name |
| status | Span status (ok, error, unset) |
| duration | Span duration |
| kind | Span kind (server, client, internal, producer, consumer) |
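Intrinsics can be combined with each other and with attributes, for example (illustrative query):

```
{ kind = server && status = error && duration > 200ms }
```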
Operators
| Operator | Description |
|---|---|
| = | Equals |
| != | Not equals |
| >, >=, <, <= | Numeric comparison |
| =~ | Regex match |
| !~ | Regex not match |
| && | AND |
| \|\| | OR |
Duration Units
- ns - nanoseconds
- us - microseconds
- ms - milliseconds
- s - seconds
- m - minutes
- h - hours
Examples
Find HTTP errors:
{ span.http.response.status_code >= 500 && duration > 100ms }
Find database queries:
{ span.db.system = "postgresql" && duration > 50ms }
LogQL Reference
Stream Selectors
{service_name="appserver"} # By service name
{service_name="appserver", level="error"} # Multiple labels
{service_name=~"app.*"} # Regex match
{service_name!="appserver"} # Not equals
Line Filters
| Filter | Description |
|---|---|
| \|= | Contains |
| != | Does not contain |
| \|~ | Regex match |
| !~ | Regex not match |
JSON Parsing
{service_name="appserver"} | json
After parsing, access fields:
{service_name="appserver"} | json | method = `POST`
{service_name="appserver"} | json | duration_ms > 1000
Formatting Output
{service_name="appserver"} | json | line_format `{{.level}}: {{.message}}`
Aggregations
Count errors per minute:
count_over_time({service_name="appserver"} | json | level = `error` [1m])
Rate of requests (logs per second):
rate({service_name="appserver"} [5m])
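Combining both, an error ratio can be computed from logs (illustrative; assumes the level field parses as shown above):

```
sum(rate({service_name="appserver"} | json | level = `error` [5m]))
/
sum(rate({service_name="appserver"} [5m]))
```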
PromQL Reference
Basic Queries
# Current value (with appserver_ prefix from collector)
appserver_http_requests_total
# Rate over 5 minutes
rate(appserver_http_requests_total[5m])
# Filter by labels
appserver_http_requests_total{method="GET", status="200"}
# Filter by status group
appserver_http_requests_total{status_group="2xx"}
Aggregations
# Sum across all instances
sum(rate(appserver_http_requests_total[5m]))
# Average duration by method (histogram metric: divide rate of _sum by rate of _count)
sum by (method) (rate(appserver_http_request_duration_ms_sum[5m]))
  / sum by (method) (rate(appserver_http_request_duration_ms_count[5m]))
# 95th percentile latency (milliseconds)
histogram_quantile(0.95, rate(appserver_http_request_duration_ms_bucket[5m]))
Creating Dashboards
Request Explorer Dashboard
Create a dashboard for exploring requests:
- New Dashboard → Add Panel
- Add the following panels:
Request Rate (works for both Go and Node.js):
sum(rate(appserver_http_requests_total[5m])) by (method)
Error Rate:
sum(rate(appserver_http_requests_total{status_group="5xx"}[5m]))
/
sum(rate(appserver_http_requests_total[5m]))
Latency p95 (milliseconds):
histogram_quantile(0.95,
sum(rate(appserver_http_request_duration_ms_bucket[5m])) by (le)
)
Recent Errors (Logs):
{service_name=~"appserver|orchestrator"} | json | level = `error`
Service Health Dashboard
Uptime:
up{job="appserver"}
Memory Usage:
process_resident_memory_bytes{job="appserver"} / 1024 / 1024
Goroutines (Go services):
go_goroutines{job="appserver"}
Error Tracking Dashboard
Error Count by Status:
sum by (status) (increase(appserver_http_requests_total{status_group="5xx"}[1h]))
Error Logs:
{service_name="appserver"} | json | level = `error` | line_format `{{.message}}`
Failed Traces: Link to Tempo with query:
{ resource.service.name = "appserver" && status = error }
Dashboard Variables
Create interactive dashboards with variables:
Service Variable
- Dashboard Settings → Variables → Add
- Name: service
- Type: Query
- Data source: Prometheus
- Query: label_values(appserver_http_requests_total, job)
Use in queries:
rate(appserver_http_requests_total{job="$service"}[5m])
Method Variable
label_values(appserver_http_requests_total, method)
Time Range Variable
Use built-in $__range variable:
increase(appserver_http_requests_total[$__range])
Alerting
Creating Alert Rules
- Navigate to Alerting → Alert rules
- Click New alert rule
High Error Rate Alert:
sum(rate(appserver_http_requests_total{status_group="5xx"}[5m]))
/
sum(rate(appserver_http_requests_total[5m]))
> 0.05
High Latency Alert (threshold: 2000ms):
histogram_quantile(0.95,
sum(rate(appserver_http_request_duration_ms_bucket[5m])) by (le)
) > 2000
Log-Based Alerts
Using Loki:
count_over_time({service_name="appserver"} |= `CRITICAL` [5m]) > 10
Tips and Best Practices
Efficient Querying
- Use time ranges - Always specify appropriate time ranges
- Filter early - Apply label filters before line filters in LogQL
- Limit results - Use limit in TraceQL for large result sets
Log Fields for Correlation
Ensure your logs include these fields for cross-signal correlation:
| Field | Description | Added By |
|---|---|---|
| trace_id | W3C trace ID | Logging middleware (when span is active) |
| span_id | Current span ID | Logging middleware (when span is active) |
| request_id | Request correlation ID | RequestID middleware |
| service | Service name | Logger default meta |
Organizing Dashboards
- Folders - Group dashboards by team or system
- Tags - Add tags like appserver, production, alerts
- Links - Add dashboard links for easy navigation
Performance
- Sampling - Ensure trace sampling is appropriate for load (see the example after this list)
- Retention - Monitor storage usage and adjust retention
- Caching - Enable result caching for frequently-used queries
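On the sampling point, head sampling is commonly tuned through the standard OpenTelemetry SDK environment variables; a sketch (this stack may configure sampling elsewhere, e.g. in the collector):

```
# Standard OTel SDK sampling variables; the 10% ratio is illustrative
OTEL_TRACES_SAMPLER=parentbased_traceidratio
OTEL_TRACES_SAMPLER_ARG=0.1
```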