
Grafana Integration

Guide for using Grafana to explore and correlate traces, logs, and metrics.

Overview

Grafana provides a unified interface for exploring all telemetry data:

  • Tempo for distributed traces
  • Loki for logs
  • Prometheus for metrics

The datasources are pre-configured with cross-signal correlation, allowing you to navigate seamlessly between traces, logs, and metrics.

Accessing Grafana

URL: http://localhost:3000

Default Credentials:

  • Username: admin
  • Password: admin

Data Flow

┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│   Browser    │     │  AppServer   │     │ Node.js App  │
│              │     │    (Go)      │     │              │
│  trace_id:   │────▶│  trace_id:   │────▶│  trace_id:   │
│  abc123...   │     │  abc123...   │     │  abc123...   │
└──────────────┘     └──────────────┘     └──────────────┘
       │                    │                    │
       │  OTLP/HTTP to OTEL_EXPORTER_OTLP_ENDPOINT
       ▼                    ▼                    ▼
┌─────────────────────────────────────────────────────────┐
│             OTEL Collector (localhost:4318)              │
│        Receives → Processes → Exports to backends        │
└─────────────────────────────────────────────────────────┘
       │                    │                    │
       ▼                    ▼                    ▼
  ┌─────────┐         ┌──────────┐          ┌─────────┐
  │  Tempo  │         │Prometheus│          │  Loki   │
  │ (:3200) │         │ (:9090)  │          │ (:3100) │
  └─────────┘         └──────────┘          └─────────┘
       │                    │                    │
       └────────────────────┴────────────────────┘
                            │
                            ▼
                    ┌──────────────┐
                    │   Grafana    │
                    │   (:3000)    │
                    └──────────────┘
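Each application discovers the collector through the standard OpenTelemetry SDK environment variables. A minimal sketch of the relevant settings (values are illustrative; the actual values live in each service's configuration):

# Standard OpenTelemetry SDK environment variables (values illustrative)
OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4318
OTEL_SERVICE_NAME=appserver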

Available Metrics

Metrics are exported by the HTTP middleware. Both Go and Node.js runtimes use the same metric names and labels.

Common HTTP Metrics

Metric                     Type        Description
http_requests_total        Counter     Total HTTP requests
http_request_duration_ms   Histogram   Request duration in milliseconds

Common Labels

Label          Description
method         HTTP method (GET, POST, etc.)
path           Request path
status         HTTP status code as string
status_group   Status group (2xx, 3xx, 4xx, 5xx)

Node.js Additional Metrics

Metric                       Type            Labels                                            Description
http_requests_errors_total   Counter         method, path, status, status_group, error_type   Total HTTP errors
http_requests_active         UpDownCounter   method, path                                      Currently active requests

Note: The OTEL Collector adds the appserver_ prefix to metrics via its Prometheus exporter namespace configuration.
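For reference, a minimal sketch of the exporter block that adds the prefix, using the collector's Prometheus exporter namespace option (the listen endpoint shown here is an assumption; check the collector config shipped with this setup):

exporters:
  prometheus:
    endpoint: "0.0.0.0:8889"   # assumed scrape endpoint; may differ in your collector config
    namespace: appserver       # prepends appserver_ to every exported metric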

Searching Traces

By Trace ID

  1. Navigate to Explore in the left sidebar
  2. Select Tempo datasource
  3. Choose TraceQL tab
  4. Enter the trace ID directly in the search box, or query by the trace:id intrinsic:
{ trace:id = "abc123def456..." }

Or simply paste the trace ID into Tempo's "Search" tab.

By Service Name

Find all traces from a specific service:

{ resource.service.name = "appserver" }

By Span Name

Find traces containing specific operations:

{ name = "HTTP GET" }

By Duration

Find slow traces (> 1 second):

{ duration > 1s }

By Status

Find failed requests:

{ status = error }

Combined Queries

{ resource.service.name = "appserver" && duration > 500ms && status = error }

Searching Logs

Logs are sent to Loki via the OTEL Collector. The log entries contain structured JSON with fields like trace_id, span_id, method, path, status, etc.

By Service

  1. Navigate to Explore
  2. Select Loki datasource
  3. Use LogQL:
{service_name="appserver"}

Or by job label (if configured):

{job="appserver"}

By Log Level

Using JSON parsing to filter by level:

{service_name="appserver"} | json | level = `error`

By Trace ID

Find all logs for a specific request:

{service_name="appserver"} | json | trace_id = `abc123def456...`

By Request ID

{service_name="appserver"} | json | request_id = `req-456...`

By Message Content

{service_name="appserver"} |= `connection refused`

Filtering by Duration

{service_name="appserver"} | json | duration_ms > 1000

Cross-Signal Correlation

Logs to Traces

When viewing logs in Explore, click on the TraceID link in log entries to jump directly to the corresponding trace in Tempo.

This works because Loki is configured with derived fields that extract trace_id from JSON logs:

derivedFields:
  - datasourceUid: tempo
    matcherRegex: '"trace_id":"(\\w+)"'
    name: TraceID
    url: "$${__value.raw}"

Prerequisites for log-to-trace correlation:

  • Logs must include trace_id field (automatically added by logging middleware when a span is active)
  • The tracing middleware must run before the logging middleware to ensure span context is available

Traces to Logs

When viewing a trace in Tempo:

  1. Click on any span
  2. Look for the Logs button in the span details panel
  3. Click to see all logs emitted during that span's execution

The Tempo datasource is configured to link to Loki filtering by trace_id.
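A minimal sketch of what that link looks like in Tempo datasource provisioning, using Grafana's tracesToLogsV2 options (the hostname, UID values, and time shifts are assumptions; this setup's provisioning file may differ):

apiVersion: 1
datasources:
  - name: Tempo
    type: tempo
    uid: tempo
    url: http://tempo:3200         # assumed container hostname
    jsonData:
      tracesToLogsV2:
        datasourceUid: loki        # assumed Loki datasource UID
        filterByTraceID: true      # generated LogQL query filters on the span's trace ID
        spanStartTimeShift: '-5m'  # widen the log time window around the span
        spanEndTimeShift: '5m'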

Metrics to Traces (Exemplars)

Exemplars link specific metric data points to traces. This requires:

  1. Tempo's metrics generator to be enabled (configured in tempo-config.yml)
  2. Tempo to write exemplars to Prometheus via remote write

To use exemplars:

  1. In a Prometheus graph, enable the Exemplars toggle
  2. Hover over exemplar points (diamond markers)
  3. Click to view the associated trace
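Prometheus only retains exemplars when its exemplar storage feature is enabled; a minimal sketch of the required launch flag (whether this setup already passes it depends on the Prometheus service definition):

# Prometheus launch flag required for exemplar storage
--enable-feature=exemplar-storage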

TraceQL Reference

Basic Syntax

{ <spanset filter> }

Resource Attributes

{ resource.service.name = "appserver" }
{ resource.deployment.environment = "production" }

Span Attributes

Use semantic convention attribute names:

{ span.http.request.method = "POST" }
{ span.http.response.status_code >= 400 }
{ span.db.statement =~ "SELECT.*users" }

Intrinsic Attributes

Attribute   Description
name        Span name
status      Span status (ok, error, unset)
duration    Span duration
kind        Span kind (server, client, internal, producer, consumer)

Operators

Operator       Description
=              Equals
!=             Not equals
>, >=, <, <=   Numeric comparison
=~             Regex match
!~             Regex not match
&&             AND
||             OR

Duration Units

  • ns - nanoseconds
  • us - microseconds
  • ms - milliseconds
  • s - seconds
  • m - minutes
  • h - hours

Examples

Find HTTP errors:

{ span.http.response.status_code >= 500 && duration > 100ms }

Find database queries:

{ span.db.system = "postgresql" && duration > 50ms }

LogQL Reference

Stream Selectors

{service_name="appserver"}              # By service name
{service_name="appserver", level="error"} # Multiple labels
{service_name=~"app.*"} # Regex match
{service_name!="appserver"} # Not equals

Line Filters

Filter   Description
|=       Contains
!=       Does not contain
|~       Regex match
!~       Regex not match
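Line filters can be chained left to right; for example, to match timeout messages while excluding health-check noise (the log text here is illustrative):

{service_name="appserver"} |= `timeout` != `/health`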

JSON Parsing

{service_name="appserver"} | json

After parsing, access fields:

{service_name="appserver"} | json | method = `POST`
{service_name="appserver"} | json | duration_ms > 1000

Formatting Output

{service_name="appserver"} | json | line_format `{{.level}}: {{.message}}`

Aggregations

Count errors per minute:

count_over_time({service_name="appserver"} | json | level = `error` [1m])

Rate of requests (logs per second):

rate({service_name="appserver"} [5m])

PromQL Reference

Basic Queries

# Current value (with appserver_ prefix from collector)
appserver_http_requests_total

# Rate over 5 minutes
rate(appserver_http_requests_total[5m])

# Filter by labels
appserver_http_requests_total{method="GET", status="200"}

# Filter by status group
appserver_http_requests_total{status_group="2xx"}

Aggregations

# Sum across all instances
sum(rate(appserver_http_requests_total[5m]))

# Average latency by method (histogram sum / count)
sum by (method) (rate(appserver_http_request_duration_ms_sum[5m]))
  / sum by (method) (rate(appserver_http_request_duration_ms_count[5m]))

# 95th percentile latency (milliseconds)
histogram_quantile(0.95, rate(appserver_http_request_duration_ms_bucket[5m]))

Creating Dashboards

Request Explorer Dashboard

Create a dashboard for exploring requests:

  1. New Dashboard → Add Panel
  2. Add the following panels:

Request Rate (works for both Go and Node.js):

sum(rate(appserver_http_requests_total[5m])) by (method)

Error Rate:

sum(rate(appserver_http_requests_total{status_group="5xx"}[5m]))
/
sum(rate(appserver_http_requests_total[5m]))

Latency p95 (milliseconds):

histogram_quantile(0.95,
  sum(rate(appserver_http_request_duration_ms_bucket[5m])) by (le)
)

Recent Errors (Logs):

{service_name=~"appserver|orchestrator"} | json | level = `error`

Service Health Dashboard

Uptime:

up{job="appserver"}

Memory Usage:

process_resident_memory_bytes{job="appserver"} / 1024 / 1024

Goroutines (Go services):

go_goroutines{job="appserver"}

Error Tracking Dashboard

Error Count by Status:

sum by (status) (increase(appserver_http_requests_total{status_group="5xx"}[1h]))

Error Logs:

{service_name="appserver"} | json | level = `error` | line_format `{{.message}}`

Failed Traces: Link to Tempo with query:

{ resource.service.name = "appserver" && status = error }

Dashboard Variables

Create interactive dashboards with variables:

Service Variable

  1. Dashboard Settings → Variables → Add
  2. Name: service
  3. Type: Query
  4. Data source: Prometheus
  5. Query: label_values(appserver_http_requests_total, job)

Use in queries:

rate(appserver_http_requests_total{job="$service"}[5m])

Method Variable

label_values(appserver_http_requests_total, method)

Time Range Variable

Use built-in $__range variable:

increase(appserver_http_requests_total[$__range])

Alerting

Creating Alert Rules

  1. Navigate to Alerting → Alert rules
  2. Click New alert rule

High Error Rate Alert:

sum(rate(appserver_http_requests_total{status_group="5xx"}[5m]))
/
sum(rate(appserver_http_requests_total[5m]))
> 0.05

High Latency Alert (threshold: 2000ms):

histogram_quantile(0.95,
  sum(rate(appserver_http_request_duration_ms_bucket[5m])) by (le)
) > 2000

Log-Based Alerts

Using Loki:

count_over_time({service_name="appserver"} |= `CRITICAL` [5m]) > 10

Tips and Best Practices

Efficient Querying

  1. Use time ranges - Always specify appropriate time ranges
  2. Filter early - Apply label filters before line filters in LogQL
  3. Limit results - Use limit in TraceQL for large result sets

Log Fields for Correlation

Ensure your logs include these fields for cross-signal correlation:

Field        Description              Added By
trace_id     W3C trace ID             Logging middleware (when span is active)
span_id      Current span ID          Logging middleware (when span is active)
request_id   Request correlation ID   RequestID middleware
service      Service name             Logger default meta
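For reference, a log line carrying these fields might look like the following (all values illustrative):

{"level":"info","message":"request completed","service":"appserver","request_id":"req-456...","trace_id":"abc123...","span_id":"def456...","method":"GET","path":"/api/example","status":200,"duration_ms":42}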

Organizing Dashboards

  1. Folders - Group dashboards by team or system
  2. Tags - Add tags like appserver, production, alerts
  3. Links - Add dashboard links for easy navigation

Performance

  1. Sampling - Ensure trace sampling is appropriate for load
  2. Retention - Monitor storage usage and adjust retention
  3. Caching - Enable result caching for frequently-used queries