Telemetry

Llama Stack uses OpenTelemetry for observability. It provides two layers of instrumentation:

  • Auto-instrumentation (zero-code) captures HTTP requests, database queries, and GenAI calls from supported SDKs (OpenAI, Bedrock, Vertex AI, etc.)
  • Manual instrumentation emits domain-specific metrics for inference latency, tool execution, vector store operations, and request throughput

Both layers export data through the standard OTLP protocol to any compatible backend (Jaeger, Prometheus, Grafana, MLflow, Datadog, etc.).

Quick start

Install the OpenTelemetry packages and wrap the server command with opentelemetry-instrument:

# Install OpenTelemetry packages
uv pip install opentelemetry-distro opentelemetry-exporter-otlp
uv run opentelemetry-bootstrap -a requirements | uv pip install --requirement -

# Run with instrumentation
export OTEL_EXPORTER_OTLP_ENDPOINT="http://127.0.0.1:4318"

uv run opentelemetry-instrument \
--traces_exporter otlp \
--metrics_exporter otlp \
--service_name llama-stack-server \
-- \
llama stack run starter

This sends traces and metrics to an OTLP collector on port 4318. The next section shows how to set up the full observability stack.

Observability stack setup

The repository includes a one-command setup script that deploys Jaeger (traces), an OpenTelemetry Collector, Prometheus (metrics), and Grafana (dashboards) using Docker or Podman.

Architecture

The Llama Stack server exports OTLP data to the OpenTelemetry Collector, which forwards traces to Jaeger and metrics to Prometheus; Grafana reads from both backends to render the dashboards.
Deploy

# Auto-detects Docker or Podman
./scripts/telemetry/setup_telemetry.sh

# Or specify explicitly
./scripts/telemetry/setup_telemetry.sh --container docker
./scripts/telemetry/setup_telemetry.sh --container podman

This creates a llama-telemetry container network and starts all four services with pre-provisioned Grafana dashboards.
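At the center of this stack, the collector needs pipelines wiring its OTLP receiver to each backend. The script provisions the real configuration; the fragment below is only an illustrative minimal equivalent, and the hostname (`jaeger`), ports, and exporter choices are assumptions rather than the script's exact contents.

```yaml
# Minimal illustrative OpenTelemetry Collector config (not the file the
# setup script installs): traces go to Jaeger, metrics are exposed for
# Prometheus to scrape.
receivers:
  otlp:
    protocols:
      http:
        endpoint: 0.0.0.0:4318
      grpc:
        endpoint: 0.0.0.0:4317

exporters:
  otlp/jaeger:
    endpoint: jaeger:4317   # Jaeger accepts OTLP natively
    tls:
      insecure: true
  prometheus:
    endpoint: 0.0.0.0:9464  # scrape target for Prometheus

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlp/jaeger]
    metrics:
      receivers: [otlp]
      exporters: [prometheus]
```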

Access the UIs

| Service | URL | Credentials |
|---|---|---|
| Jaeger (traces) | http://localhost:16686 | N/A |
| Prometheus (metrics) | http://localhost:9090 | N/A |
| Grafana (dashboards) | http://localhost:3000 | admin / admin |

Cleanup

# Replace "docker" with "podman" if applicable
docker stop jaeger otel-collector prometheus grafana
docker rm jaeger otel-collector prometheus grafana
docker network rm llama-telemetry

Client-side instrumentation

You can instrument your client application the same way as the server. This captures outbound HTTP calls to Llama Stack and correlates them with server-side traces.

export OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4318

uv run opentelemetry-instrument \
--traces_exporter otlp \
--metrics_exporter otlp \
--service_name my-llama-stack-app \
-- \
python my_app.py

Example my_app.py:

from openai import OpenAI

client = OpenAI(
    api_key="fake",
    base_url="http://localhost:8321/v1/",
)

response = client.chat.completions.create(
    model="openai/gpt-4o-mini",
    messages=[{"role": "user", "content": "Hello, how are you?"}],
)
print(response.choices[0].message.content)

Metrics reference

Llama Stack emits metrics across five domains. All metric names use the llama_stack prefix.

Inference

| Metric | Type | Unit | Description |
|---|---|---|---|
| llama_stack.inference.duration_seconds | Histogram | s | End-to-end inference latency |
| llama_stack.inference.time_to_first_token_seconds | Histogram | s | Time to first content token (streaming only) |
| llama_stack.inference.tokens_per_second | Histogram | - | Output token throughput |

Attributes: model, provider, stream, status

Tool runtime

| Metric | Type | Unit | Description |
|---|---|---|---|
| llama_stack.tool_runtime.invocations_total | Counter | 1 | Total tool invocations |
| llama_stack.tool_runtime.duration_seconds | Histogram | s | Tool execution latency |

Attributes: tool_group, tool_name, provider, status

Vector IO

| Metric | Type | Unit | Description |
|---|---|---|---|
| llama_stack.vector_io.inserts_total | Counter | 1 | Vector insert operations |
| llama_stack.vector_io.queries_total | Counter | 1 | Vector query/search operations |
| llama_stack.vector_io.deletes_total | Counter | 1 | Vector delete operations |
| llama_stack.vector_io.stores_total | Counter | 1 | Vector stores created |
| llama_stack.vector_io.files_total | Counter | 1 | Files attached to vector stores |
| llama_stack.vector_io.chunks_processed_total | Counter | 1 | Chunks processed across inserts |
| llama_stack.vector_io.insert_duration_seconds | Histogram | s | Insert operation latency |
| llama_stack.vector_io.retrieval_duration_seconds | Histogram | s | Retrieval operation latency |

Attributes: vector_db, operation, provider, status, search_mode

Request level

| Metric | Type | Unit | Description |
|---|---|---|---|
| llama_stack.requests_total | Counter | 1 | Total HTTP requests by API, method, and status |
| llama_stack.request_duration_seconds | Histogram | s | Request latency by API and method |
| llama_stack.concurrent_requests | Gauge | 1 | Current in-flight requests by API |

Attributes: api, method, status_code (counter only)

Responses API

| Metric | Type | Unit | Description |
|---|---|---|---|
| llama_stack.responses.parameter_usage_total | Counter | 1 | Responses API parameter usage tracking |

Grafana dashboards

The setup script provisions six pre-built dashboards:

| Dashboard | What it shows |
|---|---|
| Llama Stack | Overview: token usage by model, P95/P99 HTTP duration, total requests |
| Inference Metrics | Inference latency distribution, time-to-first-token, tokens/sec by model and provider |
| Request Metrics | Request rates, error rates, concurrent requests, latency percentiles by API endpoint |
| Responses Metrics | Responses API parameter usage patterns |
| Tool Runtime Metrics | Tool invocation counts and latency by tool group and name |
| Vector IO Metrics | Insert/query/delete rates, chunk processing volumes, operation latency |

PromQL examples

# Total input token usage by model
sum by(gen_ai_request_model) (llama_stack_gen_ai_client_token_usage_sum{gen_ai_token_type="input"})

# Total output token usage by model
sum by(gen_ai_request_model) (llama_stack_gen_ai_client_token_usage_sum{gen_ai_token_type="output"})

# P95 HTTP server latency
histogram_quantile(0.95, rate(llama_stack_http_server_duration_milliseconds_bucket[5m]))

# P99 inference duration by model
histogram_quantile(0.99, rate(llama_stack_inference_duration_seconds_bucket[5m]))

# Tool invocation rate by tool name
rate(llama_stack_tool_runtime_invocations_total[5m])

# Vector insert throughput
rate(llama_stack_vector_io_inserts_total[5m])
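The same queries can be issued programmatically through Prometheus's HTTP API. The helper below is a standard-library sketch; it assumes the Prometheus instance from the setup script at its default localhost:9090 address.

```python
# Sketch: run an instant PromQL query against the Prometheus HTTP API using
# only the standard library. Assumes the setup script's Prometheus instance
# at localhost:9090.
import json
import urllib.parse
import urllib.request

PROMETHEUS_URL = "http://localhost:9090"  # default from the setup script


def query_prometheus(promql: str) -> list:
    """Return the result vector for an instant PromQL query."""
    url = f"{PROMETHEUS_URL}/api/v1/query?" + urllib.parse.urlencode(
        {"query": promql}
    )
    with urllib.request.urlopen(url) as resp:
        payload = json.load(resp)
    if payload.get("status") != "success":
        raise RuntimeError(f"query failed: {payload}")
    return payload["data"]["result"]


# Example (requires the observability stack to be running):
# for series in query_prometheus("rate(llama_stack_requests_total[5m])"):
#     print(series["metric"], series["value"])
```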

GenAI message content capture

By default, prompt and response content is not captured for privacy. To enable content capture:

export OTEL_INSTRUMENTATION_GENAI_CAPTURE_MESSAGE_CONTENT=true

When enabled, content is emitted as log events (e.g., gen_ai.user.message, gen_ai.choice) with trace_id and span_id for correlation. Spans carry structured metadata (model, finish reason, token usage) but not the raw text.

Exporter configuration

OTEL_INSTRUMENTATION_GENAI_CAPTURE_MESSAGE_CONTENT=true \
uv run opentelemetry-instrument \
--traces_exporter console \
--logs_exporter console \
-- \
python my_app.py

Jaeger and logs

Jaeger ingests traces only, not logs. If you set OTEL_LOGS_EXPORTER=otlp and point it at Jaeger, logs will be rejected (404). Use an OTel Collector to route logs to a log-capable backend, or use OTEL_LOGS_EXPORTER=console for debugging.

Environment variables

| Variable | Default | Description |
|---|---|---|
| OTEL_EXPORTER_OTLP_ENDPOINT | (none) | OTLP endpoint URL. Metrics and traces are only exported when set. |
| OTEL_EXPORTER_OTLP_PROTOCOL | http/protobuf | Transport protocol for OTLP export |
| OTEL_SERVICE_NAME | llama-stack | Service name tag on all telemetry data |
| OTEL_METRIC_EXPORT_INTERVAL | 60000 | Metric export interval in milliseconds |
| OTEL_PYTHON_DISABLED_INSTRUMENTATIONS | (none) | Comma-separated list of instrumentors to disable |
| OTEL_INSTRUMENTATION_GENAI_CAPTURE_MESSAGE_CONTENT | false | Capture prompt/response text as log events |

Known issues

When OpenTelemetry auto-instrumentation is enabled, both the low-level database driver instrumentors (e.g., asyncpg, sqlite3) and the SQLAlchemy ORM instrumentor activate simultaneously, so every database operation is traced twice: once at the ORM level and once at the raw protocol level. The driver-level spans also expose internal pool mechanics (such as connection health-check queries) that inflate traces with noise. To prevent this, disable the driver-level instrumentors and rely on the SQLAlchemy instrumentation alone:

export OTEL_PYTHON_DISABLED_INSTRUMENTATIONS="sqlite3,asyncpg"

Note: the container image sets this automatically when any OTEL_* environment variable is present.