# Telemetry
Llama Stack uses OpenTelemetry for observability. It provides two layers of instrumentation:
- Auto-instrumentation (zero-code) captures HTTP requests, database queries, and GenAI calls from supported SDKs (OpenAI, Bedrock, Vertex AI, etc.)
- Manual instrumentation emits domain-specific metrics for inference latency, tool execution, vector store operations, and request throughput
Both layers export data through the standard OTLP protocol to any compatible backend (Jaeger, Prometheus, Grafana, MLflow, Datadog, etc.).
## Quick start
Install the OpenTelemetry packages and wrap the server command with `opentelemetry-instrument`:
```bash
# Install OpenTelemetry packages
uv pip install opentelemetry-distro opentelemetry-exporter-otlp
uv run opentelemetry-bootstrap -a requirements | uv pip install --requirement -

# Run with instrumentation
export OTEL_EXPORTER_OTLP_ENDPOINT="http://127.0.0.1:4318"
uv run opentelemetry-instrument \
  --traces_exporter otlp \
  --metrics_exporter otlp \
  --service_name llama-stack-server \
  -- \
  llama stack run starter
```
This sends traces and metrics to an OTLP collector on port 4318. The next section shows how to set up the full observability stack.
## Observability stack setup
The repository includes a one-command setup script that deploys Jaeger (traces), an OpenTelemetry Collector, Prometheus (metrics), and Grafana (dashboards) using Docker or Podman.
### Architecture

*(Architecture diagram omitted: the app and server export OTLP to the collector, which fans out to Jaeger, Prometheus, and Grafana.)*

### Deploy
```bash
# Auto-detects Docker or Podman
./scripts/telemetry/setup_telemetry.sh

# Or specify explicitly
./scripts/telemetry/setup_telemetry.sh --container docker
./scripts/telemetry/setup_telemetry.sh --container podman
```
This creates a `llama-telemetry` container network and starts all four services with pre-provisioned Grafana dashboards.
### Access the UIs
| Service | URL | Credentials |
|---|---|---|
| Jaeger (traces) | http://localhost:16686 | N/A |
| Prometheus (metrics) | http://localhost:9090 | N/A |
| Grafana (dashboards) | http://localhost:3000 | admin / admin |
### Cleanup
```bash
# Replace "docker" with "podman" if applicable
docker stop jaeger otel-collector prometheus grafana
docker rm jaeger otel-collector prometheus grafana
docker network rm llama-telemetry
```
## Client-side instrumentation
You can instrument your client application the same way as the server. This captures outbound HTTP calls to Llama Stack and correlates them with server-side traces.
```bash
export OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4318
uv run opentelemetry-instrument \
  --traces_exporter otlp \
  --metrics_exporter otlp \
  --service_name my-llama-stack-app \
  -- \
  python my_app.py
```
Example `my_app.py`:
```python
from openai import OpenAI

client = OpenAI(
    api_key="fake",
    base_url="http://localhost:8321/v1/",
)

response = client.chat.completions.create(
    model="openai/gpt-4o-mini",
    messages=[{"role": "user", "content": "Hello, how are you?"}],
)
print(response.choices[0].message.content)
```
## Metrics reference

Llama Stack emits metrics across five domains. All metric names use the `llama_stack` prefix.
### Inference

| Metric | Type | Unit | Description |
|---|---|---|---|
| `llama_stack.inference.duration_seconds` | Histogram | s | End-to-end inference latency |
| `llama_stack.inference.time_to_first_token_seconds` | Histogram | s | Time to first content token (streaming only) |
| `llama_stack.inference.tokens_per_second` | Histogram | - | Output token throughput |

Attributes: `model`, `provider`, `stream`, `status`
### Tool runtime

| Metric | Type | Unit | Description |
|---|---|---|---|
| `llama_stack.tool_runtime.invocations_total` | Counter | 1 | Total tool invocations |
| `llama_stack.tool_runtime.duration_seconds` | Histogram | s | Tool execution latency |

Attributes: `tool_group`, `tool_name`, `provider`, `status`
### Vector IO

| Metric | Type | Unit | Description |
|---|---|---|---|
| `llama_stack.vector_io.inserts_total` | Counter | 1 | Vector insert operations |
| `llama_stack.vector_io.queries_total` | Counter | 1 | Vector query/search operations |
| `llama_stack.vector_io.deletes_total` | Counter | 1 | Vector delete operations |
| `llama_stack.vector_io.stores_total` | Counter | 1 | Vector stores created |
| `llama_stack.vector_io.files_total` | Counter | 1 | Files attached to vector stores |
| `llama_stack.vector_io.chunks_processed_total` | Counter | 1 | Chunks processed across inserts |
| `llama_stack.vector_io.insert_duration_seconds` | Histogram | s | Insert operation latency |
| `llama_stack.vector_io.retrieval_duration_seconds` | Histogram | s | Retrieval operation latency |

Attributes: `vector_db`, `operation`, `provider`, `status`, `search_mode`
### Request level

| Metric | Type | Unit | Description |
|---|---|---|---|
| `llama_stack.requests_total` | Counter | 1 | Total HTTP requests by API, method, and status |
| `llama_stack.request_duration_seconds` | Histogram | s | Request latency by API and method |
| `llama_stack.concurrent_requests` | Gauge | 1 | Current in-flight requests by API |

Attributes: `api`, `method`, `status_code` (counter only)
### Responses API

| Metric | Type | Unit | Description |
|---|---|---|---|
| `llama_stack.responses.parameter_usage_total` | Counter | 1 | Responses API parameter usage tracking |
## Grafana dashboards
The setup script provisions six pre-built dashboards:
| Dashboard | What it shows |
|---|---|
| Llama Stack | Overview: token usage by model, P95/P99 HTTP duration, total requests |
| Inference Metrics | Inference latency distribution, time-to-first-token, tokens/sec by model and provider |
| Request Metrics | Request rates, error rates, concurrent requests, latency percentiles by API endpoint |
| Responses Metrics | Responses API parameter usage patterns |
| Tool Runtime Metrics | Tool invocation counts and latency by tool group and name |
| Vector IO Metrics | Insert/query/delete rates, chunk processing volumes, operation latency |
### PromQL examples
```promql
# Total input token usage by model
sum by(gen_ai_request_model) (llama_stack_gen_ai_client_token_usage_sum{gen_ai_token_type="input"})

# Total output token usage by model
sum by(gen_ai_request_model) (llama_stack_gen_ai_client_token_usage_sum{gen_ai_token_type="output"})

# P95 HTTP server latency
histogram_quantile(0.95, rate(llama_stack_http_server_duration_milliseconds_bucket[5m]))

# P99 inference duration by model
histogram_quantile(0.99, rate(llama_stack_inference_duration_seconds_bucket[5m]))

# Tool invocation rate by tool name
rate(llama_stack_tool_runtime_invocations_total[5m])

# Vector insert throughput
rate(llama_stack_vector_io_inserts_total[5m])
```
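`histogram_quantile()` estimates a quantile by finding the cumulative bucket containing the target rank and interpolating linearly inside it. A rough Python sketch of that interpolation (simplified: finite sorted bounds only, ignoring Prometheus's `+Inf` bucket and edge-case handling):

```python
# Sketch of the linear interpolation behind PromQL's histogram_quantile().
def histogram_quantile(q, buckets):
    """buckets: list of (upper_bound, cumulative_count), sorted by bound."""
    total = buckets[-1][1]
    rank = q * total
    lower_bound, lower_count = 0.0, 0
    for upper_bound, count in buckets:
        if count >= rank:
            # Interpolate linearly inside the bucket containing the rank.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return buckets[-1][0]

# Hypothetical cumulative counts for a duration_seconds-style histogram:
# 60 requests took <= 0.5s, 90 took <= 1.0s, and so on.
buckets = [(0.5, 60), (1.0, 90), (2.5, 99), (5.0, 100)]
print(histogram_quantile(0.95, buckets))  # ~1.83: the 95th request falls in the 1.0-2.5s bucket
```

This is why quantile accuracy depends on bucket boundaries: within a bucket, Prometheus can only interpolate, not observe the true value.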
## GenAI message content capture
By default, prompt and response content is not captured for privacy. To enable content capture:

```bash
export OTEL_INSTRUMENTATION_GENAI_CAPTURE_MESSAGE_CONTENT=true
```
When enabled, content is emitted as log events (e.g., `gen_ai.user.message`, `gen_ai.choice`) with `trace_id` and `span_id` for correlation. Spans carry structured metadata (model, finish reason, token usage) but not the raw text.
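Roughly, such an event has this shape (an illustrative sketch only, not verbatim backend output; the IDs and attribute values below are made up):

```json
{
  "name": "gen_ai.user.message",
  "body": { "content": "Hello, how are you?" },
  "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
  "span_id": "00f067aa0ba902b7",
  "attributes": { "gen_ai.system": "openai" }
}
```

The `trace_id`/`span_id` pair is what lets a log-capable backend join the captured text back to the corresponding span in Jaeger.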
## Exporter configuration

Console (debugging):

```bash
OTEL_INSTRUMENTATION_GENAI_CAPTURE_MESSAGE_CONTENT=true \
uv run opentelemetry-instrument \
  --traces_exporter console \
  --logs_exporter console \
  -- \
  python my_app.py
```

OTLP Collector (production):

```bash
export OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4318

OTEL_INSTRUMENTATION_GENAI_CAPTURE_MESSAGE_CONTENT=true \
uv run opentelemetry-instrument \
  --traces_exporter otlp \
  --logs_exporter otlp \
  -- \
  python my_app.py
```
Note that Jaeger ingests traces only, not logs. If you set `OTEL_LOGS_EXPORTER=otlp` and point it at Jaeger, logs will be rejected (404). Use an OTel Collector to route logs to a log-capable backend, or use `OTEL_LOGS_EXPORTER=console` for debugging.
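A sketch of such a collector pipeline (the service name `jaeger`, the endpoints, and the `debug` exporter are assumptions for illustration; adjust to your deployment):

```yaml
# Collector sketch: accept OTLP from the app, forward traces to Jaeger's
# OTLP endpoint, and keep logs out of Jaeger (swap the debug exporter for
# a log-capable backend such as Loki or Elasticsearch).
receivers:
  otlp:
    protocols:
      http:
        endpoint: 0.0.0.0:4318

exporters:
  otlp/jaeger:
    endpoint: jaeger:4317
    tls:
      insecure: true
  debug: {}

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlp/jaeger]
    logs:
      receivers: [otlp]
      exporters: [debug]
```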
## Environment variables

| Variable | Default | Description |
|---|---|---|
| `OTEL_EXPORTER_OTLP_ENDPOINT` | (none) | OTLP endpoint URL. Metrics and traces are only exported when set. |
| `OTEL_EXPORTER_OTLP_PROTOCOL` | `http/protobuf` | Transport protocol for OTLP export |
| `OTEL_SERVICE_NAME` | `llama-stack` | Service name tag on all telemetry data |
| `OTEL_METRIC_EXPORT_INTERVAL` | `60000` | Metric export interval in milliseconds |
| `OTEL_PYTHON_DISABLED_INSTRUMENTATIONS` | (none) | Comma-separated list of instrumentors to disable |
| `OTEL_INSTRUMENTATION_GENAI_CAPTURE_MESSAGE_CONTENT` | `false` | Capture prompt/response text as log events |
## Known issues

When OpenTelemetry auto-instrumentation is enabled, both the low-level database driver instrumentor (e.g. `asyncpg`, `sqlite3`) and the SQLAlchemy ORM instrumentor activate simultaneously, so every database operation is traced twice: once at the ORM level and once at the raw protocol level. The driver-level spans also expose internal pool mechanics (such as connection health-check queries) that inflate traces with noise. To prevent this, disable the driver-level instrumentors and rely on the SQLAlchemy instrumentation alone:
```bash
export OTEL_PYTHON_DISABLED_INSTRUMENTATIONS="sqlite3,asyncpg"
```

The container image sets this automatically when any `OTEL_*` environment variable is present.
## Related resources
- Setup script and configs - One-command observability stack deployment
- OpenTelemetry Documentation - Comprehensive observability framework
- OpenTelemetry GenAI Semantic Conventions - Standard GenAI metric naming
- Jaeger Documentation - Distributed tracing visualization
- Prometheus Documentation - Metrics storage and querying
- Grafana Documentation - Dashboard and visualization platform