
Llama Stack Observability: Metrics, Traces, and Dashboards with OpenTelemetry

· 7 min read

Running an LLM application in production is nothing like running a traditional web service. Responses are non-deterministic. Latency swings wildly with model size and token count. And failures are often silent — a tool call that returns garbage still comes back as a 200 OK. You can stare at your HTTP dashboard all day and have no idea that half your users are getting bad answers.

We recently shipped built-in observability for Llama Stack, powered by OpenTelemetry. Three environment variables, zero code changes, and you get metrics and traces from every layer — HTTP requests, inference calls, tool invocations, vector store operations, all the way down.
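As a sketch of what that setup can look like: the exact variables Llama Stack reads are covered in the full post; the names below are the standard OpenTelemetry ones and the distribution name is a placeholder, so treat both as assumptions.

import os
import subprocess

# Standard OpenTelemetry exporter settings; Llama Stack's documented
# variable names may differ slightly from these.
os.environ["OTEL_SERVICE_NAME"] = "llama-stack"
os.environ["OTEL_EXPORTER_OTLP_ENDPOINT"] = "http://localhost:4318"
os.environ["OTEL_EXPORTER_OTLP_PROTOCOL"] = "http/protobuf"

# Launch the server as usual -- no application code changes.
# "starter" is a placeholder distribution name.
subprocess.run(["llama", "stack", "run", "starter"], env=dict(os.environ))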

This post explains the architecture behind it, walks through a hands-on tutorial, and shows what you can actually see once it's running.

Llama Stack Achieves 100% Open Responses Compliance: Enterprise-Grade OpenAI Compatibility for Your Infrastructure

· 5 min read
Francisco Javier Arceo
Llama Stack Core Team
Charlie Doern
Llama Stack Core Team

We're excited to share that Llama Stack has achieved 100% compliance with the Open Responses specification and been officially recognized as part of the Open Responses community. This milestone represents more than just compatibility: it's about bringing enterprise-grade AI capabilities to your own infrastructure with the familiarity of OpenAI APIs.

With comprehensive support for Files, Vector Stores, Search, Conversations, Prompts, Chat Completions, the full Responses API, plus powerful extensions like MCP tool integration, Tool Calling, and Connectors, Llama Stack offers something unique in the AI infrastructure landscape: a SaaS-like experience that runs entirely on your terms.

Your Agent, Your Rules: Building Powerful Agents with the Responses API in Llama Stack

· 5 min read

The Responses API is rapidly emerging as one of the most influential interfaces for building AI agents. It handles multi-step reasoning, tool orchestration, and conversational state in a single interaction, which is a big improvement over the manual orchestration loops that developers had to build on top of chat completion APIs. Llama Stack's implementation of the Responses API brings these capabilities to the open source world, where you can choose your own models and run on your own infrastructure.
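As a rough sketch of what a single Responses API call can look like against a self-hosted server: the base URL, model name, and vector store ID below are placeholders, not values from this post.

from openai import OpenAI

# Point the standard OpenAI client at a Llama Stack deployment.
client = OpenAI(base_url="http://localhost:8321/v1/openai/v1", api_key="not-needed")

response = client.responses.create(
    model="meta-llama/Llama-3.3-70B-Instruct",  # any model your stack serves
    input="What changed in the payments service this week?",
    tools=[{"type": "file_search", "vector_store_ids": ["vs_internal_docs"]}],
)
print(response.output_text)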

This post covers why the Responses API matters, what Llama Stack's implementation enables, and how it connects to the broader move toward open agent standards like Open Responses.

User Service v2 — Design Document

· 2 min read

Author: Platform Team · Status: Approved · Last updated: 2025-01-15

Overview

User Service v2 is the central identity and profile service for the Shopwave e-commerce platform. It owns user registration, login, profile management, and session lifecycle. v2 replaces the monolithic accounts module that lived inside the Rails checkout app.

Authentication

All authentication flows go through the auth gateway (gateway.shopwave.internal). The gateway issues JWT tokens signed with RS256 (RSA 2048-bit keys rotated quarterly). Access tokens have a 15-minute TTL; refresh tokens last 30 days and are stored in an HTTP-only secure cookie.

Token verification is handled by a shared middleware library (@shopwave/jwt-verify) that fetches the public key set from the gateway's /.well-known/jwks.json endpoint and caches it for 5 minutes.
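For illustration, a minimal sketch of that verification flow, shown in Python with PyJWT rather than the actual @shopwave/jwt-verify library; claim checks beyond signature and algorithm are omitted.

import jwt
from jwt import PyJWKClient

# Fetch and cache the gateway's signing keys; lifespan mirrors the 5-minute cache.
jwks_client = PyJWKClient(
    "https://gateway.shopwave.internal/.well-known/jwks.json",
    lifespan=300,
)

def verify_token(token: str) -> dict:
    # Pick the key referenced by the token's `kid` header, then verify RS256.
    signing_key = jwks_client.get_signing_key_from_jwt(token)
    return jwt.decode(token, signing_key.key, algorithms=["RS256"])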

Token claims

Claim   Description
sub     User UUID
email   Verified email address
roles   Array of role strings (customer, admin, support)
org     Merchant organization ID (multi-tenant)

Data Model

User records live in a PostgreSQL 16 cluster (users-primary.db.shopwave.internal). The schema is straightforward:

  • users — core identity (uuid, email, hashed_password, created_at)
  • profiles — display name, avatar URL, locale, timezone
  • sessions — active refresh tokens with device fingerprint and IP
  • audit_log — immutable append-only log of login, logout, and password-change events

We use row-level security (RLS) so each merchant organization can only see its own users. The org claim in the JWT maps directly to the RLS policy.
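A sketch of how that mapping could look; the policy, column, and setting names here are illustrative, since the actual schema details are not spelled out in this document.

import psycopg

# Illustrative policy: each row carries the owning org, and queries only see
# rows whose org matches the session setting derived from the JWT's `org` claim.
POLICY_SQL = """
ALTER TABLE users ENABLE ROW LEVEL SECURITY;
CREATE POLICY org_isolation ON users
    USING (org_id = current_setting('app.current_org'));
"""

def list_users(conn: psycopg.Connection, org_claim: str):
    with conn.cursor() as cur:
        # Set the org for this transaction from the verified JWT claim.
        cur.execute("SELECT set_config('app.current_org', %s, true)", (org_claim,))
        cur.execute("SELECT uuid, email FROM users")
        return cur.fetchall()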

API Surface

The service exposes a gRPC API internally and an OpenAPI REST gateway for the storefront. Key endpoints:

Method   Path                      Description
POST     /v2/auth/register         Create account
POST     /v2/auth/login            Issue tokens
POST     /v2/auth/refresh          Rotate access token
GET      /v2/users/{id}/profile    Read profile
PATCH    /v2/users/{id}/profile    Update profile
DELETE   /v2/users/{id}            GDPR deletion request

Rate limits: 20 req/s per IP on auth endpoints, 100 req/s on profile reads.
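These limits are enforced against the Redis cluster listed under Dependencies. A simple fixed-window sketch of the auth-endpoint limit follows; the key naming and the exact algorithm used in production are not specified in this document.

import redis

r = redis.Redis(host="redis.shopwave.internal", port=6379)  # placeholder host

def allow_auth_request(ip: str, limit: int = 20) -> bool:
    # Count requests from this IP in the current one-second window.
    key = f"ratelimit:auth:{ip}"
    count = r.incr(key)
    if count == 1:
        r.expire(key, 1)
    return count <= limit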

Deployment

User Service v2 runs as a Kubernetes Deployment in the platform namespace (us-east-1 and eu-west-1 regions). Each region has 3 replicas behind an internal ALB. The Docker image is built in CI and pushed to our private ECR registry.

Health checks:

  • Liveness: /healthz (checks process is up)
  • Readiness: /readyz (checks DB connection pool + auth gateway reachability)
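A sketch of what the readiness handler checks; the framework, pool object, and gateway health path are assumptions, and only the two checks themselves come from this document.

from fastapi import FastAPI, Response
import httpx

app = FastAPI()

@app.get("/readyz")
async def readyz(response: Response):
    try:
        # 1. Database connection pool: run a trivial query.
        #    app.state.db_pool is assumed to be an asyncpg-style pool created at startup.
        async with app.state.db_pool.acquire() as conn:
            await conn.execute("SELECT 1")
        # 2. Auth gateway reachability (health path assumed).
        async with httpx.AsyncClient(timeout=2.0) as client:
            (await client.get("https://gateway.shopwave.internal/healthz")).raise_for_status()
    except Exception:
        response.status_code = 503
        return {"status": "not ready"}
    return {"status": "ready"}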

Dependencies

  • PostgreSQL 16 (RDS Multi-AZ)
  • Redis 7 (ElastiCache) for session caching and rate limiting
  • Auth Gateway (internal, runs in the same cluster)
  • Kafka (user-events topic) for publishing registration and deletion events

Open Questions

  • Should we migrate to passkeys (WebAuthn) for passwordless login? Currently scoped for Q3 2025.
  • Connection pool sizing needs revisiting after the February checkout outage (see postmortem).

Postmortem: 2025-01 Search Indexing Incident

· 2 min read

Date: 2025-01-08 · Duration: 2 hours 15 minutes · Severity: SEV-2 · Author: Anika Patel (Search Team) · Status: Action items complete

Summary

On January 8, 2025 at 09:45 UTC, the product search service began returning stale results for approximately 60% of queries. New products added in the previous 12 hours were not appearing in search, and price updates were not reflected. The incident lasted 2 hours 15 minutes until the indexing pipeline was repaired at 12:00 UTC.

Root Cause

Elasticsearch bulk indexing failures caused by a mapping conflict after a schema change was deployed without a corresponding index migration.

The catalog team deployed a change that added a variants field (nested object type) to the product schema. However, the existing Elasticsearch index had variants mapped as a keyword field from a previous prototype that was never cleaned up. The bulk indexer silently dropped documents that contained the new nested variants structure, logging warnings but not raising alerts.

Over 12 hours, roughly 18,000 product documents failed to index.
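To make the conflict concrete, here is a sketch of the two mappings and the repair path from the timeline; the index names and the non-variants fields are illustrative.

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder endpoint

# What the live index still had from the old prototype:
prototype_mapping = {"properties": {"variants": {"type": "keyword"}}}

# What the new product documents required:
correct_mapping = {
    "properties": {
        "variants": {
            "type": "nested",
            "properties": {"sku": {"type": "keyword"}, "price": {"type": "float"}},
        }
    }
}

# The fix: new index with the correct mapping, reindex from the catalog DB,
# then swap the alias so queries cut over atomically.
es.indices.create(index="products-v2", mappings=correct_mapping)
# ... bulk reindex from the catalog database ...
es.indices.update_aliases(actions=[
    {"remove": {"index": "products-v1", "alias": "products"}},
    {"add": {"index": "products-v2", "alias": "products"}},
])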

Timeline

Time (UTC)          Event
2025-01-07 21:30    Catalog team deploys product schema change (adds nested variants)
2025-01-07 21:32    Bulk indexer begins logging mapper_parsing_exception warnings
2025-01-08 09:45    Customer support tickets spike — "new products not showing in search"
2025-01-08 09:55    Search team investigates; discovers indexing error logs
2025-01-08 10:20    Root cause identified: mapping conflict on variants field
2025-01-08 10:45    Fix: create new index with correct mapping, reindex from catalog DB
2025-01-08 11:50    Reindexing completes; alias swapped to new index
2025-01-08 12:00    Search results verified; incident resolved

Contributing Factors

  1. No schema migration for Elasticsearch — The catalog team updated the application schema but did not run a corresponding ES index migration. The deploy checklist did not include search index compatibility checks.
  2. Silent failures — The bulk indexer logged warnings for mapping conflicts but did not alert or increment an error metric. The warnings were lost in log noise.
  3. No freshness monitoring — We had no alert for "time since last successful index update." A 12-hour gap went unnoticed (see the freshness sketch after this list).
  4. Leftover prototype mapping — The variants keyword field was added during a prototype 6 months ago and never removed.
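A sketch of the freshness check behind factor 3 and action item 3; it assumes each document carries an indexed_at timestamp, which this postmortem does not confirm.

from datetime import datetime, timedelta, timezone
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder endpoint

def search_is_fresh(max_age: timedelta = timedelta(hours=1)) -> bool:
    # Find the newest indexed_at across the product index.
    result = es.search(index="products", size=0,
                       aggs={"latest": {"max": {"field": "indexed_at"}}})
    latest_ms = result["aggregations"]["latest"]["value"]
    if latest_ms is None:
        return False  # an empty index counts as stale
    latest = datetime.fromtimestamp(latest_ms / 1000, tz=timezone.utc)
    return datetime.now(timezone.utc) - latest < max_age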

Action Items

#   Action                                                            Owner          Status
1   Add ES mapping compatibility check to CI pipeline                 Search Team    ✅ Done
2   Convert bulk indexer warnings to errors + PagerDuty alert         Search Team    ✅ Done
3   Add search freshness alert (warn if no docs indexed in 1 hour)    SRE            ✅ Done
4   Audit ES indices for stale/prototype mappings                     Search Team    ✅ Done
5   Add index migration step to deploy checklist                      Catalog Team   ✅ Done

Lessons Learned

  • Search indexes are part of the schema. Changing the application data model without updating the search mapping is equivalent to skipping a database migration.
  • Silent drops are worse than loud failures. The bulk indexer should have failed fast instead of silently skipping documents for 12 hours.
  • Monitor data freshness, not just availability. The search service was "up" the entire time — it just served stale data.

Postmortem: 2025-02 Checkout Outage

· 2 min read

Date: 2025-02-12 · Duration: 47 minutes · Severity: SEV-1 · Author: Jordan Park (SRE) · Status: Action items complete

Summary

On February 12, 2025 at 14:23 UTC, the checkout service began returning HTTP 503 errors to approximately 35% of customers attempting to complete purchases. The incident lasted 47 minutes until a configuration fix was deployed at 15:10 UTC. Estimated revenue impact: ~$280K in lost or delayed orders.

Root Cause

Connection pool exhaustion in the payments service due to missing timeout configuration.

The payments service (payments-svc) connects to the Stripe API through an internal connection pool (HikariCP, max pool size = 20). A Stripe API degradation at 14:20 UTC caused response times to increase from ~200ms to ~8 seconds. Because the pool's connection timeout had been explicitly set to 0 (an infinite wait, overriding HikariCP's 30-second default), threads waiting for a pool connection blocked indefinitely.

Within 3 minutes, all 20 connections were occupied by slow Stripe calls, and new checkout requests queued behind them. The queue grew until the service hit its thread limit (200 threads), at which point Kubernetes health checks started failing and pods entered CrashLoopBackOff.

The checkout service depends on payments-svc synchronously — when payments became unavailable, checkout returned 503.

Timeline

Time (UTC)   Event
14:20        Stripe API latency increases (p99 200ms → 8s)
14:23        payments-svc connection pool saturates; checkout errors begin
14:25        PagerDuty fires checkout-error-rate alert
14:28        On-call SRE acknowledges; begins investigation
14:35        Root cause identified: payments-svc thread dump shows all threads blocked on HikariCP getConnection()
14:42        Attempted fix: increase pool size to 50 — did not help (Stripe still slow)
14:55        Correct fix identified: add connectionTimeout=5000 to HikariCP config
15:02        Config change deployed via ConfigMap update + rolling restart
15:10        All pods healthy; error rate returns to baseline
15:15        Incident resolved; monitoring confirmed stable

Contributing Factors

  1. No connection timeout — HikariCP defaults to 30 seconds, but our config explicitly set it to 0 (infinite) based on a years-old tuning guide that prioritized throughput over resilience.
  2. No circuit breaker — The payments service had no circuit breaker on the Stripe integration, so it kept sending requests to a degraded upstream (see the sketch after this list).
  3. Synchronous dependency — Checkout blocks on payments; there is no async fallback or queue-based decoupling.
  4. Monitoring gap — We had alerts on checkout error rate but not on payments-svc connection pool utilization.
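The first two factors share one remedy: bound every wait and stop calling a degraded upstream. The production fix used HikariCP's connectionTimeout plus a Resilience4j breaker (both Java); the Python sketch below only illustrates the same pattern, with hypothetical endpoints and thresholds.

import time
import httpx

class SimpleBreaker:
    """Toy circuit breaker: open after N consecutive failures, retry after a cooldown."""

    def __init__(self, failure_threshold: int = 5, reset_after: float = 30.0):
        self.failures = 0
        self.opened_at = 0.0
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after

    def call(self, fn, *args, **kwargs):
        if self.failures >= self.failure_threshold:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: upstream marked unhealthy")
            self.failures = 0  # half-open: allow one trial request
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            self.opened_at = time.monotonic()
            raise
        self.failures = 0
        return result

stripe_breaker = SimpleBreaker()

def charge(payload: dict) -> dict:
    # Explicit timeout: never wait indefinitely for the upstream.
    resp = stripe_breaker.call(httpx.post, "https://payments.example.internal/charge",
                               json=payload, timeout=5.0)
    resp.raise_for_status()
    return resp.json()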

Action Items

#   Action                                                                     Owner           Status
1   Set connectionTimeout=5000 and maximumPoolSize=30 on all HikariCP pools    Platform Team   ✅ Done
2   Add circuit breaker (Resilience4j) to Stripe integration in payments-svc   Payments Team   ✅ Done
3   Add Grafana alert on HikariCP active connections > 80% of pool size        SRE             ✅ Done
4   Evaluate async checkout flow (publish to SQS, process payment async)       Checkout Team   🔄 In progress (Q2 target)
5   Audit all services for missing timeout configurations                      Platform Team   ✅ Done

Lessons Learned

  • Timeouts are not optional. Every connection pool, HTTP client, and RPC call must have an explicit timeout. "Infinite" is never the right default for production.
  • Pool exhaustion cascades fast. A 20-connection pool with no timeout can go from healthy to fully blocked in under 3 minutes during an upstream degradation.
  • Monitor pool internals, not just request outcomes. We caught the error rate spike quickly, but could have caught the pool saturation 2 minutes earlier if we'd been monitoring HikariCP metrics.

Deployment Rollback Runbook

· 2 min read

Owner: Platform SRE · Last updated: 2025-02-28

When to Use This Runbook

Use this procedure when a production deployment causes user-facing issues and you need to revert to the previous known-good state. Common triggers:

  • Error rate spikes above 1% on any service (PagerDuty alert svc-error-rate)
  • Latency p99 exceeds SLO for more than 5 minutes
  • Canary deployment fails automated smoke tests

Prerequisites

  • kubectl configured for the target cluster (shopwave-prod-us-east-1 or shopwave-prod-eu-west-1)
  • Membership in the sre-oncall or platform-eng RBAC group
  • Access to the #deploy Slack channel for coordination

Rollback Procedure

Step 1 — Confirm the Bad Deployment

# See current and previous revisions
kubectl rollout history deployment/<service-name> -n <namespace>

Verify that the most recent revision matches the deployment you want to revert.

Step 2 — Revert the Kubernetes Deployment

Revert the Kubernetes deployment to the previous revision using kubectl rollout undo:

kubectl rollout undo deployment/<service-name> -n <namespace>

To roll back to a specific revision (not just the previous one):

kubectl rollout undo deployment/<service-name> -n <namespace> --to-revision=<N>

Step 3 — Verify the Rollback

# Watch rollout progress
kubectl rollout status deployment/<service-name> -n <namespace>

# Confirm the running image
kubectl get deployment/<service-name> -n <namespace> \
-o jsonpath='{.spec.template.spec.containers[0].image}'

Check that the service's /readyz endpoint returns 200 and that error rates are returning to baseline in Grafana.

Step 4 — Notify the Team

Post in #deploy:

🚨 Rolled back <service-name> in <region> from revision N to revision N-1. Reason: <reason>. Investigating root cause.

Tag the on-call engineer and link to the relevant PagerDuty incident.

Database Migrations

If the bad deployment included a database migration, rolling back the Kubernetes deployment alone is not sufficient. You must also revert the migration:

  1. Check schema_migrations for the most recent migration version.
  2. Run the down migration: python manage.py migrate <app> <previous_version>
  3. Verify schema state matches the reverted application code.

⚠️ Irreversible migrations (e.g., column drops) cannot be rolled back this way. If the migration was destructive, escalate to the database team immediately.

Post-Rollback

  • File a postmortem if the incident lasted more than 15 minutes or affected more than 0.1% of requests.
  • Update the deployment ticket in Jira with the rollback details.
  • Schedule a blameless review within 48 hours.

Contacts

Role                 Slack handle
SRE on-call          @sre-oncall
Platform lead        @maria.chen
Database team        @db-oncall
Incident commander   Rotating — check PagerDuty

Building a Self-Improving Agent with Llama Stack

· 7 min read
Raghotham Murthy
Llama Stack Core Team

What if your AI agent could improve itself? Most agent tutorials show a single loop — user asks a question, the agent calls some tools, returns an answer. But what happens when you need to systematically improve your agent's behavior over time?

In this post, we build a ResearchAgent that answers questions from an internal engineering knowledge base — and gets better at it automatically. The agent uses the Responses API agentic loop with file_search and client-side tools to research questions, and it owns its own system prompt. Every N calls, it benchmarks itself by using a different model to judge the results, and rewrites its own prompt via the Prompts API.

This is literally self-referential: a Llama Stack agent evaluating and improving itself using the Responses API, Prompts API, and Vector Stores as its toolkit.
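As a taste of the shape of that loop, here is a heavily simplified skeleton, not the post's implementation: the base URL, model names, and judging prompt are placeholders, and the real agent persists its prompt through the Prompts API rather than a module-level variable.

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8321/v1/openai/v1", api_key="not-needed")

system_prompt = "Answer questions using the engineering knowledge base."
transcript: list[dict] = []

def ask(question: str) -> str:
    resp = client.responses.create(
        model="worker-model",  # placeholder
        instructions=system_prompt,
        input=question,
        tools=[{"type": "file_search", "vector_store_ids": ["vs_eng_kb"]}],
    )
    transcript.append({"q": question, "a": resp.output_text})
    return resp.output_text

def maybe_self_improve(every_n: int = 10) -> None:
    global system_prompt
    if not transcript or len(transcript) % every_n:
        return
    # A *different* model judges recent answers and proposes a revised prompt.
    review = client.responses.create(
        model="judge-model",  # placeholder
        input=(
            "Grade these Q/A pairs and rewrite the system prompt to fix recurring "
            f"weaknesses.\n\nCurrent prompt: {system_prompt}\n\n"
            f"Recent transcript: {transcript[-every_n:]}"
        ),
    )
    system_prompt = review.output_text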