
Llama Stack Observability: Metrics, Traces, and Dashboards with OpenTelemetry

· 7 min read

Running an LLM application in production is nothing like running a traditional web service. Responses are non-deterministic. Latency swings wildly with model size and token count. And failures are often silent — a tool call that returns garbage still comes back as a 200 OK. You can stare at your HTTP dashboard all day and have no idea that half your users are getting bad answers.

We recently shipped built-in observability for Llama Stack, powered by OpenTelemetry. Three environment variables, zero code changes, and you get metrics and traces from every layer — HTTP requests, inference calls, tool invocations, vector store operations, all the way down.
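As a sketch of what that setup can look like: the exact variables Llama Stack reads are covered in the full post; the names below are the standard OpenTelemetry ones and the distribution name is a placeholder, so treat both as assumptions.

import os
import subprocess

# Standard OpenTelemetry exporter settings; Llama Stack's documented
# variable names may differ slightly from these.
os.environ["OTEL_SERVICE_NAME"] = "llama-stack"
os.environ["OTEL_EXPORTER_OTLP_ENDPOINT"] = "http://localhost:4318"
os.environ["OTEL_EXPORTER_OTLP_PROTOCOL"] = "http/protobuf"

# Launch the server as usual -- no application code changes.
# "starter" is a placeholder distribution name.
subprocess.run(["llama", "stack", "run", "starter"], env=dict(os.environ))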

This post explains the architecture behind it, walks through a hands-on tutorial, and shows what you can actually see once it's running.

Llama Stack Achieves 100% Open Responses Compliance: Enterprise-Grade OpenAI Compatibility for Your Infrastructure

· 5 min read
Francisco Javier Arceo
Llama Stack Core Team
Charlie Doern
Llama Stack Core Team

We're excited to share that Llama Stack has achieved 100% compliance with the Open Responses specification and been officially recognized as part of the Open Responses community. This milestone represents more than just compatibility: it's about bringing enterprise-grade AI capabilities to your own infrastructure with the familiarity of OpenAI APIs.

With comprehensive support for Files, Vector Stores, Search, Conversations, Prompts, Chat Completions, the full Responses API, plus powerful extensions like MCP tool integration, Tool Calling, and Connectors, Llama Stack offers something unique in the AI infrastructure landscape: a SaaS-like experience that runs entirely on your terms.

Your Agent, Your Rules: Building Powerful Agents with the Responses API in Llama Stack

· 5 min read

The Responses API is rapidly emerging as one of the most influential interfaces for building AI agents. It handles multi-step reasoning, tool orchestration, and conversational state in a single interaction, which is a big improvement over the manual orchestration loops that developers had to build on top of chat completion APIs. Llama Stack's implementation of the Responses API brings these capabilities to the open source world, where you can choose your own models and run on your own infrastructure.
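As a rough sketch of what a single Responses API call can look like against a self-hosted server: the base URL, model name, and vector store ID below are placeholders, not values from this post.

from openai import OpenAI

# Point the standard OpenAI client at a Llama Stack deployment.
client = OpenAI(base_url="http://localhost:8321/v1/openai/v1", api_key="not-needed")

response = client.responses.create(
    model="meta-llama/Llama-3.3-70B-Instruct",  # any model your stack serves
    input="What changed in the payments service this week?",
    tools=[{"type": "file_search", "vector_store_ids": ["vs_internal_docs"]}],
)
print(response.output_text)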

This post covers why the Responses API matters, what Llama Stack's implementation enables, and how it connects to the broader move toward open agent standards like Open Responses.

User Service v2 — Design Document

· 2 min read

Author: Platform Team · Status: Approved · Last updated: 2025-01-15

Overview

User Service v2 is the central identity and profile service for the Shopwave e-commerce platform. It owns user registration, login, profile management, and session lifecycle. v2 replaces the monolithic accounts module that lived inside the Rails checkout app.

Authentication

All authentication flows go through the auth gateway (gateway.shopwave.internal). The gateway issues JWT tokens signed with RS256 (RSA 2048-bit keys rotated quarterly). Access tokens have a 15-minute TTL; refresh tokens last 30 days and are stored in an HTTP-only secure cookie.

Token verification is handled by a shared middleware library (@shopwave/jwt-verify) that fetches the public key set from the gateway's /.well-known/jwks.json endpoint and caches it for 5 minutes.
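For illustration, a minimal sketch of that verification flow, shown in Python with PyJWT rather than the actual @shopwave/jwt-verify library; claim checks beyond signature and algorithm are omitted.

import jwt
from jwt import PyJWKClient

# Fetch and cache the gateway's signing keys; lifespan mirrors the 5-minute cache.
jwks_client = PyJWKClient(
    "https://gateway.shopwave.internal/.well-known/jwks.json",
    lifespan=300,
)

def verify_token(token: str) -> dict:
    # Pick the key referenced by the token's `kid` header, then verify RS256.
    signing_key = jwks_client.get_signing_key_from_jwt(token)
    return jwt.decode(token, signing_key.key, algorithms=["RS256"])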

Token claims

Claim   Description
sub     User UUID
email   Verified email address
roles   Array of role strings (customer, admin, support)
org     Merchant organization ID (multi-tenant)

Data Model

User records live in a PostgreSQL 16 cluster (users-primary.db.shopwave.internal). The schema is straightforward:

  • users — core identity (uuid, email, hashed_password, created_at)
  • profiles — display name, avatar URL, locale, timezone
  • sessions — active refresh tokens with device fingerprint and IP
  • audit_log — immutable append-only log of login, logout, and password-change events

We use row-level security (RLS) so each merchant organization can only see its own users. The org claim in the JWT maps directly to the RLS policy.
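A sketch of how that mapping could look; the policy, column, and setting names here are illustrative, since the actual schema details are not spelled out in this document.

import psycopg

# Illustrative policy: each row carries the owning org, and queries only see
# rows whose org matches the session setting derived from the JWT's `org` claim.
POLICY_SQL = """
ALTER TABLE users ENABLE ROW LEVEL SECURITY;
CREATE POLICY org_isolation ON users
    USING (org_id = current_setting('app.current_org'));
"""

def list_users(conn: psycopg.Connection, org_claim: str):
    with conn.cursor() as cur:
        # Set the org for this transaction from the verified JWT claim.
        cur.execute("SELECT set_config('app.current_org', %s, true)", (org_claim,))
        cur.execute("SELECT uuid, email FROM users")
        return cur.fetchall()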

API Surface

The service exposes a gRPC API internally and an OpenAPI REST gateway for the storefront. Key endpoints:

Method   Path                      Description
POST     /v2/auth/register         Create account
POST     /v2/auth/login            Issue tokens
POST     /v2/auth/refresh          Rotate access token
GET      /v2/users/{id}/profile    Read profile
PATCH    /v2/users/{id}/profile    Update profile
DELETE   /v2/users/{id}            GDPR deletion request

Rate limits: 20 req/s per IP on auth endpoints, 100 req/s on profile reads.
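These limits are enforced against the Redis cluster listed under Dependencies. A simple fixed-window sketch of the auth-endpoint limit follows; the key naming and the exact algorithm used in production are not specified in this document.

import redis

r = redis.Redis(host="redis.shopwave.internal", port=6379)  # placeholder host

def allow_auth_request(ip: str, limit: int = 20) -> bool:
    # Count requests from this IP in the current one-second window.
    key = f"ratelimit:auth:{ip}"
    count = r.incr(key)
    if count == 1:
        r.expire(key, 1)
    return count <= limit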

Deployment

User Service v2 runs as a Kubernetes Deployment in the platform namespace (us-east-1 and eu-west-1 regions). Each region has 3 replicas behind an internal ALB. The Docker image is built in CI and pushed to our private ECR registry.

Health checks:

  • Liveness: /healthz (checks process is up)
  • Readiness: /readyz (checks DB connection pool + auth gateway reachability)
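A sketch of what the readiness handler checks; the framework, pool object, and gateway health path are assumptions, and only the two checks themselves come from this document.

from fastapi import FastAPI, Response
import httpx

app = FastAPI()

@app.get("/readyz")
async def readyz(response: Response):
    try:
        # 1. Database connection pool: run a trivial query.
        #    app.state.db_pool is assumed to be an asyncpg-style pool created at startup.
        async with app.state.db_pool.acquire() as conn:
            await conn.execute("SELECT 1")
        # 2. Auth gateway reachability (health path assumed).
        async with httpx.AsyncClient(timeout=2.0) as client:
            (await client.get("https://gateway.shopwave.internal/healthz")).raise_for_status()
    except Exception:
        response.status_code = 503
        return {"status": "not ready"}
    return {"status": "ready"}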

Dependencies

  • PostgreSQL 16 (RDS Multi-AZ)
  • Redis 7 (ElastiCache) for session caching and rate limiting
  • Auth Gateway (internal, runs in the same cluster)
  • Kafka (user-events topic) for publishing registration and deletion events

Open Questions

  • Should we migrate to passkeys (WebAuthn) for passwordless login? Currently scoped for Q3 2025.
  • Connection pool sizing needs revisiting after the February checkout outage (see postmortem).

Postmortem: 2025-01 Search Indexing Incident

· 2 min read

Date: 2025-01-08 · Duration: 2 hours 15 minutes · Severity: SEV-2 · Author: Anika Patel (Search Team) · Status: Action items complete

Summary

On January 8, 2025 at 09:45 UTC, the product search service began returning stale results for approximately 60% of queries. New products added in the previous 12 hours were not appearing in search, and price updates were not reflected. The incident lasted 2 hours 15 minutes until the indexing pipeline was repaired at 12:00 UTC.

Root Cause

Elasticsearch bulk indexing failures caused by a mapping conflict after a schema change was deployed without a corresponding index migration.

The catalog team deployed a change that added a variants field (nested object type) to the product schema. However, the existing Elasticsearch index had variants mapped as a keyword field from a previous prototype that was never cleaned up. The bulk indexer silently dropped documents that contained the new nested variants structure, logging warnings but not raising alerts.

Over 12 hours, roughly 18,000 product documents failed to index.
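To make the conflict concrete, here is a sketch of the two mappings and the repair path from the timeline; the index names and the non-variants fields are illustrative.

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder endpoint

# What the live index still had from the old prototype:
prototype_mapping = {"properties": {"variants": {"type": "keyword"}}}

# What the new product documents required:
correct_mapping = {
    "properties": {
        "variants": {
            "type": "nested",
            "properties": {"sku": {"type": "keyword"}, "price": {"type": "float"}},
        }
    }
}

# The fix: new index with the correct mapping, reindex from the catalog DB,
# then swap the alias so queries cut over atomically.
es.indices.create(index="products-v2", mappings=correct_mapping)
# ... bulk reindex from the catalog database ...
es.indices.update_aliases(actions=[
    {"remove": {"index": "products-v1", "alias": "products"}},
    {"add": {"index": "products-v2", "alias": "products"}},
])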

Timeline

Time (UTC)          Event
2025-01-07 21:30    Catalog team deploys product schema change (adds nested variants)
2025-01-07 21:32    Bulk indexer begins logging mapper_parsing_exception warnings
2025-01-08 09:45    Customer support tickets spike — "new products not showing in search"
2025-01-08 09:55    Search team investigates; discovers indexing error logs
2025-01-08 10:20    Root cause identified: mapping conflict on variants field
2025-01-08 10:45    Fix: create new index with correct mapping, reindex from catalog DB
2025-01-08 11:50    Reindexing completes; alias swapped to new index
2025-01-08 12:00    Search results verified; incident resolved

Contributing Factors

  1. No schema migration for Elasticsearch — The catalog team updated the application schema but did not run a corresponding ES index migration. The deploy checklist did not include search index compatibility checks.
  2. Silent failures — The bulk indexer logged warnings for mapping conflicts but did not alert or increment an error metric. The warnings were lost in log noise.
  3. No freshness monitoring — We had no alert for "time since last successful index update." A 12-hour gap went unnoticed (see the freshness sketch after this list).
  4. Leftover prototype mapping — The variants keyword field was added during a prototype 6 months ago and never removed.
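A sketch of the freshness check behind factor 3 and action item 3; it assumes each document carries an indexed_at timestamp, which this postmortem does not confirm.

from datetime import datetime, timedelta, timezone
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder endpoint

def search_is_fresh(max_age: timedelta = timedelta(hours=1)) -> bool:
    # Find the newest indexed_at across the product index.
    result = es.search(index="products", size=0,
                       aggs={"latest": {"max": {"field": "indexed_at"}}})
    latest_ms = result["aggregations"]["latest"]["value"]
    if latest_ms is None:
        return False  # an empty index counts as stale
    latest = datetime.fromtimestamp(latest_ms / 1000, tz=timezone.utc)
    return datetime.now(timezone.utc) - latest < max_age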

Action Items

#   Action                                                            Owner          Status
1   Add ES mapping compatibility check to CI pipeline                 Search Team    ✅ Done
2   Convert bulk indexer warnings to errors + PagerDuty alert         Search Team    ✅ Done
3   Add search freshness alert (warn if no docs indexed in 1 hour)    SRE            ✅ Done
4   Audit ES indices for stale/prototype mappings                     Search Team    ✅ Done
5   Add index migration step to deploy checklist                      Catalog Team   ✅ Done

Lessons Learned

  • Search indexes are part of the schema. Changing the application data model without updating the search mapping is equivalent to skipping a database migration.
  • Silent drops are worse than loud failures. The bulk indexer should have failed fast instead of silently skipping documents for 12 hours.
  • Monitor data freshness, not just availability. The search service was "up" the entire time — it just served stale data.

Postmortem: 2025-02 Checkout Outage

· 2 min read

Date: 2025-02-12 · Duration: 47 minutes · Severity: SEV-1 · Author: Jordan Park (SRE) · Status: Action items complete

Summary

On February 12, 2025 at 14:23 UTC, the checkout service began returning HTTP 503 errors to approximately 35% of customers attempting to complete purchases. The incident lasted 47 minutes until a configuration fix was deployed at 15:10 UTC. Estimated revenue impact: ~$280K in lost or delayed orders.

Root Cause

Connection pool exhaustion in the payments service due to missing timeout configuration.

The payments service (payments-svc) connects to the Stripe API through an internal connection pool (HikariCP, max pool size = 20). A Stripe API degradation at 14:20 UTC caused response times to increase from ~200ms to ~8 seconds. Because the pool's connection timeout had been explicitly set to 0 (an infinite wait, overriding HikariCP's 30-second default), threads waiting for a pool connection blocked indefinitely.

Within 3 minutes, all 20 connections were occupied by slow Stripe calls, and new checkout requests queued behind them. The queue grew until the service hit its thread limit (200 threads), at which point Kubernetes health checks started failing and pods entered CrashLoopBackOff.

The checkout service depends on payments-svc synchronously — when payments became unavailable, checkout returned 503.

Timeline

Time (UTC)   Event
14:20        Stripe API latency increases (p99 200ms → 8s)
14:23        payments-svc connection pool saturates; checkout errors begin
14:25        PagerDuty fires checkout-error-rate alert
14:28        On-call SRE acknowledges; begins investigation
14:35        Root cause identified: payments-svc thread dump shows all threads blocked on HikariCP getConnection()
14:42        Attempted fix: increase pool size to 50 — did not help (Stripe still slow)
14:55        Correct fix identified: add connectionTimeout=5000 to HikariCP config
15:02        Config change deployed via ConfigMap update + rolling restart
15:10        All pods healthy; error rate returns to baseline
15:15        Incident resolved; monitoring confirmed stable

Contributing Factors

  1. No connection timeout — HikariCP defaults to 30 seconds, but our config explicitly set it to 0 (infinite) based on a years-old tuning guide that prioritized throughput over resilience.
  2. No circuit breaker — The payments service had no circuit breaker on the Stripe integration, so it kept sending requests to a degraded upstream (see the sketch after this list).
  3. Synchronous dependency — Checkout blocks on payments; there is no async fallback or queue-based decoupling.
  4. Monitoring gap — We had alerts on checkout error rate but not on payments-svc connection pool utilization.
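The first two factors share one remedy: bound every wait and stop calling a degraded upstream. The production fix used HikariCP's connectionTimeout plus a Resilience4j breaker (both Java); the Python sketch below only illustrates the same pattern, with hypothetical endpoints and thresholds.

import time
import httpx

class SimpleBreaker:
    """Toy circuit breaker: open after N consecutive failures, retry after a cooldown."""

    def __init__(self, failure_threshold: int = 5, reset_after: float = 30.0):
        self.failures = 0
        self.opened_at = 0.0
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after

    def call(self, fn, *args, **kwargs):
        if self.failures >= self.failure_threshold:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: upstream marked unhealthy")
            self.failures = 0  # half-open: allow one trial request
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            self.opened_at = time.monotonic()
            raise
        self.failures = 0
        return result

stripe_breaker = SimpleBreaker()

def charge(payload: dict) -> dict:
    # Explicit timeout: never wait indefinitely for the upstream.
    resp = stripe_breaker.call(httpx.post, "https://payments.example.internal/charge",
                               json=payload, timeout=5.0)
    resp.raise_for_status()
    return resp.json()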

Action Items

#   Action                                                                     Owner           Status
1   Set connectionTimeout=5000 and maximumPoolSize=30 on all HikariCP pools    Platform Team   ✅ Done
2   Add circuit breaker (Resilience4j) to Stripe integration in payments-svc   Payments Team   ✅ Done
3   Add Grafana alert on HikariCP active connections > 80% of pool size        SRE             ✅ Done
4   Evaluate async checkout flow (publish to SQS, process payment async)       Checkout Team   🔄 In progress (Q2 target)
5   Audit all services for missing timeout configurations                      Platform Team   ✅ Done

Lessons Learned

  • Timeouts are not optional. Every connection pool, HTTP client, and RPC call must have an explicit timeout. "Infinite" is never the right default for production.
  • Pool exhaustion cascades fast. A 20-connection pool with no timeout can go from healthy to fully blocked in under 3 minutes during an upstream degradation.
  • Monitor pool internals, not just request outcomes. We caught the error rate spike quickly, but could have caught the pool saturation 2 minutes earlier if we'd been monitoring HikariCP metrics.

Deployment Rollback Runbook

· 2 min read

Owner: Platform SRE · Last updated: 2025-02-28

When to Use This Runbook

Use this procedure when a production deployment causes user-facing issues and you need to revert to the previous known-good state. Common triggers:

  • Error rate spikes above 1% on any service (PagerDuty alert svc-error-rate)
  • Latency p99 exceeds SLO for more than 5 minutes
  • Canary deployment fails automated smoke tests

Prerequisites

  • kubectl configured for the target cluster (shopwave-prod-us-east-1 or shopwave-prod-eu-west-1)
  • Membership in the sre-oncall or platform-eng RBAC group
  • Access to the #deploy Slack channel for coordination

Rollback Procedure

Step 1 — Confirm the Bad Deployment

# See current and previous revisions
kubectl rollout history deployment/<service-name> -n <namespace>

Verify that the most recent revision matches the deployment you want to revert.

Step 2 — Revert the Kubernetes Deployment

Revert the Kubernetes deployment to the previous revision using kubectl rollout undo:

kubectl rollout undo deployment/<service-name> -n <namespace>

To roll back to a specific revision (not just the previous one):

kubectl rollout undo deployment/<service-name> -n <namespace> --to-revision=<N>

Step 3 — Verify the Rollback

# Watch rollout progress
kubectl rollout status deployment/<service-name> -n <namespace>

# Confirm the running image
kubectl get deployment/<service-name> -n <namespace> \
-o jsonpath='{.spec.template.spec.containers[0].image}'

Check that the service's /readyz endpoint returns 200 and that error rates are returning to baseline in Grafana.

Step 4 — Notify the Team

Post in #deploy:

🚨 Rolled back <service-name> in <region> from revision N to revision N-1. Reason: <reason>. Investigating root cause.

Tag the on-call engineer and link to the relevant PagerDuty incident.

Database Migrations

If the bad deployment included a database migration, rolling back the Kubernetes deployment alone is not sufficient. You must also revert the migration:

  1. Check schema_migrations for the most recent migration version.
  2. Run the down migration: python manage.py migrate <app> <previous_version>
  3. Verify schema state matches the reverted application code.

⚠️ Irreversible migrations (e.g., column drops) cannot be rolled back this way. If the migration was destructive, escalate to the database team immediately.

Post-Rollback

  • File a postmortem if the incident lasted more than 15 minutes or affected more than 0.1% of requests.
  • Update the deployment ticket in Jira with the rollback details.
  • Schedule a blameless review within 48 hours.

Contacts

Role                 Slack handle
SRE on-call          @sre-oncall
Platform lead        @maria.chen
Database team        @db-oncall
Incident commander   Rotating — check PagerDuty

Building a Self-Improving Agent with Llama Stack

· 7 min read
Raghotham Murthy
Llama Stack Core Team

What if your AI agent could improve itself? Most agent tutorials show a single loop — user asks a question, the agent calls some tools, returns an answer. But what happens when you need to systematically improve your agent's behavior over time?

In this post, we build a ResearchAgent that answers questions from an internal engineering knowledge base — and gets better at it automatically. The agent uses the Responses API agentic loop with file_search and client-side tools to research questions, and it owns its own system prompt. Every N calls, it benchmarks itself by using a different model to judge the results, and rewrites its own prompt via the Prompts API.

This is literally self-referential: a Llama Stack agent evaluating and improving itself using the Responses API, Prompts API, and Vector Stores as its toolkit.
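As a taste of the shape of that loop, here is a heavily simplified skeleton, not the post's implementation: the base URL, model names, and judging prompt are placeholders, and the real agent persists its prompt through the Prompts API rather than a module-level variable.

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8321/v1/openai/v1", api_key="not-needed")

system_prompt = "Answer questions using the engineering knowledge base."
transcript: list[dict] = []

def ask(question: str) -> str:
    resp = client.responses.create(
        model="worker-model",  # placeholder
        instructions=system_prompt,
        input=question,
        tools=[{"type": "file_search", "vector_store_ids": ["vs_eng_kb"]}],
    )
    transcript.append({"q": question, "a": resp.output_text})
    return resp.output_text

def maybe_self_improve(every_n: int = 10) -> None:
    global system_prompt
    if not transcript or len(transcript) % every_n:
        return
    # A *different* model judges recent answers and proposes a revised prompt.
    review = client.responses.create(
        model="judge-model",  # placeholder
        input=(
            "Grade these Q/A pairs and rewrite the system prompt to fix recurring "
            f"weaknesses.\n\nCurrent prompt: {system_prompt}\n\n"
            f"Recent transcript: {transcript[-every_n:]}"
        ),
    )
    system_prompt = review.output_text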