<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
    <id>https://llamastack.github.io/blog</id>
    <title>Llama Stack Blog</title>
    <updated>2026-04-06T00:00:00.000Z</updated>
    <generator>https://github.com/jpmonette/feed</generator>
    <link rel="alternate" href="https://llamastack.github.io/blog"/>
    <subtitle>Blog posts about Llama Stack</subtitle>
    <icon>https://llamastack.github.io/img/favicon.ico</icon>
    <rights>Copyright © 2026 Meta Platforms, Inc.</rights>
    <entry>
        <title type="html"><![CDATA[Tracing LlamaStack Applications with MLflow: SDK vs OTel Collector]]></title>
        <id>https://llamastack.github.io/blog/mlflow-observability</id>
        <link href="https://llamastack.github.io/blog/mlflow-observability"/>
        <updated>2026-04-06T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[As LLM-powered applications grow in complexity, observability becomes essential. You need to understand what your application is doing — what prompts are being sent, what responses come back, how long each call takes, and how many tokens are consumed. MLflow provides a powerful tracing framework that captures all of this, which can be integrated with llamastack for observability.]]></summary>
        <content type="html"><![CDATA[<p>As LLM-powered applications grow in complexity, observability becomes essential. You need to understand what your application is doing — what prompts are being sent, what responses come back, how long each call takes, and how many tokens are consumed. <a href="https://mlflow.org/" target="_blank" rel="noopener noreferrer" class="">MLflow</a> provides a powerful tracing framework that captures all of this, which can be integrated with <a href="https://github.com/llamastack/llama-stack" target="_blank" rel="noopener noreferrer" class="">llamastack</a> for observability.</p>
<p>In this post, we'll walk through two approaches for exporting LlamaStack traces into MLflow:</p>
<ol>
<li class=""><strong>MLflow SDK</strong> — Direct instrumentation using MLflow's built-in tracing and autologging</li>
<li class=""><strong>OTel Collector</strong> — Decoupled telemetry pipeline using OpenTelemetry auto-instrumentation and an OTel Collector as the intermediary</li>
</ol>
<p>By the end, you'll understand when to use each approach and how to set them up.</p>
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="mlflow-tracing-a-quick-overview">MLflow Tracing: A Quick Overview<a href="https://llamastack.github.io/blog/mlflow-observability#mlflow-tracing-a-quick-overview" class="hash-link" aria-label="Direct link to MLflow Tracing: A Quick Overview" title="Direct link to MLflow Tracing: A Quick Overview" translate="no">​</a></h2>
<p>MLflow is an open-source platform for managing the ML lifecycle. Starting with version 2.14, MLflow introduced <strong>GenAI tracing</strong> — a first-class feature for capturing LLM interactions including:</p>
<ul>
<li class=""><strong>Traces and Spans</strong>: Hierarchical representation of operations (API calls, tool invocations, chain steps)</li>
<li class=""><strong>Token Usage &amp; Latency</strong>: Automatic capture of input/output tokens and response times</li>
<li class=""><strong>Input/Output Logging</strong>: Full request and response payloads for debugging</li>
<li class=""><strong>Web UI</strong>: A built-in dashboard for exploring, filtering, and analyzing traces</li>
</ul>
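<p>As a quick taste of the tracing API (separate from the autologging used later in this post), here's a minimal sketch using the <code>@mlflow.trace</code> decorator — the function names and return values are illustrative, not part of any real pipeline:</p>
<pre><code class="language-python">import mlflow

mlflow.set_tracking_uri("http://localhost:5000")
mlflow.set_experiment("LlamaStack Demo")

# Each decorated function becomes a span; nested calls form the trace tree.
@mlflow.trace
def retrieve(query: str) -> list[str]:
    return ["LlamaStack provides an OpenAI-compatible API."]

@mlflow.trace
def answer(query: str) -> str:
    docs = retrieve(query)  # recorded as a child span
    return f"Found {len(docs)} document(s): {docs[0]}"

answer("What is LlamaStack?")
</code></pre>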
<p>MLflow also supports ingesting traces via the <strong>OpenTelemetry (OTLP) protocol</strong> at its <code>/v1/traces</code> endpoint, which opens the door to vendor-neutral instrumentation — more on that in the OTel Collector section.</p>
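<p>To see that endpoint in action without a collector, here's a hedged sketch that pushes one hand-made span straight to MLflow over OTLP. It assumes an MLflow 3.x server at <code>localhost:5000</code> and an existing experiment with ID <code>1</code>:</p>
<pre><code class="language-python"># Sketch: direct OTLP export to MLflow's /v1/traces endpoint (no collector).
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter

exporter = OTLPSpanExporter(
    endpoint="http://localhost:5000/v1/traces",
    headers={"x-mlflow-experiment-id": "1"},  # routes the trace to experiment 1
)
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(exporter))
trace.set_tracer_provider(provider)

with trace.get_tracer("demo").start_as_current_span("hello-mlflow"):
    pass  # the span appears in MLflow's Traces tab after flush

provider.force_flush()
</code></pre>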
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="llamastack-and-its-openai-compatible-api">LlamaStack and Its OpenAI-Compatible API<a href="https://llamastack.github.io/blog/mlflow-observability#llamastack-and-its-openai-compatible-api" class="hash-link" aria-label="Direct link to LlamaStack and Its OpenAI-Compatible API" title="Direct link to LlamaStack and Its OpenAI-Compatible API" translate="no">​</a></h2>
<p>LlamaStack provides an OpenAI-compatible API, meaning any tooling that works with OpenAI's chat completions API or responses API also works with LlamaStack. This is key for tracing: we can leverage existing OpenAI instrumentation libraries (both MLflow's <code>openai.autolog()</code> and OpenTelemetry's <code>opentelemetry-instrumentation-openai-v2</code>) to capture traces without writing custom code.</p>
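<p>Concretely, the stock OpenAI client works unchanged against LlamaStack — only the base URL (and a placeholder API key) differ. The model name below is an example; use whatever your server has registered:</p>
<pre><code class="language-python">from openai import OpenAI

# Same client library you'd point at api.openai.com; only base_url changes.
client = OpenAI(base_url="http://localhost:8321/v1", api_key="none")

resp = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Say hello."}],
)
print(resp.choices[0].message.content)
</code></pre>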
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="architecture-overview">Architecture Overview<a href="https://llamastack.github.io/blog/mlflow-observability#architecture-overview" class="hash-link" aria-label="Direct link to Architecture Overview" title="Direct link to Architecture Overview" translate="no">​</a></h2>
<p>Before diving into the details, here's a high-level view of both approaches:</p>
<!-- -->
<!-- -->
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="prerequisites">Prerequisites<a href="https://llamastack.github.io/blog/mlflow-observability#prerequisites" class="hash-link" aria-label="Direct link to Prerequisites" title="Direct link to Prerequisites" translate="no">​</a></h2>
<p>For both approaches, you'll need:</p>
<ul>
<li class=""><strong>Python 3.10+</strong></li>
<li class=""><strong>A running LlamaStack server</strong> (e.g., at <code>http://localhost:8321</code>)</li>
<li class=""><strong>MLflow &gt;= 3.10</strong> with GenAI extras:</li>
</ul>
<div class="language-bash codeBlockContainer_Ckt0 theme-code-block" style="--prism-background-color:hsl(220, 13%, 18%);--prism-color:hsl(220, 14%, 71%)"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-bash codeBlock_bY9V thin-scrollbar" style="background-color:hsl(220, 13%, 18%);color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">pip </span><span class="token function" style="color:hsl(207, 82%, 66%)">install</span><span class="token plain"> </span><span class="token string" style="color:hsl(95, 38%, 62%)">"mlflow[genai]&gt;=3.10"</span><span class="token plain"> </span><span class="token string" style="color:hsl(95, 38%, 62%)">"openai&gt;=2.20.0"</span><br></span></code></pre></div></div>
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="start-the-mlflow-tracking-server">Start the MLflow Tracking Server<a href="https://llamastack.github.io/blog/mlflow-observability#start-the-mlflow-tracking-server" class="hash-link" aria-label="Direct link to Start the MLflow Tracking Server" title="Direct link to Start the MLflow Tracking Server" translate="no">​</a></h3>
<p>Launch a local MLflow server with SQLite as the backend store:</p>
<div class="language-bash codeBlockContainer_Ckt0 theme-code-block" style="--prism-background-color:hsl(220, 13%, 18%);--prism-color:hsl(220, 14%, 71%)"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-bash codeBlock_bY9V thin-scrollbar" style="background-color:hsl(220, 13%, 18%);color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">mlflow server </span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">\</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">  --backend-store-uri sqlite:///mlflow.db </span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">\</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">  --default-artifact-root ./mlruns </span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">\</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">  </span><span class="token parameter variable" style="color:hsl(207, 82%, 66%)">--host</span><span class="token plain"> </span><span class="token number" style="color:hsl(29, 54%, 61%)">0.0</span><span class="token plain">.0.0 </span><span class="token parameter variable" style="color:hsl(207, 82%, 66%)">--port</span><span class="token plain"> </span><span class="token number" style="color:hsl(29, 54%, 61%)">5000</span><br></span></code></pre></div></div>
<p>The MLflow UI will be available at <a href="http://localhost:5000/" target="_blank" rel="noopener noreferrer" class="">http://localhost:5000</a>.</p>
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="approach-1-tracing-via-mlflow-sdk">Approach 1: Tracing via MLflow SDK<a href="https://llamastack.github.io/blog/mlflow-observability#approach-1-tracing-via-mlflow-sdk" class="hash-link" aria-label="Direct link to Approach 1: Tracing via MLflow SDK" title="Direct link to Approach 1: Tracing via MLflow SDK" translate="no">​</a></h2>
<p>This approach uses MLflow's native tracing SDK to capture and export traces directly to the MLflow server. It's the simplest way to get started.</p>
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="step-1-instrument-your-code">Step 1: Instrument Your Code<a href="https://llamastack.github.io/blog/mlflow-observability#step-1-instrument-your-code" class="hash-link" aria-label="Direct link to Step 1: Instrument Your Code" title="Direct link to Step 1: Instrument Your Code" translate="no">​</a></h3>
<p>Add MLflow tracing to your LlamaStack client code. The example below uses the <a href="https://platform.openai.com/docs/api-reference/responses" target="_blank" rel="noopener noreferrer" class="">Responses API</a> (<code>client.responses.create</code>), which is the recommended way to interact with LlamaStack:</p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-background-color:hsl(220, 13%, 18%);--prism-color:hsl(220, 14%, 71%)"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="background-color:hsl(220, 13%, 18%);color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token keyword" style="color:hsl(286, 60%, 67%)">import</span><span class="token plain"> mlflow</span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain"></span><span class="token keyword" style="color:hsl(286, 60%, 67%)">import</span><span class="token plain"> mlflow</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">tracing </span><span class="token keyword" style="color:hsl(286, 60%, 67%)">as</span><span class="token plain"> mlflow_tracing</span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain"></span><span class="token keyword" style="color:hsl(286, 60%, 67%)">from</span><span class="token plain"> openai </span><span class="token keyword" style="color:hsl(286, 60%, 67%)">import</span><span class="token plain"> OpenAI</span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain"></span><span class="token comment" style="color:hsl(220, 10%, 40%)"># Configure MLflow</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">mlflow</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">set_tracking_uri</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">(</span><span class="token string" style="color:hsl(95, 38%, 62%)">"http://localhost:5000"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">mlflow</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">set_experiment</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">(</span><span class="token string" style="color:hsl(95, 38%, 62%)">"LlamaStack Demo"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain"></span><span class="token comment" style="color:hsl(220, 10%, 40%)"># Enable tracing and OpenAI autologging</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token 
plain">mlflow_tracing</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">enable</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">(</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">mlflow</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">openai</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">autolog</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">(</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain"></span><span class="token comment" style="color:hsl(220, 10%, 40%)"># Create an OpenAI-compatible client pointing to LlamaStack</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">client </span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token plain"> OpenAI</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">(</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">    base_url</span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token string" style="color:hsl(95, 38%, 62%)">"http://localhost:8321/v1"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">    api_key</span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token string" style="color:hsl(95, 38%, 62%)">"fake"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain"></span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">response </span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token plain"> client</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">responses</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">create</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">(</span><span class="token plain"></span><br></span><span class="token-line" 
style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">    model</span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token string" style="color:hsl(95, 38%, 62%)">"meta-llama/Llama-3.1-8B-Instruct"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">    </span><span class="token builtin" style="color:hsl(95, 38%, 62%)">input</span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token string" style="color:hsl(95, 38%, 62%)">"Give a one-sentence description of LlamaStack."</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain"></span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain"></span><span class="token keyword" style="color:hsl(286, 60%, 67%)">print</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">(</span><span class="token plain">response</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">output_text</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">)</span><br></span></code></pre></div></div>
<p>MLflow's <code>openai.autolog()</code> automatically captures every <code>client.responses.create()</code> call as a trace, including inputs, outputs, token usage, and latency — with zero additional instrumentation code.</p>
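<p>Autologged calls can also be grouped under a custom parent span — useful when one user request triggers several LLM calls. A small sketch building on the client above:</p>
<pre><code class="language-python">import mlflow

@mlflow.trace(name="qa_pipeline")
def ask(question: str) -> str:
    # The autologged responses.create span nests under this parent span.
    r = client.responses.create(
        model="meta-llama/Llama-3.1-8B-Instruct",
        input=question,
    )
    return r.output_text

print(ask("What is LlamaStack?"))
</code></pre>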
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="step-2-run-the-application">Step 2: Run the Application<a href="https://llamastack.github.io/blog/mlflow-observability#step-2-run-the-application" class="hash-link" aria-label="Direct link to Step 2: Run the Application" title="Direct link to Step 2: Run the Application" translate="no">​</a></h3>
<div class="language-bash codeBlockContainer_Ckt0 theme-code-block" style="--prism-background-color:hsl(220, 13%, 18%);--prism-color:hsl(220, 14%, 71%)"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-bash codeBlock_bY9V thin-scrollbar" style="background-color:hsl(220, 13%, 18%);color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token assign-left variable" style="color:hsl(207, 82%, 66%)">MLFLOW_TRACKING_URI</span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token plain">http://localhost:5000 </span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">\</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">python your_app.py</span><br></span></code></pre></div></div>
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="step-3-view-traces-in-mlflow">Step 3: View Traces in MLflow<a href="https://llamastack.github.io/blog/mlflow-observability#step-3-view-traces-in-mlflow" class="hash-link" aria-label="Direct link to Step 3: View Traces in MLflow" title="Direct link to Step 3: View Traces in MLflow" translate="no">​</a></h3>
<p>Open <a href="http://localhost:5000/" target="_blank" rel="noopener noreferrer" class="">http://localhost:5000</a>, navigate to your experiment ("LlamaStack Demo"), and click the <strong>Traces</strong> tab. You'll see each request with:</p>
<ul>
<li class="">Full input/output payloads</li>
<li class="">Token usage (input, output, total)</li>
<li class="">Latency breakdown</li>
<li class="">Span hierarchy</li>
</ul>
<p><img decoding="async" loading="lazy" alt="MLflow distributed traces for LlamaStack via SDK" src="https://llamastack.github.io/assets/images/mlflow-sdk-11700f9a7b21fd5010f6feaf34faf951.png" width="3016" height="1798" class="img_ev3q"></p>
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="pros-and-cons">Pros and Cons<a href="https://llamastack.github.io/blog/mlflow-observability#pros-and-cons" class="hash-link" aria-label="Direct link to Pros and Cons" title="Direct link to Pros and Cons" translate="no">​</a></h3>
<table><thead><tr><th>Aspect</th><th>Details</th></tr></thead><tbody><tr><td><strong>Simplicity</strong></td><td>Minimal setup — just a few lines of Python</td></tr><tr><td><strong>Rich data</strong></td><td>MLflow autolog captures detailed OpenAI-specific metadata</td></tr><tr><td><strong>Coupling</strong></td><td>Application code depends on <code>mlflow</code> package</td></tr><tr><td><strong>Flexibility</strong></td><td>Traces go directly to MLflow — no intermediary routing or fan-out</td></tr></tbody></table>
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="approach-2-tracing-via-otel-collector">Approach 2: Tracing via OTel Collector<a href="https://llamastack.github.io/blog/mlflow-observability#approach-2-tracing-via-otel-collector" class="hash-link" aria-label="Direct link to Approach 2: Tracing via OTel Collector" title="Direct link to Approach 2: Tracing via OTel Collector" translate="no">​</a></h2>
<p>This approach decouples instrumentation from the trace backend. The application uses <strong>OpenTelemetry auto-instrumentation</strong> to emit spans, which flow through an <strong>OTel Collector</strong> before being forwarded to MLflow's OTLP endpoint.</p>
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="step-1-install-opentelemetry-dependencies">Step 1: Install OpenTelemetry Dependencies<a href="https://llamastack.github.io/blog/mlflow-observability#step-1-install-opentelemetry-dependencies" class="hash-link" aria-label="Direct link to Step 1: Install OpenTelemetry Dependencies" title="Direct link to Step 1: Install OpenTelemetry Dependencies" translate="no">​</a></h3>
<div class="language-bash codeBlockContainer_Ckt0 theme-code-block" style="--prism-background-color:hsl(220, 13%, 18%);--prism-color:hsl(220, 14%, 71%)"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-bash codeBlock_bY9V thin-scrollbar" style="background-color:hsl(220, 13%, 18%);color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">pip </span><span class="token function" style="color:hsl(207, 82%, 66%)">install</span><span class="token plain"> opentelemetry-api </span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">\</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">            opentelemetry-sdk </span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">\</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">            opentelemetry-exporter-otlp </span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">\</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">            opentelemetry-instrumentation-openai</span><br></span></code></pre></div></div>
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="step-2-configure-the-otel-collector">Step 2: Configure the OTel Collector<a href="https://llamastack.github.io/blog/mlflow-observability#step-2-configure-the-otel-collector" class="hash-link" aria-label="Direct link to Step 2: Configure the OTel Collector" title="Direct link to Step 2: Configure the OTel Collector" translate="no">​</a></h3>
<p>Create an <code>otel-collector-config.yaml</code>:</p>
<div class="language-yaml codeBlockContainer_Ckt0 theme-code-block" style="--prism-background-color:hsl(220, 13%, 18%);--prism-color:hsl(220, 14%, 71%)"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-yaml codeBlock_bY9V thin-scrollbar" style="background-color:hsl(220, 13%, 18%);color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token key atrule" style="color:hsl(29, 54%, 61%)">receivers</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">  </span><span class="token key atrule" style="color:hsl(29, 54%, 61%)">otlp</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">    </span><span class="token key atrule" style="color:hsl(29, 54%, 61%)">protocols</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">      </span><span class="token key atrule" style="color:hsl(29, 54%, 61%)">http</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">        </span><span class="token key atrule" style="color:hsl(29, 54%, 61%)">endpoint</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">:</span><span class="token plain"> 0.0.0.0</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">:</span><span class="token number" style="color:hsl(29, 54%, 61%)">4318</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain"></span><span class="token key atrule" style="color:hsl(29, 54%, 61%)">exporters</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">  </span><span class="token key atrule" style="color:hsl(29, 54%, 61%)">otlphttp/mlflow</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">    </span><span class="token key atrule" style="color:hsl(29, 54%, 61%)">endpoint</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">:</span><span class="token plain"> http</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">:</span><span class="token plain">//host.docker.internal</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">:</span><span class="token 
number" style="color:hsl(29, 54%, 61%)">5000</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">    </span><span class="token key atrule" style="color:hsl(29, 54%, 61%)">traces_endpoint</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">:</span><span class="token plain"> /v1/traces</span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">    </span><span class="token key atrule" style="color:hsl(29, 54%, 61%)">tls</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">      </span><span class="token key atrule" style="color:hsl(29, 54%, 61%)">insecure</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">:</span><span class="token plain"> </span><span class="token boolean important" style="color:hsl(220, 14%, 71%);font-weight:bold">true</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">    </span><span class="token key atrule" style="color:hsl(29, 54%, 61%)">headers</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">      </span><span class="token key atrule" style="color:hsl(29, 54%, 61%)">x-mlflow-experiment-id</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">:</span><span class="token plain"> </span><span class="token string" style="color:hsl(95, 38%, 62%)">"1"</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain"></span><span class="token key atrule" style="color:hsl(29, 54%, 61%)">processors</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">  </span><span class="token key atrule" style="color:hsl(29, 54%, 61%)">batch</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">    </span><span class="token key atrule" style="color:hsl(29, 54%, 61%)">timeout</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">:</span><span class="token plain"> 5s</span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain"></span><span class="token key atrule" style="color:hsl(29, 54%, 61%)">service</span><span class="token punctuation" 
style="color:hsl(220, 14%, 71%)">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">  </span><span class="token key atrule" style="color:hsl(29, 54%, 61%)">pipelines</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">    </span><span class="token key atrule" style="color:hsl(29, 54%, 61%)">traces</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">      </span><span class="token key atrule" style="color:hsl(29, 54%, 61%)">receivers</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">:</span><span class="token plain"> </span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">[</span><span class="token plain">otlp</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">]</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">      </span><span class="token key atrule" style="color:hsl(29, 54%, 61%)">processors</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">:</span><span class="token plain"> </span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">[</span><span class="token plain">batch</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">]</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">      </span><span class="token key atrule" style="color:hsl(29, 54%, 61%)">exporters</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">:</span><span class="token plain"> </span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">[</span><span class="token plain">otlphttp/mlflow</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">]</span><br></span></code></pre></div></div>
<p>Key configuration points:</p>
<ul>
<li class=""><strong><code>receivers.otlp</code></strong>: Accepts OTLP data on port 4318 (HTTP)</li>
<li class=""><strong><code>exporters.otlphttp/mlflow</code></strong>: Forwards traces to MLflow's <code>/v1/traces</code> OTLP endpoint</li>
<li class=""><strong><code>x-mlflow-experiment-id</code></strong>: Determines which MLflow experiment receives the traces</li>
<li class=""><strong><code>host.docker.internal</code></strong>: Allows the containerized collector to reach the host machine's MLflow server</li>
</ul>
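<p>Before starting the collector, you can sanity-check the file — the collector binary ships a <code>validate</code> subcommand (same image as the run command in the next step):</p>
<pre><code class="language-bash">docker run --rm \
  -v $(pwd)/otel-collector-config.yaml:/etc/otelcol-contrib/config.yaml \
  otel/opentelemetry-collector-contrib:0.143.1 \
  validate --config /etc/otelcol-contrib/config.yaml
</code></pre>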
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="step-3-run-the-otel-collector">Step 3: Run the OTel Collector<a href="https://llamastack.github.io/blog/mlflow-observability#step-3-run-the-otel-collector" class="hash-link" aria-label="Direct link to Step 3: Run the OTel Collector" title="Direct link to Step 3: Run the OTel Collector" translate="no">​</a></h3>
<div class="language-bash codeBlockContainer_Ckt0 theme-code-block" style="--prism-background-color:hsl(220, 13%, 18%);--prism-color:hsl(220, 14%, 71%)"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-bash codeBlock_bY9V thin-scrollbar" style="background-color:hsl(220, 13%, 18%);color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token function" style="color:hsl(207, 82%, 66%)">docker</span><span class="token plain"> run </span><span class="token parameter variable" style="color:hsl(207, 82%, 66%)">-d</span><span class="token plain"> </span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">\</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">  </span><span class="token parameter variable" style="color:hsl(207, 82%, 66%)">--name</span><span class="token plain"> otel-collector </span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">\</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">  </span><span class="token parameter variable" style="color:hsl(207, 82%, 66%)">-p</span><span class="token plain"> </span><span class="token number" style="color:hsl(29, 54%, 61%)">4318</span><span class="token plain">:4318 </span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">\</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">  </span><span class="token parameter variable" style="color:hsl(207, 82%, 66%)">-v</span><span class="token plain"> </span><span class="token variable" style="color:hsl(207, 82%, 66%)">$(</span><span class="token variable builtin class-name" style="color:hsl(29, 54%, 61%)">pwd</span><span class="token variable" style="color:hsl(207, 82%, 66%)">)</span><span class="token plain">/otel-collector-config.yaml:/etc/otelcol-contrib/config.yaml </span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">\</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">  otel/opentelemetry-collector-contrib:0.143.1</span><br></span></code></pre></div></div>
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="step-4-run-the-application-with-otel-instrumentation">Step 4: Run the Application with OTel Instrumentation<a href="https://llamastack.github.io/blog/mlflow-observability#step-4-run-the-application-with-otel-instrumentation" class="hash-link" aria-label="Direct link to Step 4: Run the Application with OTel Instrumentation" title="Direct link to Step 4: Run the Application with OTel Instrumentation" translate="no">​</a></h3>
<p>The key difference here: we use <code>opentelemetry-instrument</code> to wrap the application, and we <strong>disable</strong> MLflow's built-in tracing to avoid double-writing:</p>
<div class="language-bash codeBlockContainer_Ckt0 theme-code-block" style="--prism-background-color:hsl(220, 13%, 18%);--prism-color:hsl(220, 14%, 71%)"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-bash codeBlock_bY9V thin-scrollbar" style="background-color:hsl(220, 13%, 18%);color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token assign-left variable" style="color:hsl(207, 82%, 66%)">MLFLOW_ENABLE_TRACING</span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token number" style="color:hsl(29, 54%, 61%)">0</span><span class="token plain"> </span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">\</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain"></span><span class="token assign-left variable" style="color:hsl(207, 82%, 66%)">OTEL_EXPORTER_OTLP_ENDPOINT</span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token plain">http://localhost:4318 </span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">\</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain"></span><span class="token assign-left variable" style="color:hsl(207, 82%, 66%)">OTEL_EXPORTER_OTLP_PROTOCOL</span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token plain">http/protobuf </span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">\</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain"></span><span class="token assign-left variable" style="color:hsl(207, 82%, 66%)">OTEL_SERVICE_NAME</span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token plain">llamastack-app </span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">\</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">opentelemetry-instrument python your_app.py</span><br></span></code></pre></div></div>
<p>Note that the application code itself does <strong>not</strong> need any MLflow imports for tracing. The OpenTelemetry auto-instrumentation handles span creation, and the collector handles routing.</p>
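<p>For reference, a minimal <code>your_app.py</code> under this approach can be plain OpenAI-client code — shown here with Chat Completions, since that's what the OTel OpenAI instrumentation currently covers (see the Responses API note below):</p>
<pre><code class="language-python"># your_app.py — no tracing imports; spans come from auto-instrumentation.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8321/v1", api_key="fake")

resp = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Give a one-sentence description of LlamaStack."}],
)
print(resp.choices[0].message.content)
</code></pre>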
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="step-5-verify-traces">Step 5: Verify Traces<a href="https://llamastack.github.io/blog/mlflow-observability#step-5-verify-traces" class="hash-link" aria-label="Direct link to Step 5: Verify Traces" title="Direct link to Step 5: Verify Traces" translate="no">​</a></h3>
<p>Check that traces are flowing into MLflow:</p>
<div class="language-bash codeBlockContainer_Ckt0 theme-code-block" style="--prism-background-color:hsl(220, 13%, 18%);--prism-color:hsl(220, 14%, 71%)"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-bash codeBlock_bY9V thin-scrollbar" style="background-color:hsl(220, 13%, 18%);color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token comment" style="color:hsl(220, 10%, 40%)"># Via CLI</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain"></span><span class="token assign-left variable" style="color:hsl(207, 82%, 66%)">MLFLOW_TRACKING_URI</span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token plain">http://localhost:5000 </span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">\</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">mlflow traces search --experiment-id </span><span class="token number" style="color:hsl(29, 54%, 61%)">1</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain"></span><span class="token comment" style="color:hsl(220, 10%, 40%)"># Or check the database directly</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">sqlite3 mlflow.db </span><span class="token string" style="color:hsl(95, 38%, 62%)">"SELECT count(*) FROM trace_info;"</span><br></span></code></pre></div></div>
<p>Then open the MLflow UI at <a href="http://localhost:5000/" target="_blank" rel="noopener noreferrer" class="">http://localhost:5000</a> to explore the traces visually.</p>
<p><img decoding="async" loading="lazy" alt="MLflow distributed traces for LlamaStack via OTEL" src="https://llamastack.github.io/assets/images/otel-mlflow-4a02b67705a534662a67b84beac56fdc.png" width="3006" height="1794" class="img_ev3q"></p>
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="pros-and-cons-1">Pros and Cons<a href="https://llamastack.github.io/blog/mlflow-observability#pros-and-cons-1" class="hash-link" aria-label="Direct link to Pros and Cons" title="Direct link to Pros and Cons" translate="no">​</a></h3>
<table><thead><tr><th>Aspect</th><th>Details</th></tr></thead><tbody><tr><td><strong>Decoupling</strong></td><td>App has no MLflow SDK dependency — only standard OTel</td></tr><tr><td><strong>Fan-out</strong></td><td>Collector can export to multiple backends simultaneously (e.g., Jaeger + MLflow)</td></tr><tr><td><strong>Production-ready</strong></td><td>OTel Collector provides buffering, retry, and batching</td></tr><tr><td><strong>Complexity</strong></td><td>Requires running and configuring an additional service (the collector)</td></tr><tr><td><strong>Data richness</strong></td><td>OTel OpenAI instrumentation may capture different fields than MLflow autolog</td></tr></tbody></table>
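<p>The fan-out row is worth a concrete illustration. Here is a hedged sketch of the same pipeline exporting to both MLflow and a Jaeger instance — the <code>otlp/jaeger</code> exporter name and the <code>jaeger:4317</code> address are assumptions for this example:</p>
<pre><code class="language-yaml">exporters:
  otlphttp/mlflow:
    traces_endpoint: http://host.docker.internal:5000/v1/traces
  otlp/jaeger:                 # hypothetical second backend
    endpoint: jaeger:4317      # OTLP gRPC port on a host named "jaeger"
    tls:
      insecure: true

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlphttp/mlflow, otlp/jaeger]
</code></pre>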
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="responses-api-support">Responses API Support<a href="https://llamastack.github.io/blog/mlflow-observability#responses-api-support" class="hash-link" aria-label="Direct link to Responses API Support" title="Direct link to Responses API Support" translate="no">​</a></h3>
<p><strong>Note:</strong> OpenTelemetry auto-instrumentation for the Responses API is not yet available upstream. Progress is tracked in <a href="https://github.com/llamastack/llama-stack/issues/5192" target="_blank" rel="noopener noreferrer" class="">llamastack/llama-stack#5192</a>. In the meantime, Approach 1 (MLflow SDK) fully supports tracing Responses API calls via <code>mlflow.openai.autolog()</code>.</p>
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="approach-comparison">Approach Comparison<a href="https://llamastack.github.io/blog/mlflow-observability#approach-comparison" class="hash-link" aria-label="Direct link to Approach Comparison" title="Direct link to Approach Comparison" translate="no">​</a></h2>
<!-- -->
<!-- -->
<table><thead><tr><th>Criteria</th><th>MLflow SDK</th><th>OTel Collector</th></tr></thead><tbody><tr><td><strong>Setup complexity</strong></td><td>Low — a few lines of code</td><td>Medium — collector config + container</td></tr><tr><td><strong>Code coupling</strong></td><td>Coupled to <code>mlflow</code> package</td><td>No MLflow dependency in app code</td></tr><tr><td><strong>Multi-backend support</strong></td><td>MLflow only</td><td>Fan-out to any OTLP-compatible backend</td></tr><tr><td><strong>Buffering &amp; retry</strong></td><td>Basic (in-process)</td><td>Production-grade (collector handles it)</td></tr><tr><td><strong>Best for</strong></td><td>Development, prototyping, quick experiments</td><td>Production deployments, multi-tool observability stacks</td></tr><tr><td><strong>Instrumentation</strong></td><td><code>mlflow.openai.autolog()</code></td><td><code>opentelemetry-instrument</code> + OTel OpenAI plugin</td></tr></tbody></table>
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="bonus-tracing-the-llamastack-server-itself">Bonus: Tracing the LlamaStack Server Itself<a href="https://llamastack.github.io/blog/mlflow-observability#bonus-tracing-the-llamastack-server-itself" class="hash-link" aria-label="Direct link to Bonus: Tracing the LlamaStack Server Itself" title="Direct link to Bonus: Tracing the LlamaStack Server Itself" translate="no">​</a></h2>
<p>Both approaches above trace the <strong>client side</strong> — the application making calls to LlamaStack. But you can also trace the <strong>LlamaStack server</strong> using the OTel approach:</p>
<div class="language-bash codeBlockContainer_Ckt0 theme-code-block" style="--prism-background-color:hsl(220, 13%, 18%);--prism-color:hsl(220, 14%, 71%)"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-bash codeBlock_bY9V thin-scrollbar" style="background-color:hsl(220, 13%, 18%);color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token assign-left variable" style="color:hsl(207, 82%, 66%)">OTEL_EXPORTER_OTLP_TRACES_ENDPOINT</span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token plain">http://localhost:5000/v1/traces </span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">\</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain"></span><span class="token assign-left variable" style="color:hsl(207, 82%, 66%)">OTEL_EXPORTER_OTLP_TRACES_PROTOCOL</span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token plain">http/protobuf </span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">\</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain"></span><span class="token assign-left variable" style="color:hsl(207, 82%, 66%)">OTEL_EXPORTER_OTLP_TRACES_HEADERS</span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token string" style="color:hsl(95, 38%, 62%)">"x-mlflow-experiment-id=1"</span><span class="token plain"> </span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">\</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain"></span><span class="token assign-left variable" style="color:hsl(207, 82%, 66%)">OTEL_SERVICE_NAME</span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token plain">llama-stack-server </span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">\</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">opentelemetry-instrument llama stack run starter</span><br></span></code></pre></div></div>
<p>This gives you end-to-end visibility: client-side spans showing the request lifecycle, and server-side spans showing internal LlamaStack processing.</p>
<blockquote>
<p><strong>Important:</strong> MLflow SDK tracing only instruments the <strong>client side</strong>. The LlamaStack server itself is not instrumented by MLflow, so server-side spans (inference routing, tool execution, etc.) are only visible through OpenTelemetry auto-instrumentation (Approach 2).</p>
</blockquote>
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="common-gotchas">Common Gotchas<a href="https://llamastack.github.io/blog/mlflow-observability#common-gotchas" class="hash-link" aria-label="Direct link to Common Gotchas" title="Direct link to Common Gotchas" translate="no">​</a></h2>
<ol>
<li>
<p><strong>MLflow OTLP endpoint path</strong>: Use <code>/v1/traces</code>, not <code>/api/2.0/otlp/v1/traces</code> (the latter returns 404 in MLflow 3.10+).</p>
</li>
<li>
<p><strong>Double-writing</strong>: If you enable both MLflow autolog and OTel instrumentation, traces may be written twice. Set <code>MLFLOW_ENABLE_TRACING=0</code> when using the OTel path.</p>
</li>
<li>
<p><strong>Missing spans with OTel</strong>: The <code>opentelemetry-instrument</code> wrapper is required — simply setting <code>OTEL_*</code> environment variables without it won't produce any spans because no instrumentation is active. (A programmatic alternative is sketched after this list.)</p>
</li>
<li>
<p><strong>Docker networking</strong>: When running the OTel Collector in Docker, use <code>host.docker.internal</code> to reach services on the host machine.</p>
</li>
<li>
<p><strong>Time range in UI</strong>: If the MLflow UI looks empty, check the time range filter — it may default to a narrow window that excludes your traces.</p>
</li>
</ol>
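<p>On gotcha 3: if you'd rather not use the wrapper, the instrumentation can be activated programmatically at startup — a sketch assuming the <code>opentelemetry-instrumentation-openai-v2</code> package:</p>
<pre><code class="language-python"># Programmatic alternative to the opentelemetry-instrument wrapper.
# You still need an SDK pipeline (TracerProvider + OTLP exporter), e.g. as in
# the direct-export sketch earlier in this post.
from opentelemetry.instrumentation.openai_v2 import OpenAIInstrumentor

OpenAIInstrumentor().instrument()  # call once, before creating OpenAI clients
</code></pre>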
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="conclusion">Conclusion<a href="https://llamastack.github.io/blog/mlflow-observability#conclusion" class="hash-link" aria-label="Direct link to Conclusion" title="Direct link to Conclusion" translate="no">​</a></h2>
<p>Both approaches get your LlamaStack traces into MLflow, but they serve different needs:</p>
<ul>
<li class=""><strong>Start with the MLflow SDK</strong> when you want quick, low-friction observability during development. A few lines of code and you're tracing.</li>
<li class=""><strong>Move to the OTel Collector</strong> when you need production-grade telemetry infrastructure — decoupled from your application, with the ability to fan out to multiple observability backends.</li>
</ul>
<p>The good news: since LlamaStack exposes an OpenAI-compatible API, both paths leverage existing, well-maintained instrumentation libraries. You're not writing custom tracing code — you're plugging into an ecosystem.</p>
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="quick-start-with-containers">Quick Start With Containers<a href="https://llamastack.github.io/blog/mlflow-observability#quick-start-with-containers" class="hash-link" aria-label="Direct link to Quick Start With Containers" title="Direct link to Quick Start With Containers" translate="no">​</a></h2>
<p>A pending PR, <a href="https://github.com/llamastack/llama-stack/pull/5409" target="_blank" rel="noopener noreferrer" class="">feat: add MLflow support for LlamaStack</a>, will let you run LlamaStack alongside MLflow, Grafana, and Prometheus in containers with a single command. Once that PR lands, use the <a href="https://github.com/llamastack/llama-stack/tree/main/scripts/telemetry" target="_blank" rel="noopener noreferrer" class="">telemetry scripts</a> in the LlamaStack repository for the full walkthrough.</p>
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="references">References<a href="https://llamastack.github.io/blog/mlflow-observability#references" class="hash-link" aria-label="Direct link to References" title="Direct link to References" translate="no">​</a></h2>
<ul>
<li class=""><a href="https://mlflow.org/docs/latest/llms/tracing/index.html" target="_blank" rel="noopener noreferrer" class="">MLflow Tracing Documentation</a></li>
<li class=""><a href="https://opentelemetry.io/docs/collector/" target="_blank" rel="noopener noreferrer" class="">OpenTelemetry Collector Documentation</a></li>
<li class=""><a href="https://github.com/open-telemetry/opentelemetry-python-contrib" target="_blank" rel="noopener noreferrer" class="">OpenTelemetry OpenAI Instrumentation</a></li>
</ul>]]></content>
        <author>
            <name>Guangya Liu</name>
            <uri>https://github.com/gyliu513</uri>
        </author>
        <category label="mlflow" term="mlflow"/>
        <category label="observability" term="observability"/>
        <category label="opentelemetry" term="opentelemetry"/>
        <category label="metrics" term="metrics"/>
        <category label="tracing" term="tracing"/>
        <category label="monitoring" term="monitoring"/>
    </entry>
    <entry>
        <title type="html"><![CDATA[Llama Stack Observability: Metrics, Traces, and Dashboards with OpenTelemetry]]></title>
        <id>https://llamastack.github.io/blog/observability-for-llama-stack</id>
        <link href="https://llamastack.github.io/blog/observability-for-llama-stack"/>
        <updated>2026-03-30T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[Running an LLM application in production is nothing like running a traditional web service. Responses are non-deterministic. Latency swings wildly with model size and token count. And failures are often silent — a tool call that returns garbage still comes back as a 200 OK. You can stare at your HTTP dashboard all day and have no idea that half your users are getting bad answers.]]></summary>
        <content type="html"><![CDATA[<p>Running an LLM application in production is nothing like running a traditional web service. Responses are non-deterministic. Latency swings wildly with model size and token count. And failures are often silent — a tool call that returns garbage still comes back as a 200 OK. You can stare at your HTTP dashboard all day and have no idea that half your users are getting bad answers.</p>
<p>We recently shipped built-in observability for Llama Stack, powered by <a href="https://opentelemetry.io/" target="_blank" rel="noopener noreferrer" class="">OpenTelemetry</a>. Three environment variables, zero code changes, and you get metrics and traces from every layer — HTTP requests, inference calls, tool invocations, vector store operations, all the way down.</p>
<p>This post explains the architecture behind it, walks through a hands-on tutorial, and shows what you can actually see once it's running.</p>
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="why-observability-matters-for-llm-applications">Why Observability Matters for LLM Applications<a href="https://llamastack.github.io/blog/observability-for-llama-stack#why-observability-matters-for-llm-applications" class="hash-link" aria-label="Direct link to Why Observability Matters for LLM Applications" title="Direct link to Why Observability Matters for LLM Applications" translate="no">​</a></h2>
<p>If you've operated traditional services, you know the drill: uptime checks, error rates, latency percentiles. LLM applications need all of that, plus a whole category of signals that don't exist in conventional backends.</p>
<p><strong>Latency is multi-dimensional.</strong> A single <code>/v1/responses</code> call might fan out to an inference provider, three tool calls, and two vector store queries. Knowing the overall P99 is 4 seconds doesn't help you — you need to know which leg is slow.</p>
<p><strong>Token economics drive cost.</strong> Without tracking tokens-per-second and usage patterns across models and providers, capacity planning is guesswork.</p>
<p><strong>Time-to-first-token (TTFT) defines user experience.</strong> A streaming response with a 5-second TTFT feels broken to the user, even if total latency is fine.</p>
<p><strong>Silent failures are common.</strong> A tool invocation that times out, a vector search that returns zero results, a safety shield that blocks unexpectedly — none of these produce HTTP errors, but all degrade quality. You won't find them in your access logs.</p>
<p><strong>Provider comparison requires data.</strong> When you run multiple inference backends (vLLM, Ollama, OpenAI), you need apples-to-apples latency and reliability numbers, not vibes.</p>
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="how-we-instrumented-llama-stack">How We Instrumented Llama Stack<a href="https://llamastack.github.io/blog/observability-for-llama-stack#how-we-instrumented-llama-stack" class="hash-link" aria-label="Direct link to How We Instrumented Llama Stack" title="Direct link to How We Instrumented Llama Stack" translate="no">​</a></h2>
<p>We chose <a href="https://opentelemetry.io/" target="_blank" rel="noopener noreferrer" class="">OpenTelemetry</a> (OTel) — the CNCF's vendor-neutral standard for metrics, traces, and logs. The practical upside: you export to Prometheus, Grafana, Jaeger, Datadog, or any OTLP-compatible backend, and you can switch without touching application code.</p>
<p>The instrumentation has two layers that work together. Both feed into the OpenTelemetry SDK, which batches and exports signals to the Collector via OTLP.</p>
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="auto-instrumentation-the-infrastructure-view">Auto instrumentation: the infrastructure view<a href="https://llamastack.github.io/blog/observability-for-llama-stack#auto-instrumentation-the-infrastructure-view" class="hash-link" aria-label="Direct link to Auto instrumentation: the infrastructure view" title="Direct link to Auto instrumentation: the infrastructure view" translate="no">​</a></h3>
<p>Launch Llama Stack with the <code>opentelemetry-instrument</code> CLI wrapper and you get — with zero code changes:</p>
<ul>
<li class="">Inbound HTTP spans and metrics from FastAPI (every API request)</li>
<li class="">Outbound HTTP spans and metrics from httpx (calls to inference providers)</li>
<li class="">Database query spans from SQLAlchemy and asyncpg</li>
<li class="">GenAI call spans from OTel ecosystem packages for OpenAI, Bedrock, Vertex AI, etc. — model name, token counts, finish reasons, all captured at the SDK level</li>
</ul>
<p>This covers the "infrastructure view": request flow, provider latency, GenAI call details, database performance.</p>
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="manual-instrumentation-the-application-view">Manual instrumentation: the application view<a href="https://llamastack.github.io/blog/observability-for-llama-stack#manual-instrumentation-the-application-view" class="hash-link" aria-label="Direct link to Manual instrumentation: the application view" title="Direct link to Manual instrumentation: the application view" translate="no">​</a></h3>
<p>Auto instrumentation doesn't know about Llama Stack's domain concepts. So we added manual instrumentation directly in the routers and middleware to capture:</p>
<ul>
<li class=""><strong>API request metrics</strong> — total count, duration histogram, concurrent request gauge</li>
<li class=""><strong>Inference metrics</strong> — end-to-end duration, TTFT, tokens-per-second</li>
<li class=""><strong>Vector IO metrics</strong> — insert, query, and delete counts with duration</li>
<li class=""><strong>Tool runtime metrics</strong> — invocation count and duration by tool name and status</li>
<li class=""><strong>Safety spans</strong> — shield evaluation traces with attribute context</li>
</ul>
<p>The two layers are complementary. Auto instrumentation tells you what's happening at the network and SDK level; manual instrumentation tells you what's happening at the application level. As a user, you don't need to care about any of this — just launch with <code>opentelemetry-instrument</code> and everything lights up.</p>
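<p>To make the pattern concrete, here is a simplified sketch of what the manual layer looks like using the OpenTelemetry metrics API. This is illustrative rather than the actual Llama Stack source; the metric and attribute names are assumptions modeled on the list above:</p>
<pre class="prism-code language-python"><code>import time

from opentelemetry import metrics

# Instruments are created once at startup and reused for every request.
meter = metrics.get_meter("llama_stack")
tool_invocations = meter.create_counter(
    "llama_stack.tool_runtime.invocations",
    description="Tool invocations by name and status",
)
tool_duration = meter.create_histogram(
    "llama_stack.tool_runtime.duration",
    unit="s",
    description="Tool invocation duration",
)

def invoke_tool(tool, **kwargs):
    """Run a tool and record count and duration with status attributes."""
    start = time.perf_counter()
    status = "success"
    try:
        return tool(**kwargs)
    except Exception:
        status = "error"
        raise
    finally:
        attrs = {"tool_name": getattr(tool, "__name__", "unknown"), "status": status}
        tool_invocations.add(1, attrs)
        tool_duration.record(time.perf_counter() - start, attrs)
</code></pre>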
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="the-opentelemetry-collector">The OpenTelemetry Collector<a href="https://llamastack.github.io/blog/observability-for-llama-stack#the-opentelemetry-collector" class="hash-link" aria-label="Direct link to The OpenTelemetry Collector" title="Direct link to The OpenTelemetry Collector" translate="no">​</a></h2>
<p>Between Llama Stack and your observability backends sits the <a href="https://opentelemetry.io/docs/collector/" target="_blank" rel="noopener noreferrer" class="">OpenTelemetry Collector</a>. It receives OTLP data, processes it, and fans out to one or more destinations.</p>
<p>The pipeline has three stages:</p>
<p><strong>Receivers</strong> define how data enters. Llama Stack pushes to the OTLP receiver on port 4318 (HTTP) or 4317 (gRPC). You can run additional receivers in parallel — for example, a Prometheus scrape receiver for other services in your infrastructure.</p>
<p><strong>Processors</strong> transform data in flight. The ones that matter for production: <code>batch</code> (groups telemetry for efficient network transfer), <code>memory_limiter</code> (drops data under memory pressure instead of OOM-ing), <code>attributes</code> (injects labels like <code>environment=production</code>), and <code>filter</code> (drops noise like health check spans). They run in the order you define them — a typical chain is <code>memory_limiter → batch → attributes</code>.</p>
<p><strong>Exporters</strong> send data to backends. <code>prometheusremotewrite</code> for Prometheus-compatible stores, <code>otlp</code> for Jaeger/Tempo/Datadog, <code>debug</code> for stdout during development. A single Collector can export metrics to Prometheus for dashboarding AND to Datadog for alerting simultaneously.</p>
<p>The key benefit: Llama Stack only speaks OTLP. The Collector handles format conversion, retries, and routing. Swap backends without changing a line of application code.</p>
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="end-to-end-data-flow">End-to-End Data Flow<a href="https://llamastack.github.io/blog/observability-for-llama-stack#end-to-end-data-flow" class="hash-link" aria-label="Direct link to End-to-End Data Flow" title="Direct link to End-to-End Data Flow" translate="no">​</a></h2>
<p>Here's what happens when a request comes in: the instrumentation layers record spans and metrics, the OpenTelemetry SDK buffers and batches them, and the Collector receives the OTLP export and fans it out to your backends.</p>
<p>Two things worth noting. First, recording is non-blocking — metrics write to an in-memory buffer, so they add negligible latency to the request path. Second, export is batched — the SDK flushes every 60 seconds by default, which means dashboards have up to a minute of delay, but request handling is never blocked by network I/O to the Collector.</p>
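<p>The same batching behavior can be reproduced if you wire up the SDK by hand instead of using <code>opentelemetry-instrument</code>. A hedged sketch, assuming a local Collector on port 4318 (the interval mirrors the 60-second default described above):</p>
<pre class="prism-code language-python"><code>from opentelemetry import metrics
from opentelemetry.exporter.otlp.proto.http.metric_exporter import OTLPMetricExporter
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader

# Metrics accumulate in memory and are flushed on a timer, so recording
# a data point never blocks the request path on network I/O.
reader = PeriodicExportingMetricReader(
    OTLPMetricExporter(endpoint="http://localhost:4318/v1/metrics"),
    export_interval_millis=60_000,  # same knob as OTEL_METRIC_EXPORT_INTERVAL
)
metrics.set_meter_provider(MeterProvider(metric_readers=[reader]))
</code></pre>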
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="hands-on-tutorial">Hands-On Tutorial<a href="https://llamastack.github.io/blog/observability-for-llama-stack#hands-on-tutorial" class="hash-link" aria-label="Direct link to Hands-On Tutorial" title="Direct link to Hands-On Tutorial" translate="no">​</a></h2>
<p>Let's set everything up. By the end you'll have distributed tracing in Jaeger, metrics in Prometheus, and pre-built dashboards in Grafana.</p>
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="step-0-prerequisites">Step 0: Prerequisites<a href="https://llamastack.github.io/blog/observability-for-llama-stack#step-0-prerequisites" class="hash-link" aria-label="Direct link to Step 0: Prerequisites" title="Direct link to Step 0: Prerequisites" translate="no">​</a></h3>
<p>You'll need:</p>
<ul>
<li class=""><strong>Docker</strong> or <strong>Podman</strong> for running the observability stack</li>
<li class="">A working <strong>Llama Stack</strong> installation with <code>uv</code></li>
</ul>
<p>Clone the repo if you haven't:</p>
<div class="language-bash codeBlockContainer_Ckt0 theme-code-block" style="--prism-background-color:hsl(220, 13%, 18%);--prism-color:hsl(220, 14%, 71%)"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-bash codeBlock_bY9V thin-scrollbar" style="background-color:hsl(220, 13%, 18%);color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token function" style="color:hsl(207, 82%, 66%)">git</span><span class="token plain"> clone https://github.com/llamastack/llama-stack.git</span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain"></span><span class="token builtin class-name" style="color:hsl(29, 54%, 61%)">cd</span><span class="token plain"> llama-stack</span><br></span></code></pre></div></div>
<p>The telemetry configs live in <a href="https://github.com/llamastack/llama-stack/tree/main/scripts/telemetry" target="_blank" rel="noopener noreferrer" class=""><code>scripts/telemetry/</code></a>:</p>
<table><thead><tr><th>File</th><th>What it does</th></tr></thead><tbody><tr><td><code>setup_telemetry.sh</code></td><td>Starts all telemetry services</td></tr><tr><td><code>otel-collector-config.yaml</code></td><td>Collector pipeline config</td></tr><tr><td><code>prometheus.yml</code></td><td>Prometheus scrape config</td></tr><tr><td><code>grafana-datasources.yaml</code></td><td>Grafana datasource provisioning</td></tr><tr><td><code>grafana-dashboards.yaml</code></td><td>Grafana dashboard provisioning</td></tr><tr><td><code>llama-stack-dashboard.json</code></td><td>Pre-built Grafana dashboard</td></tr></tbody></table>
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="step-1-deploy-the-observability-stack">Step 1: Deploy the Observability Stack<a href="https://llamastack.github.io/blog/observability-for-llama-stack#step-1-deploy-the-observability-stack" class="hash-link" aria-label="Direct link to Step 1: Deploy the Observability Stack" title="Direct link to Step 1: Deploy the Observability Stack" translate="no">​</a></h3>
<p>One script brings up Jaeger, the OTel Collector, Prometheus, and Grafana:</p>
<div class="language-bash codeBlockContainer_Ckt0 theme-code-block" style="--prism-background-color:hsl(220, 13%, 18%);--prism-color:hsl(220, 14%, 71%)"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-bash codeBlock_bY9V thin-scrollbar" style="background-color:hsl(220, 13%, 18%);color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token comment" style="color:hsl(220, 10%, 40%)"># Auto-detect container runtime (podman or docker)</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">./scripts/telemetry/setup_telemetry.sh</span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain"></span><span class="token comment" style="color:hsl(220, 10%, 40%)"># Or specify explicitly</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">./scripts/telemetry/setup_telemetry.sh </span><span class="token parameter variable" style="color:hsl(207, 82%, 66%)">--container</span><span class="token plain"> </span><span class="token function" style="color:hsl(207, 82%, 66%)">docker</span><br></span></code></pre></div></div>
<p>This creates a <code>llama-telemetry</code> container network, starts all four services, and provisions Grafana with a pre-built dashboard.</p>
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="step-2-install-opentelemetry-packages">Step 2: Install OpenTelemetry Packages<a href="https://llamastack.github.io/blog/observability-for-llama-stack#step-2-install-opentelemetry-packages" class="hash-link" aria-label="Direct link to Step 2: Install OpenTelemetry Packages" title="Direct link to Step 2: Install OpenTelemetry Packages" translate="no">​</a></h3>
<div class="language-bash codeBlockContainer_Ckt0 theme-code-block" style="--prism-background-color:hsl(220, 13%, 18%);--prism-color:hsl(220, 14%, 71%)"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-bash codeBlock_bY9V thin-scrollbar" style="background-color:hsl(220, 13%, 18%);color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">uv pip </span><span class="token function" style="color:hsl(207, 82%, 66%)">install</span><span class="token plain"> opentelemetry-distro opentelemetry-exporter-otlp</span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">uv run opentelemetry-bootstrap </span><span class="token parameter variable" style="color:hsl(207, 82%, 66%)">-a</span><span class="token plain"> requirements </span><span class="token operator" style="color:hsl(207, 82%, 66%)">|</span><span class="token plain"> uv pip </span><span class="token function" style="color:hsl(207, 82%, 66%)">install</span><span class="token plain"> </span><span class="token parameter variable" style="color:hsl(207, 82%, 66%)">--requirement</span><span class="token plain"> -</span><br></span></code></pre></div></div>
<p><code>opentelemetry-bootstrap</code> detects your installed libraries (FastAPI, httpx, SQLAlchemy, OpenAI SDK, etc.) and installs the matching instrumentation packages automatically.</p>
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="step-3-launch-the-server">Step 3: Launch the Server<a href="https://llamastack.github.io/blog/observability-for-llama-stack#step-3-launch-the-server" class="hash-link" aria-label="Direct link to Step 3: Launch the Server" title="Direct link to Step 3: Launch the Server" translate="no">​</a></h3>
<p>Set three environment variables and wrap the launch command with <code>opentelemetry-instrument</code>:</p>
<div class="language-bash codeBlockContainer_Ckt0 theme-code-block" style="--prism-background-color:hsl(220, 13%, 18%);--prism-color:hsl(220, 14%, 71%)"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-bash codeBlock_bY9V thin-scrollbar" style="background-color:hsl(220, 13%, 18%);color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token builtin class-name" style="color:hsl(29, 54%, 61%)">export</span><span class="token plain"> </span><span class="token assign-left variable" style="color:hsl(207, 82%, 66%)">OTEL_EXPORTER_OTLP_ENDPOINT</span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token plain">http://localhost:4318</span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain"></span><span class="token builtin class-name" style="color:hsl(29, 54%, 61%)">export</span><span class="token plain"> </span><span class="token assign-left variable" style="color:hsl(207, 82%, 66%)">OTEL_EXPORTER_OTLP_PROTOCOL</span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token plain">http/protobuf</span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain"></span><span class="token builtin class-name" style="color:hsl(29, 54%, 61%)">export</span><span class="token plain"> </span><span class="token assign-left variable" style="color:hsl(207, 82%, 66%)">OTEL_SERVICE_NAME</span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token plain">llama-stack-server</span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">uv run opentelemetry-instrument llama stack run starter</span><br></span></code></pre></div></div>
<p>That's it. When <code>OTEL_EXPORTER_OTLP_ENDPOINT</code> is set, both auto and manual instrumentation activate. When it's not set, metrics are recorded in memory but never exported — no overhead, no errors.</p>
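<p>The activation logic is nothing magical; conceptually it is just a guard on the endpoint variable. An illustrative sketch of the behavior (not the actual Llama Stack code):</p>
<pre class="prism-code language-python"><code>import os

def telemetry_export_enabled() -> bool:
    """Export only happens when a collector endpoint is configured."""
    return bool(os.environ.get("OTEL_EXPORTER_OTLP_ENDPOINT"))

# Instruments always record into the in-memory SDK; without an endpoint
# the data is simply never shipped anywhere.
if telemetry_export_enabled():
    print("Exporting OTLP to", os.environ["OTEL_EXPORTER_OTLP_ENDPOINT"])
else:
    print("No OTLP endpoint set; metrics stay in memory")
</code></pre>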
<table><thead><tr><th>Variable</th><th>Purpose</th><th>Example</th></tr></thead><tbody><tr><td><code>OTEL_EXPORTER_OTLP_ENDPOINT</code></td><td>Collector endpoint</td><td><code>http://localhost:4318</code></td></tr><tr><td><code>OTEL_EXPORTER_OTLP_PROTOCOL</code></td><td>Transport protocol</td><td><code>http/protobuf</code></td></tr><tr><td><code>OTEL_SERVICE_NAME</code></td><td>Service name in telemetry</td><td><code>llama-stack-server</code></td></tr><tr><td><code>OTEL_METRIC_EXPORT_INTERVAL</code></td><td>Export interval (ms)</td><td><code>60000</code> (default)</td></tr></tbody></table>
<blockquote>
<p><strong>Tip</strong>: If you see duplicate database traces, set <code>OTEL_PYTHON_DISABLED_INSTRUMENTATIONS="sqlite3,asyncpg"</code> to disable overlapping instrumentors.</p>
</blockquote>
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="step-4-launch-your-client">Step 4: Launch Your Client<a href="https://llamastack.github.io/blog/observability-for-llama-stack#step-4-launch-your-client" class="hash-link" aria-label="Direct link to Step 4: Launch Your Client" title="Direct link to Step 4: Launch Your Client" translate="no">​</a></h3>
<p>To get end-to-end distributed tracing, launch your client the same way:</p>
<div class="language-bash codeBlockContainer_Ckt0 theme-code-block" style="--prism-background-color:hsl(220, 13%, 18%);--prism-color:hsl(220, 14%, 71%)"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-bash codeBlock_bY9V thin-scrollbar" style="background-color:hsl(220, 13%, 18%);color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token builtin class-name" style="color:hsl(29, 54%, 61%)">export</span><span class="token plain"> </span><span class="token assign-left variable" style="color:hsl(207, 82%, 66%)">OTEL_EXPORTER_OTLP_ENDPOINT</span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token plain">http://localhost:4318</span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain"></span><span class="token builtin class-name" style="color:hsl(29, 54%, 61%)">export</span><span class="token plain"> </span><span class="token assign-left variable" style="color:hsl(207, 82%, 66%)">OTEL_EXPORTER_OTLP_PROTOCOL</span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token plain">http/protobuf</span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain"></span><span class="token builtin class-name" style="color:hsl(29, 54%, 61%)">export</span><span class="token plain"> </span><span class="token assign-left variable" style="color:hsl(207, 82%, 66%)">OTEL_SERVICE_NAME</span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token plain">my-llama-stack-app</span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">opentelemetry-instrument python my_app.py</span><br></span></code></pre></div></div>
<p>A minimal example:</p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-background-color:hsl(220, 13%, 18%);--prism-color:hsl(220, 14%, 71%)"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="background-color:hsl(220, 13%, 18%);color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token keyword" style="color:hsl(286, 60%, 67%)">from</span><span class="token plain"> openai </span><span class="token keyword" style="color:hsl(286, 60%, 67%)">import</span><span class="token plain"> OpenAI</span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">client </span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token plain"> OpenAI</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">(</span><span class="token plain">api_key</span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token string" style="color:hsl(95, 38%, 62%)">"fake"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"> base_url</span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token string" style="color:hsl(95, 38%, 62%)">"http://localhost:8321/v1/"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">response </span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token plain"> client</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">chat</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">completions</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">create</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">(</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">    model</span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token string" style="color:hsl(95, 38%, 62%)">"openai/gpt-4o-mini"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">    messages</span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">[</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">{</span><span class="token string" style="color:hsl(95, 38%, 62%)">"role"</span><span class="token punctuation" 
style="color:hsl(220, 14%, 71%)">:</span><span class="token plain"> </span><span class="token string" style="color:hsl(95, 38%, 62%)">"user"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"> </span><span class="token string" style="color:hsl(95, 38%, 62%)">"content"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">:</span><span class="token plain"> </span><span class="token string" style="color:hsl(95, 38%, 62%)">"Hello, how are you?"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">}</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">]</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain"></span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain"></span><span class="token keyword" style="color:hsl(286, 60%, 67%)">print</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">(</span><span class="token plain">response</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">choices</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">[</span><span class="token number" style="color:hsl(29, 54%, 61%)">0</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">]</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">message</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">content</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">)</span><br></span></code></pre></div></div>
<p>With <code>opentelemetry-instrument</code>, this client automatically generates GenAI spans (model, token counts, finish reasons) and HTTP spans, all correlated with server-side traces via W3C trace context propagation.</p>
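<p>The correlation relies on the W3C <code>traceparent</code> header, which the instrumented HTTP client injects automatically. If you ever need to propagate context across a custom transport, the propagation API looks like this (a sketch; with <code>opentelemetry-instrument</code> you do not need to do this yourself):</p>
<pre class="prism-code language-python"><code>from opentelemetry import trace
from opentelemetry.propagate import inject

tracer = trace.get_tracer("my-llama-stack-app")

with tracer.start_as_current_span("call-llama-stack"):
    headers = {}
    inject(headers)  # adds the W3C 'traceparent' header for the active span
    # Attach these headers to your outbound request so the server-side
    # spans join the same trace.
    print(headers)
</code></pre>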
<p>By default, message content (prompts, outputs, tool arguments) is <strong>not</strong> captured for privacy. To enable content capture for debugging:</p>
<div class="language-bash codeBlockContainer_Ckt0 theme-code-block" style="--prism-background-color:hsl(220, 13%, 18%);--prism-color:hsl(220, 14%, 71%)"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-bash codeBlock_bY9V thin-scrollbar" style="background-color:hsl(220, 13%, 18%);color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token assign-left variable" style="color:hsl(207, 82%, 66%)">OTEL_INSTRUMENTATION_GENAI_CAPTURE_MESSAGE_CONTENT</span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token plain">true </span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">\</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">opentelemetry-instrument python my_app.py</span><br></span></code></pre></div></div>
<p>Captured content appears as log events (<code>gen_ai.user.message</code>, <code>gen_ai.choice</code>) correlated with trace spans via <code>trace_id</code>/<code>span_id</code>. The spans themselves carry structured metadata (model, token usage, latency) but not the raw text.</p>
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="step-5-explore-the-data">Step 5: Explore the Data<a href="https://llamastack.github.io/blog/observability-for-llama-stack#step-5-explore-the-data" class="hash-link" aria-label="Direct link to Step 5: Explore the Data" title="Direct link to Step 5: Explore the Data" translate="no">​</a></h3>
<p>Once traffic is flowing:</p>
<table><thead><tr><th>Service</th><th>URL</th><th>Credentials</th></tr></thead><tbody><tr><td><strong>Jaeger</strong> (traces)</td><td><a href="http://localhost:16686/" target="_blank" rel="noopener noreferrer" class="">http://localhost:16686</a></td><td>N/A</td></tr><tr><td><strong>Prometheus</strong> (metrics)</td><td><a href="http://localhost:9090/" target="_blank" rel="noopener noreferrer" class="">http://localhost:9090</a></td><td>N/A</td></tr><tr><td><strong>Grafana</strong> (dashboards)</td><td><a href="http://localhost:3000/" target="_blank" rel="noopener noreferrer" class="">http://localhost:3000</a></td><td>admin / admin</td></tr></tbody></table>
<h4 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="jaeger-distributed-traces">Jaeger: Distributed Traces<a href="https://llamastack.github.io/blog/observability-for-llama-stack#jaeger-distributed-traces" class="hash-link" aria-label="Direct link to Jaeger: Distributed Traces" title="Direct link to Jaeger: Distributed Traces" translate="no">​</a></h4>
<p>Select the <code>llama-stack-server</code> or <code>my-llama-stack-app</code> service to see request traces. Each trace shows the full request lifecycle — client HTTP call → FastAPI handler → inference provider call → database operations. You can pinpoint exactly where time is spent.</p>
<p><img decoding="async" loading="lazy" alt="Jaeger distributed traces for Llama Stack" src="https://llamastack.github.io/assets/images/jaeger-d1b593c5b0d53a3539eb1d97bbd454e9.png" width="3724" height="2180" class="img_ev3q"></p>
<h4 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="prometheus-metrics-queries">Prometheus: Metrics Queries<a href="https://llamastack.github.io/blog/observability-for-llama-stack#prometheus-metrics-queries" class="hash-link" aria-label="Direct link to Prometheus: Metrics Queries" title="Direct link to Prometheus: Metrics Queries" translate="no">​</a></h4>
<p>Some useful PromQL to get you started:</p>
<table><thead><tr><th>What you want to know</th><th>PromQL</th></tr></thead><tbody><tr><td>Input token usage by model</td><td><code>sum by(gen_ai_request_model) (llama_stack_gen_ai_client_token_usage_sum{gen_ai_token_type="input"})</code></td></tr><tr><td>Output token usage by model</td><td><code>sum by(gen_ai_request_model) (llama_stack_gen_ai_client_token_usage_sum{gen_ai_token_type="output"})</code></td></tr><tr><td>P95 HTTP server latency</td><td><code>histogram_quantile(0.95, rate(llama_stack_http_server_duration_milliseconds_bucket[5m]))</code></td></tr><tr><td>P99 inference duration</td><td><code>histogram_quantile(0.99, rate(llama_stack_inference_duration_seconds_bucket[5m]))</code></td></tr><tr><td>P95 TTFT by model</td><td><code>histogram_quantile(0.95, rate(llama_stack_inference_time_to_first_token_seconds_bucket[5m]))</code></td></tr><tr><td>Median tokens/sec by provider</td><td><code>histogram_quantile(0.5, rate(llama_stack_inference_tokens_per_second_bucket[5m]))</code></td></tr><tr><td>Tool invocation errors</td><td><code>rate(llama_stack_tool_runtime_invocations_total{status="error"}[5m])</code></td></tr></tbody></table>
<p><img decoding="async" loading="lazy" alt="Prometheus metrics for Llama Stack" src="https://llamastack.github.io/assets/images/prometheus-341a2f2c004dc89633a5df3543eef620.png" width="2990" height="1652" class="img_ev3q"></p>
<h4 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="grafana-pre-built-dashboard">Grafana: Pre-built Dashboard<a href="https://llamastack.github.io/blog/observability-for-llama-stack#grafana-pre-built-dashboard" class="hash-link" aria-label="Direct link to Grafana: Pre-built Dashboard" title="Direct link to Grafana: Pre-built Dashboard" translate="no">​</a></h4>
<p>A <strong>Llama Stack</strong> dashboard is automatically provisioned with panels for prompt tokens, completion tokens, P95/P99 HTTP duration, and request volume. It's a starting point — extend it with the PromQL queries above for inference-specific views.</p>
<p><img decoding="async" loading="lazy" alt="Grafana pre-built dashboard for Llama Stack" src="https://llamastack.github.io/assets/images/grafana-4066c802715b016a24f501d6e181c9c2.png" width="3256" height="2006" class="img_ev3q"></p>
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="step-6-set-up-alerts">Step 6: Set Up Alerts<a href="https://llamastack.github.io/blog/observability-for-llama-stack#step-6-set-up-alerts" class="hash-link" aria-label="Direct link to Step 6: Set Up Alerts" title="Direct link to Step 6: Set Up Alerts" translate="no">​</a></h3>
<p>With metrics in Prometheus, you can set up alerts for the things that actually page you at 3 AM; a sketch for sanity-checking these expressions follows the list:</p>
<ul>
<li class=""><strong>High latency</strong>: P99 inference duration &gt; 10s sustained for 5 minutes</li>
<li class=""><strong>Error rate spike</strong>: Error rate &gt; 5% over a 5-minute window</li>
<li class=""><strong>Provider down</strong>: Zero successful requests to a provider for 2 minutes</li>
<li class=""><strong>Capacity warning</strong>: Concurrent requests consistently above threshold</li>
</ul>
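<p>These conditions translate directly into PromQL. Before wiring them into Alertmanager, you can check an expression against the Prometheus HTTP API; a quick sketch using the P99 inference-duration query from the table above:</p>
<pre class="prism-code language-python"><code>import requests

# Evaluate an alert expression once via Prometheus' instant-query API.
query = (
    "histogram_quantile(0.99, "
    "rate(llama_stack_inference_duration_seconds_bucket[5m]))"
)
resp = requests.get(
    "http://localhost:9090/api/v1/query",
    params={"query": query},
    timeout=10,
)
for result in resp.json()["data"]["result"]:
    print(result["metric"], result["value"])
</code></pre>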
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="cleanup">Cleanup<a href="https://llamastack.github.io/blog/observability-for-llama-stack#cleanup" class="hash-link" aria-label="Direct link to Cleanup" title="Direct link to Cleanup" translate="no">​</a></h3>
<div class="language-bash codeBlockContainer_Ckt0 theme-code-block" style="--prism-background-color:hsl(220, 13%, 18%);--prism-color:hsl(220, 14%, 71%)"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-bash codeBlock_bY9V thin-scrollbar" style="background-color:hsl(220, 13%, 18%);color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token function" style="color:hsl(207, 82%, 66%)">docker</span><span class="token plain"> stop jaeger otel-collector prometheus grafana</span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain"></span><span class="token function" style="color:hsl(207, 82%, 66%)">docker</span><span class="token plain"> </span><span class="token function" style="color:hsl(207, 82%, 66%)">rm</span><span class="token plain"> jaeger otel-collector prometheus grafana</span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain"></span><span class="token function" style="color:hsl(207, 82%, 66%)">docker</span><span class="token plain"> network </span><span class="token function" style="color:hsl(207, 82%, 66%)">rm</span><span class="token plain"> llama-telemetry</span><br></span></code></pre></div></div>
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="whats-next">What's Next<a href="https://llamastack.github.io/blog/observability-for-llama-stack#whats-next" class="hash-link" aria-label="Direct link to What's Next" title="Direct link to What's Next" translate="no">​</a></h2>
<p>The instrumentation is in place, and we're planning to expand it. If you have ideas for metrics that would help you operate Llama Stack in production or if you've built interesting dashboards on top of what's there, we'd love to hear about it. Open an issue or check the <a href="https://github.com/llamastack/llama-stack/blob/main/CONTRIBUTING.md" target="_blank" rel="noopener noreferrer" class="">contributing guide</a>.</p>]]></content>
        <author>
            <name>Guangya Liu</name>
            <uri>https://github.com/gyliu513</uri>
        </author>
        <category label="observability" term="observability"/>
        <category label="opentelemetry" term="opentelemetry"/>
        <category label="metrics" term="metrics"/>
        <category label="tracing" term="tracing"/>
        <category label="monitoring" term="monitoring"/>
    </entry>
    <entry>
        <title type="html"><![CDATA[Llama Stack Achieves 100% Open Responses Compliance: Enterprise-Grade OpenAI Compatibility for Your Infrastructure]]></title>
        <id>https://llamastack.github.io/blog/open-responses-openai-compatibility</id>
        <link href="https://llamastack.github.io/blog/open-responses-openai-compatibility"/>
        <updated>2026-03-20T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[We're excited to share that Llama Stack has achieved 100% compliance with the Open Responses specification and been officially recognized as part of the Open Responses community. This milestone represents more than just compatibility: it's about bringing enterprise-grade AI capabilities to your own infrastructure with the familiarity of OpenAI APIs.]]></summary>
        <content type="html"><![CDATA[<p>We're excited to share that Llama Stack has achieved <strong>100% compliance with the Open Responses specification</strong> and been officially recognized as part of the <a href="https://github.com/openresponses/openresponses/pull/29" target="_blank" rel="noopener noreferrer" class="">Open Responses community</a>. This milestone represents more than just compatibility: it's about bringing enterprise-grade AI capabilities to your own infrastructure with the familiarity of OpenAI APIs.</p>
<p>With comprehensive support for Files, Vector Stores, Search, Conversations, Prompts, Chat Completions, the full Responses API, plus powerful extensions like MCP tool integration, Tool Calling, and Connectors, Llama Stack offers something unique in the AI infrastructure landscape: a SaaS-like experience that runs entirely on your terms.</p>
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="recognition-by-the-open-responses-community">Recognition by the Open Responses Community<a href="https://llamastack.github.io/blog/open-responses-openai-compatibility#recognition-by-the-open-responses-community" class="hash-link" aria-label="Direct link to Recognition by the Open Responses Community" title="Direct link to Recognition by the Open Responses Community" translate="no">​</a></h2>
<p>The <a href="https://www.openresponses.org/" target="_blank" rel="noopener noreferrer" class="">Open Responses initiative</a> represents a collaborative effort to standardize agentic AI interfaces across the industry, with backing from OpenAI, Hugging Face, and leading providers like Ollama, vLLM, and LM Studio. Our acceptance into this community validates Llama Stack's commitment to open standards and interoperability.</p>
<p>What makes this recognition particularly meaningful is our approach to compliance. We don't just aim for compatibility—<strong>we run the full Open Responses acceptance test suite on every pull request as a blocking requirement</strong>. This means our perfect 6/6 test pass rate isn't a one-time achievement; it's a maintained standard that ensures consistent, reliable behavior for developers building on open standards.</p>
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="comprehensive-openai-api-feature-support">Comprehensive OpenAI API Feature Support<a href="https://llamastack.github.io/blog/open-responses-openai-compatibility#comprehensive-openai-api-feature-support" class="hash-link" aria-label="Direct link to Comprehensive OpenAI API Feature Support" title="Direct link to Comprehensive OpenAI API Feature Support" translate="no">​</a></h2>
<p>Llama Stack delivers comprehensive feature parity across multiple API surfaces, giving you the full power of modern AI APIs.</p>
<blockquote>
<p><strong>A note on model IDs:</strong> The model ID you pass depends on the inference provider backing your Llama Stack server. For example, with Ollama you'd use <code>ollama/llama3.2:3b</code>, while with Fireworks or Together you'd use the HuggingFace-style <code>meta-llama/Llama-3.2-3B-Instruct</code>. The API calls are identical either way.</p>
</blockquote>
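<p>For instance, the same chat completion against an Ollama-backed server looks like this (base URL and model ID as in the note above; illustrative only):</p>
<pre class="prism-code language-python"><code>from openai import OpenAI

client = OpenAI(base_url="http://localhost:8321/v1/", api_key="none")

# Identical call shape; only the model ID reflects the backing provider.
response = client.chat.completions.create(
    model="ollama/llama3.2:3b",
    messages=[{"role": "user", "content": "Say hello."}],
)
print(response.choices[0].message.content)
</code></pre>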
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="files-api---openai-compatible-document-management"><strong>Files API</strong> - OpenAI-Compatible Document Management<a href="https://llamastack.github.io/blog/open-responses-openai-compatibility#files-api---openai-compatible-document-management" class="hash-link" aria-label="Direct link to files-api---openai-compatible-document-management" title="Direct link to files-api---openai-compatible-document-management" translate="no">​</a></h3>
<p>Upload, manage, and process documents with the same interface you'd use with OpenAI:</p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-background-color:hsl(220, 13%, 18%);--prism-color:hsl(220, 14%, 71%)"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="background-color:hsl(220, 13%, 18%);color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token comment" style="color:hsl(220, 10%, 40%)"># Works identically with OpenAI or Llama Stack clients</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain"></span><span class="token keyword" style="color:hsl(286, 60%, 67%)">from</span><span class="token plain"> openai </span><span class="token keyword" style="color:hsl(286, 60%, 67%)">import</span><span class="token plain"> OpenAI</span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">client </span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token plain"> OpenAI</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">(</span><span class="token plain">base_url</span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token string" style="color:hsl(95, 38%, 62%)">"http://localhost:8321/v1/"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"> api_key</span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token string" style="color:hsl(95, 38%, 62%)">"none"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain"></span><span class="token builtin" style="color:hsl(95, 38%, 62%)">file</span><span class="token plain"> </span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token plain"> client</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">files</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">create</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">(</span><span class="token builtin" style="color:hsl(95, 38%, 62%)">file</span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token builtin" style="color:hsl(95, 38%, 62%)">open</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">(</span><span class="token string" style="color:hsl(95, 38%, 62%)">"document.pdf"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"> </span><span class="token string" style="color:hsl(95, 38%, 62%)">"rb"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">)</span><span class="token 
punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"> purpose</span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token string" style="color:hsl(95, 38%, 62%)">"assistants"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain"></span><span class="token comment" style="color:hsl(220, 10%, 40%)"># List and manage files</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">files </span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token plain"> client</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">files</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token builtin" style="color:hsl(95, 38%, 62%)">list</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">(</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">content </span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token plain"> client</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">files</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">content</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">(</span><span class="token builtin" style="color:hsl(95, 38%, 62%)">file</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token builtin" style="color:hsl(95, 38%, 62%)">id</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">)</span><br></span></code></pre></div></div>
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="vector-stores-api---rag-without-vendor-lock-in"><strong>Vector Stores API</strong> - RAG Without Vendor Lock-in<a href="https://llamastack.github.io/blog/open-responses-openai-compatibility#vector-stores-api---rag-without-vendor-lock-in" class="hash-link" aria-label="Direct link to vector-stores-api---rag-without-vendor-lock-in" title="Direct link to vector-stores-api---rag-without-vendor-lock-in" translate="no">​</a></h3>
<p>Build retrieval-augmented generation applications using the full Vector Stores API:</p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-background-color:hsl(220, 13%, 18%);--prism-color:hsl(220, 14%, 71%)"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="background-color:hsl(220, 13%, 18%);color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token comment" style="color:hsl(220, 10%, 40%)"># Create vector stores with nested file management</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">vector_store </span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token plain"> client</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">vector_stores</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">create</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">(</span><span class="token plain">name</span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token string" style="color:hsl(95, 38%, 62%)">"knowledge_base"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain"></span><span class="token comment" style="color:hsl(220, 10%, 40%)"># Add files and manage vector store content</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">client</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">vector_stores</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">files</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">create</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">(</span><span class="token plain">vector_store_id</span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token plain">vector_store</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token builtin" style="color:hsl(95, 38%, 62%)">id</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"> file_id</span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token builtin" style="color:hsl(95, 38%, 62%)">file</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token builtin" style="color:hsl(95, 38%, 62%)">id</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" 
style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain"></span><span class="token comment" style="color:hsl(220, 10%, 40%)"># Search functionality built-in</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">results </span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token plain"> client</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">vector_stores</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">search</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">(</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">    vector_store_id</span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token plain">vector_store</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token builtin" style="color:hsl(95, 38%, 62%)">id</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"> query</span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token string" style="color:hsl(95, 38%, 62%)">"What is our refund policy?"</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain"></span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">)</span><br></span></code></pre></div></div>
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="conversations-api---persistent-context-management"><strong>Conversations API</strong> - Persistent Context Management<a href="https://llamastack.github.io/blog/open-responses-openai-compatibility#conversations-api---persistent-context-management" class="hash-link" aria-label="Direct link to conversations-api---persistent-context-management" title="Direct link to conversations-api---persistent-context-management" translate="no">​</a></h3>
<p>Manage conversation state and continuity across interactions:</p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-background-color:hsl(220, 13%, 18%);--prism-color:hsl(220, 14%, 71%)"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="background-color:hsl(220, 13%, 18%);color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token comment" style="color:hsl(220, 10%, 40%)"># Create a conversation</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">conversation </span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token plain"> client</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">conversations</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">create</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">(</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain"></span><span class="token comment" style="color:hsl(220, 10%, 40%)"># Add items to a conversation</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">client</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">conversations</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">items</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">create</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">(</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">    conversation_id</span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token plain">conversation</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token builtin" style="color:hsl(95, 38%, 62%)">id</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">    items</span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">[</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">        </span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">{</span><span class="token string" style="color:hsl(95, 38%, 62%)">"role"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">:</span><span 
class="token plain"> </span><span class="token string" style="color:hsl(95, 38%, 62%)">"user"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"> </span><span class="token string" style="color:hsl(95, 38%, 62%)">"content"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">:</span><span class="token plain"> </span><span class="token string" style="color:hsl(95, 38%, 62%)">"Tell me about our product features"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">}</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">        </span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">{</span><span class="token string" style="color:hsl(95, 38%, 62%)">"role"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">:</span><span class="token plain"> </span><span class="token string" style="color:hsl(95, 38%, 62%)">"assistant"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"> </span><span class="token string" style="color:hsl(95, 38%, 62%)">"content"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">:</span><span class="token plain"> </span><span class="token string" style="color:hsl(95, 38%, 62%)">"I'd be happy to explain..."</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">}</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">    </span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">]</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain"></span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain"></span><span class="token comment" style="color:hsl(220, 10%, 40%)"># Retrieve conversation history</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">items </span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token plain"> client</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">conversations</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">items</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token builtin" style="color:hsl(95, 38%, 62%)">list</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">(</span><span class="token plain">conversation_id</span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token 
plain">conversation</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token builtin" style="color:hsl(95, 38%, 62%)">id</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">)</span><br></span></code></pre></div></div>
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="chat-completions--responses---simple-chat-to-agentic-workflows"><strong>Chat Completions &amp; Responses</strong> - Simple Chat to Agentic Workflows<a href="https://llamastack.github.io/blog/open-responses-openai-compatibility#chat-completions--responses---simple-chat-to-agentic-workflows" class="hash-link" aria-label="Direct link to chat-completions--responses---simple-chat-to-agentic-workflows" title="Direct link to chat-completions--responses---simple-chat-to-agentic-workflows" translate="no">​</a></h3>
<p>From straightforward inference to multi-tool orchestration:</p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-background-color:hsl(220, 13%, 18%);--prism-color:hsl(220, 14%, 71%)"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="background-color:hsl(220, 13%, 18%);color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token comment" style="color:hsl(220, 10%, 40%)"># Standard chat completions (e.g., with Ollama)</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">completion </span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token plain"> client</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">chat</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">completions</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">create</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">(</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">    model</span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token string" style="color:hsl(95, 38%, 62%)">"ollama/gpt-oss:20b"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"> messages</span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">[</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">{</span><span class="token string" style="color:hsl(95, 38%, 62%)">"role"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">:</span><span class="token plain"> </span><span class="token string" style="color:hsl(95, 38%, 62%)">"user"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"> </span><span class="token string" style="color:hsl(95, 38%, 62%)">"content"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">:</span><span class="token plain"> </span><span class="token string" style="color:hsl(95, 38%, 62%)">"Explain RAG"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">}</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">]</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain"></span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain"></span><span class="token comment" style="color:hsl(220, 10%, 40%)"># Advanced responses with tool orchestration (e.g., with Fireworks)</span><span class="token plain"></span><br></span><span class="token-line" 
style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">response </span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token plain"> client</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">responses</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">create</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">(</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">    model</span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token string" style="color:hsl(95, 38%, 62%)">"ollama/gpt-oss:20b"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">    </span><span class="token builtin" style="color:hsl(95, 38%, 62%)">input</span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token string" style="color:hsl(95, 38%, 62%)">"What documents mention our pricing strategy?"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">    tools</span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">[</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">{</span><span class="token string" style="color:hsl(95, 38%, 62%)">"type"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">:</span><span class="token plain"> </span><span class="token string" style="color:hsl(95, 38%, 62%)">"file_search"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">}</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">]</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain"></span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">)</span><br></span></code></pre></div></div>
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="prompts-api---programmatic-prompt-management"><strong>Prompts API</strong> - Programmatic Prompt Management<a href="https://llamastack.github.io/blog/open-responses-openai-compatibility#prompts-api---programmatic-prompt-management" class="hash-link" aria-label="Direct link to prompts-api---programmatic-prompt-management" title="Direct link to prompts-api---programmatic-prompt-management" translate="no">​</a></h3>
<p>Llama Stack extends OpenAI compatibility with full programmatic prompt management. With OpenAI, prompts are created through their admin portal and referenced by ID in the Responses API. Llama Stack provides the same referencing pattern, plus a complete CRUD API for creating and managing prompts programmatically:</p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-background-color:hsl(220, 13%, 18%);--prism-color:hsl(220, 14%, 71%)"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="background-color:hsl(220, 13%, 18%);color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token keyword" style="color:hsl(286, 60%, 67%)">from</span><span class="token plain"> llama_stack_client </span><span class="token keyword" style="color:hsl(286, 60%, 67%)">import</span><span class="token plain"> LlamaStackClient</span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">ls_client </span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token plain"> LlamaStackClient</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">(</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain"></span><span class="token comment" style="color:hsl(220, 10%, 40%)"># Create reusable prompt templates with variables</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">prompt </span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token plain"> ls_client</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">prompts</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">create</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">(</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">    prompt</span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token string" style="color:hsl(95, 38%, 62%)">"You are a {{ role }} assistant. 
Analyze this: {{ content }}"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">    variables</span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">[</span><span class="token string" style="color:hsl(95, 38%, 62%)">"role"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"> </span><span class="token string" style="color:hsl(95, 38%, 62%)">"content"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">]</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain"></span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain"></span><span class="token comment" style="color:hsl(220, 10%, 40%)"># Reference prompts in responses — compatible with OpenAI's pattern</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">response </span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token plain"> client</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">responses</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">create</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">(</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">    model</span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token string" style="color:hsl(95, 38%, 62%)">"ollama/gpt-oss:20b"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">    </span><span class="token builtin" style="color:hsl(95, 38%, 62%)">input</span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">[</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">{</span><span class="token string" style="color:hsl(95, 38%, 62%)">"role"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">:</span><span class="token plain"> </span><span class="token string" style="color:hsl(95, 38%, 62%)">"user"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"> </span><span class="token string" style="color:hsl(95, 38%, 62%)">"content"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">:</span><span class="token plain"> 
</span><span class="token string" style="color:hsl(95, 38%, 62%)">"Review our Q1 report"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">}</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">]</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">    prompt</span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">{</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">        </span><span class="token string" style="color:hsl(95, 38%, 62%)">"id"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">:</span><span class="token plain"> prompt</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">prompt_id</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">        </span><span class="token string" style="color:hsl(95, 38%, 62%)">"variables"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">:</span><span class="token plain"> </span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">{</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">            </span><span class="token string" style="color:hsl(95, 38%, 62%)">"role"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">:</span><span class="token plain"> </span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">{</span><span class="token string" style="color:hsl(95, 38%, 62%)">"type"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">:</span><span class="token plain"> </span><span class="token string" style="color:hsl(95, 38%, 62%)">"input_text"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"> </span><span class="token string" style="color:hsl(95, 38%, 62%)">"text"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">:</span><span class="token plain"> </span><span class="token string" style="color:hsl(95, 38%, 62%)">"financial analyst"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">}</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">            </span><span class="token string" style="color:hsl(95, 38%, 62%)">"content"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">:</span><span class="token plain"> </span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">{</span><span class="token string" style="color:hsl(95, 38%, 62%)">"type"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">:</span><span class="token plain"> </span><span class="token string" style="color:hsl(95, 38%, 62%)">"input_text"</span><span class="token 
punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"> </span><span class="token string" style="color:hsl(95, 38%, 62%)">"text"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">:</span><span class="token plain"> </span><span class="token string" style="color:hsl(95, 38%, 62%)">"Q1 2026 earnings report"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">}</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">        </span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">}</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">    </span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">}</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain"></span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">)</span><br></span></code></pre></div></div>
<p>This gives you the best of both worlds: compatibility with OpenAI's prompt referencing pattern in the Responses API, plus the ability to manage prompts as code rather than through a web interface.</p>
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="mcp-integration---extensible-tool-ecosystem"><strong>MCP Integration</strong> - Extensible Tool Ecosystem<a href="https://llamastack.github.io/blog/open-responses-openai-compatibility#mcp-integration---extensible-tool-ecosystem" class="hash-link" aria-label="Direct link to mcp-integration---extensible-tool-ecosystem" title="Direct link to mcp-integration---extensible-tool-ecosystem" translate="no">​</a></h3>
<p>Leverage the Model Context Protocol to connect to any MCP server and dynamically discover tools:</p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-background-color:hsl(220, 13%, 18%);--prism-color:hsl(220, 14%, 71%)"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="background-color:hsl(220, 13%, 18%);color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token comment" style="color:hsl(220, 10%, 40%)"># Connect to MCP servers for dynamic tool discovery</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">response </span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token plain"> client</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">responses</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">create</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">(</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">    model</span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token string" style="color:hsl(95, 38%, 62%)">"ollama/gpt-oss:20b"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">    </span><span class="token builtin" style="color:hsl(95, 38%, 62%)">input</span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token string" style="color:hsl(95, 38%, 62%)">"What parks are in Rhode Island, and are there upcoming events?"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">    tools</span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">[</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">        </span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">{</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">            </span><span class="token string" style="color:hsl(95, 38%, 62%)">"type"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">:</span><span class="token plain"> </span><span class="token string" style="color:hsl(95, 38%, 62%)">"mcp"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">            </span><span class="token string" style="color:hsl(95, 38%, 62%)">"server_label"</span><span class="token punctuation" style="color:hsl(220, 
14%, 71%)">:</span><span class="token plain"> </span><span class="token string" style="color:hsl(95, 38%, 62%)">"parks-service"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">            </span><span class="token string" style="color:hsl(95, 38%, 62%)">"server_url"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">:</span><span class="token plain"> </span><span class="token string" style="color:hsl(95, 38%, 62%)">"http://parks-mcp-server:8000/sse"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">        </span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">}</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">    </span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">]</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain"></span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">)</span><br></span></code></pre></div></div>
<p>MCP tools support per-request authorization, allowed tool filtering, and automatic session management. Connect to databases, APIs, and internal services through the growing ecosystem of standard MCP servers—no custom integration work required.</p>
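<p>For instance, a single request can scope which tools the model may call and attach a per-request credential for the MCP server. The following is a minimal sketch, assuming the <code>allowed_tools</code> and <code>headers</code> fields of the Responses API MCP tool type; the tool name and token are placeholders:</p>
<pre><code class="language-python"># Sketch: restrict tool access and pass per-request credentials to an MCP server.
# "allowed_tools" and "headers" are assumed here to follow the Responses API
# MCP tool shape; "YOUR_TOKEN" is a placeholder, not a real credential.
response = client.responses.create(
    model="ollama/gpt-oss:20b",
    input="What parks are in Rhode Island?",
    tools=[
        {
            "type": "mcp",
            "server_label": "parks-service",
            "server_url": "http://parks-mcp-server:8000/sse",
            "allowed_tools": ["search_parks"],  # expose only this tool to the model
            "headers": {"Authorization": "Bearer YOUR_TOKEN"},  # per-request auth
        }
    ],
)</code></pre>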
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="connectors---declarative-service-integration"><strong>Connectors</strong> - Declarative Service Integration<a href="https://llamastack.github.io/blog/open-responses-openai-compatibility#connectors---declarative-service-integration" class="hash-link" aria-label="Direct link to connectors---declarative-service-integration" title="Direct link to connectors---declarative-service-integration" translate="no">​</a></h3>
<p>Connectors provide a configuration-driven approach to integrating external services with your Llama Stack deployment. Define your data sources and services in your stack configuration, and they're automatically available as tools for your agents to use.</p>
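<p>As a quick sanity check (a sketch assuming the Python client's toolgroup listing API), you can verify that connectors declared in your stack configuration surface as tools at runtime:</p>
<pre><code class="language-python">from llama_stack_client import LlamaStackClient

client = LlamaStackClient(base_url="http://localhost:8321")

# Connectors declared in the stack configuration appear alongside
# built-in toolgroups; no per-application wiring is required.
for toolgroup in client.toolgroups.list():
    print(toolgroup.identifier)</code></pre>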
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="the-value-proposition-saas-experience-your-infrastructure">The Value Proposition: SaaS Experience, Your Infrastructure<a href="https://llamastack.github.io/blog/open-responses-openai-compatibility#the-value-proposition-saas-experience-your-infrastructure" class="hash-link" aria-label="Direct link to The Value Proposition: SaaS Experience, Your Infrastructure" title="Direct link to The Value Proposition: SaaS Experience, Your Infrastructure" translate="no">​</a></h2>
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="data-sovereignty--security"><strong>Data Sovereignty &amp; Security</strong><a href="https://llamastack.github.io/blog/open-responses-openai-compatibility#data-sovereignty--security" class="hash-link" aria-label="Direct link to data-sovereignty--security" title="Direct link to data-sovereignty--security" translate="no">​</a></h3>
<p>For regulated industries like finance, healthcare, and government, sending sensitive documents to external APIs isn't an option. Llama Stack solves this by running entirely on your infrastructure:</p>
<ul>
<li><strong>Documents never leave your environment</strong>: RAG pipelines, vector storage, and model inference all happen locally</li>
<li><strong>Compliance-ready</strong>: Meet HIPAA, SOC 2, GDPR, and other regulatory requirements</li>
<li><strong>Audit trails</strong>: Full visibility into data processing and model decisions</li>
</ul>
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="cost-control--predictability"><strong>Cost Control &amp; Predictability</strong><a href="https://llamastack.github.io/blog/open-responses-openai-compatibility#cost-control--predictability" class="hash-link" aria-label="Direct link to cost-control--predictability" title="Direct link to cost-control--predictability" translate="no">​</a></h3>
<p>Unlike consumption-based pricing models, Llama Stack offers:</p>
<ul>
<li><strong>Fixed infrastructure costs</strong>: Pay for compute, not tokens</li>
<li><strong>No usage surprises</strong>: Predictable costs regardless of application load</li>
<li><strong>Efficient resource utilization</strong>: Choose the right model size for your use case</li>
</ul>
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="model-freedom"><strong>Model Freedom</strong><a href="https://llamastack.github.io/blog/open-responses-openai-compatibility#model-freedom" class="hash-link" aria-label="Direct link to model-freedom" title="Direct link to model-freedom" translate="no">​</a></h3>
<p>Break free from vendor-specific models:</p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-background-color:hsl(220, 13%, 18%);--prism-color:hsl(220, 14%, 71%)"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="background-color:hsl(220, 13%, 18%);color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token comment" style="color:hsl(220, 10%, 40%)"># Same API, different models — swap without code changes</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain"></span><span class="token keyword" style="color:hsl(286, 60%, 67%)">for</span><span class="token plain"> model </span><span class="token keyword" style="color:hsl(286, 60%, 67%)">in</span><span class="token plain"> </span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">[</span><span class="token string" style="color:hsl(95, 38%, 62%)">"ollama/gpt-oss:20b"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"> </span><span class="token string" style="color:hsl(95, 38%, 62%)">"ollama/llama3.2:3b"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"> </span><span class="token string" style="color:hsl(95, 38%, 62%)">"your-org/custom-model"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">]</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">    response </span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token plain"> client</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">chat</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">completions</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">create</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">(</span><span class="token plain">model</span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token plain">model</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"> messages</span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token plain">messages</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">)</span><br></span></code></pre></div></div>
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="getting-started-in-minutes">Getting Started in Minutes<a href="https://llamastack.github.io/blog/open-responses-openai-compatibility#getting-started-in-minutes" class="hash-link" aria-label="Direct link to Getting Started in Minutes" title="Direct link to Getting Started in Minutes" translate="no">​</a></h2>
<p>Whether you're prototyping locally or deploying at scale, Llama Stack makes it easy:</p>
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="local-development"><strong>Local Development</strong><a href="https://llamastack.github.io/blog/open-responses-openai-compatibility#local-development" class="hash-link" aria-label="Direct link to local-development" title="Direct link to local-development" translate="no">​</a></h3>
<div class="language-bash codeBlockContainer_Ckt0 theme-code-block" style="--prism-background-color:hsl(220, 13%, 18%);--prism-color:hsl(220, 14%, 71%)"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-bash codeBlock_bY9V thin-scrollbar" style="background-color:hsl(220, 13%, 18%);color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token comment" style="color:hsl(220, 10%, 40%)"># Set up your environment</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">uv venv </span><span class="token parameter variable" style="color:hsl(207, 82%, 66%)">--python</span><span class="token plain"> </span><span class="token number" style="color:hsl(29, 54%, 61%)">3.12</span><span class="token plain"> </span><span class="token parameter variable" style="color:hsl(207, 82%, 66%)">--seed</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain"></span><span class="token builtin class-name" style="color:hsl(29, 54%, 61%)">source</span><span class="token plain"> .venv/bin/activate</span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">uv pip </span><span class="token function" style="color:hsl(207, 82%, 66%)">install</span><span class="token plain"> </span><span class="token parameter variable" style="color:hsl(207, 82%, 66%)">-U</span><span class="token plain"> llama-stack</span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">uv run llama stack list-deps starter </span><span class="token operator" style="color:hsl(207, 82%, 66%)">|</span><span class="token plain"> </span><span class="token function" style="color:hsl(207, 82%, 66%)">xargs</span><span class="token plain"> </span><span class="token parameter variable" style="color:hsl(207, 82%, 66%)">-L1</span><span class="token plain"> uv pip </span><span class="token function" style="color:hsl(207, 82%, 66%)">install</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain"></span><span class="token comment" style="color:hsl(220, 10%, 40%)"># Start Ollama and pull a model</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">ollama serve</span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">ollama run gpt-oss:20b</span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain"></span><span class="token comment" style="color:hsl(220, 10%, 40%)"># Launch Llama Stack with the starter 
distribution</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain"></span><span class="token assign-left variable" style="color:hsl(207, 82%, 66%)">OLLAMA_URL</span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token plain">http://localhost:11434/v1 uv run llama stack run starter</span><br></span></code></pre></div></div>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-background-color:hsl(220, 13%, 18%);--prism-color:hsl(220, 14%, 71%)"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="background-color:hsl(220, 13%, 18%);color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token comment" style="color:hsl(220, 10%, 40%)"># Use with the OpenAI client</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain"></span><span class="token keyword" style="color:hsl(286, 60%, 67%)">from</span><span class="token plain"> openai </span><span class="token keyword" style="color:hsl(286, 60%, 67%)">import</span><span class="token plain"> OpenAI</span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">client </span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token plain"> OpenAI</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">(</span><span class="token plain">base_url</span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token string" style="color:hsl(95, 38%, 62%)">"http://localhost:8321/v1"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"> api_key</span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token string" style="color:hsl(95, 38%, 62%)">"none"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">response </span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token plain"> client</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">responses</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">create</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">(</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">    model</span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token string" style="color:hsl(95, 38%, 62%)">"ollama/gpt-oss:20b"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"> </span><span class="token builtin" style="color:hsl(95, 38%, 62%)">input</span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token string" style="color:hsl(95, 38%, 62%)">"Write a haiku about open source."</span><span class="token plain"></span><br></span><span class="token-line" 
style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain"></span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain"></span><span class="token keyword" style="color:hsl(286, 60%, 67%)">print</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">(</span><span class="token plain">response</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">output_text</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">)</span><br></span></code></pre></div></div>
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="production-deployment"><strong>Production Deployment</strong><a href="https://llamastack.github.io/blog/open-responses-openai-compatibility#production-deployment" class="hash-link" aria-label="Direct link to production-deployment" title="Direct link to production-deployment" translate="no">​</a></h3>
<div class="language-bash codeBlockContainer_Ckt0 theme-code-block" style="--prism-background-color:hsl(220, 13%, 18%);--prism-color:hsl(220, 14%, 71%)"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-bash codeBlock_bY9V thin-scrollbar" style="background-color:hsl(220, 13%, 18%);color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token comment" style="color:hsl(220, 10%, 40%)"># Deploy with your preferred infrastructure</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain"></span><span class="token comment" style="color:hsl(220, 10%, 40%)"># Docker, Kubernetes, or bare metal — your choice</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain"></span><span class="token function" style="color:hsl(207, 82%, 66%)">docker</span><span class="token plain"> run </span><span class="token parameter variable" style="color:hsl(207, 82%, 66%)">-p</span><span class="token plain"> </span><span class="token number" style="color:hsl(29, 54%, 61%)">8321</span><span class="token plain">:8321 llamastack/distribution-starter:latest</span><br></span></code></pre></div></div>
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="framework-ecosystem-compatibility">Framework Ecosystem Compatibility<a href="https://llamastack.github.io/blog/open-responses-openai-compatibility#framework-ecosystem-compatibility" class="hash-link" aria-label="Direct link to Framework Ecosystem Compatibility" title="Direct link to Framework Ecosystem Compatibility" translate="no">​</a></h2>
<p>One of Llama Stack's biggest advantages is <strong>drop-in compatibility</strong> with existing tooling:</p>
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="direct-openai-client"><strong>Direct OpenAI Client</strong><a href="https://llamastack.github.io/blog/open-responses-openai-compatibility#direct-openai-client" class="hash-link" aria-label="Direct link to direct-openai-client" title="Direct link to direct-openai-client" translate="no">​</a></h3>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-background-color:hsl(220, 13%, 18%);--prism-color:hsl(220, 14%, 71%)"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="background-color:hsl(220, 13%, 18%);color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token keyword" style="color:hsl(286, 60%, 67%)">from</span><span class="token plain"> openai </span><span class="token keyword" style="color:hsl(286, 60%, 67%)">import</span><span class="token plain"> OpenAI</span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain"></span><span class="token comment" style="color:hsl(220, 10%, 40%)"># Same code, different backend</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">client </span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token plain"> OpenAI</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">(</span><span class="token plain">base_url</span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token string" style="color:hsl(95, 38%, 62%)">"http://your-llama-stack/v1"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"> api_key</span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token string" style="color:hsl(95, 38%, 62%)">"none"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">)</span><br></span></code></pre></div></div>
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="langchain-integration"><strong>LangChain Integration</strong><a href="https://llamastack.github.io/blog/open-responses-openai-compatibility#langchain-integration" class="hash-link" aria-label="Direct link to langchain-integration" title="Direct link to langchain-integration" translate="no">​</a></h3>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-background-color:hsl(220, 13%, 18%);--prism-color:hsl(220, 14%, 71%)"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="background-color:hsl(220, 13%, 18%);color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token keyword" style="color:hsl(286, 60%, 67%)">from</span><span class="token plain"> langchain_openai </span><span class="token keyword" style="color:hsl(286, 60%, 67%)">import</span><span class="token plain"> ChatOpenAI</span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain"></span><span class="token comment" style="color:hsl(220, 10%, 40%)"># Point to your Llama Stack server</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">llm </span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token plain"> ChatOpenAI</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">(</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">    base_url</span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token string" style="color:hsl(95, 38%, 62%)">"http://your-llama-stack/v1/openai/v1"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">    api_key</span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token string" style="color:hsl(95, 38%, 62%)">"none"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">    model</span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token string" style="color:hsl(95, 38%, 62%)">"ollama/gpt-oss:20b"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain"></span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">)</span><br></span></code></pre></div></div>
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="native-llama-stack-client"><strong>Native Llama Stack Client</strong><a href="https://llamastack.github.io/blog/open-responses-openai-compatibility#native-llama-stack-client" class="hash-link" aria-label="Direct link to native-llama-stack-client" title="Direct link to native-llama-stack-client" translate="no">​</a></h3>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-background-color:hsl(220, 13%, 18%);--prism-color:hsl(220, 14%, 71%)"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="background-color:hsl(220, 13%, 18%);color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token keyword" style="color:hsl(286, 60%, 67%)">from</span><span class="token plain"> llama_stack_client </span><span class="token keyword" style="color:hsl(286, 60%, 67%)">import</span><span class="token plain"> LlamaStackClient</span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain"></span><span class="token comment" style="color:hsl(220, 10%, 40%)"># Access the full Llama Stack API surface</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">client </span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token plain"> LlamaStackClient</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">(</span><span class="token plain">base_url</span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token string" style="color:hsl(95, 38%, 62%)">"http://your-llama-stack"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">)</span><br></span></code></pre></div></div>
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="built-for-open-standards">Built for Open Standards<a href="https://llamastack.github.io/blog/open-responses-openai-compatibility#built-for-open-standards" class="hash-link" aria-label="Direct link to Built for Open Standards" title="Direct link to Built for Open Standards" translate="no">​</a></h2>
<p>Our 100% Open Responses compliance reflects a broader philosophy: <strong>open standards enable innovation</strong>. When you build on Llama Stack, you're not just adopting our implementation—you're investing in an ecosystem where:</p>
<ul>
<li><strong>Applications are portable</strong>: Move between providers without rewriting code</li>
<li><strong>Standards evolve collaboratively</strong>: Community-driven development rather than vendor dictates</li>
<li><strong>Innovation is shared</strong>: Improvements benefit the entire ecosystem</li>
</ul>
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="technical-excellence-through-testing">Technical Excellence Through Testing<a href="https://llamastack.github.io/blog/open-responses-openai-compatibility#technical-excellence-through-testing" class="hash-link" aria-label="Direct link to Technical Excellence Through Testing" title="Direct link to Technical Excellence Through Testing" translate="no">​</a></h2>
<p>Achieving 100% Open Responses compliance required rigorous engineering:</p>
<ul>
<li><strong>Perfect conformance testing</strong>: Every PR runs the full Open Responses test suite, with 6/6 tests passing</li>
<li><strong>Automated compliance validation</strong>: Blocking requirements ensure compliance is maintained, not merely achieved once</li>
<li><strong>Production testing</strong>: Integration tests with real workloads and multi-provider scenarios</li>
<li><strong>Comprehensive API coverage</strong>: Full implementation of the Open Responses specification</li>
</ul>
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="whats-next">What's Next<a href="https://llamastack.github.io/blog/open-responses-openai-compatibility#whats-next" class="hash-link" aria-label="Direct link to What's Next" title="Direct link to What's Next" translate="no">​</a></h2>
<p>Llama Stack's OpenAI compatibility is just the beginning. We're actively working on:</p>
<ul>
<li class=""><strong>Enhanced streaming support</strong>: Improved real-time response handling</li>
<li class=""><strong>Extended MCP ecosystem</strong>: Deeper tool integration and connector development</li>
<li class=""><strong>Performance optimizations</strong>: Faster inference and better resource utilization</li>
<li class=""><strong>Broader OpenAI API coverage</strong>: Expanding compatibility beyond our current feature set</li>
</ul>
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="join-the-open-ai-infrastructure-movement">Join the Open AI Infrastructure Movement<a href="https://llamastack.github.io/blog/open-responses-openai-compatibility#join-the-open-ai-infrastructure-movement" class="hash-link" aria-label="Direct link to Join the Open AI Infrastructure Movement" title="Direct link to Join the Open AI Infrastructure Movement" translate="no">​</a></h2>
<p>Llama Stack represents something new in the AI infrastructure landscape: <strong>enterprise-grade capabilities without vendor lock-in</strong>. Whether you're a startup building your first AI application or an enterprise looking to bring AI workloads in-house, Llama Stack provides the reliability, security, and compatibility you need.</p>
<p>Ready to get started?</p>
<ul>
<li class=""><strong>📚 <a href="https://llamastack.github.io/docs/" target="_blank" rel="noopener noreferrer" class="">Documentation</a></strong>: Comprehensive guides and API references</li>
<li class=""><strong>🚀 <a href="https://llamastack.github.io/docs/getting_started/" target="_blank" rel="noopener noreferrer" class="">Getting Started</a></strong>: Quick setup tutorials</li>
<li class=""><strong>🔧 <a href="https://llamastack.github.io/docs/providers/openai" target="_blank" rel="noopener noreferrer" class="">OpenAI Implementation Guide</a></strong>: Detailed compatibility examples</li>
<li class=""><strong>🔌 <a href="https://llamastack.github.io/docs/building_applications/" target="_blank" rel="noopener noreferrer" class="">MCP Integration</a></strong>: Tool ecosystem and connector guides</li>
<li class=""><strong>💬 <a href="https://github.com/llamastack/llama-stack" target="_blank" rel="noopener noreferrer" class="">Community</a></strong>: Join discussions and contribute</li>
</ul>
<p>The future of AI infrastructure is open, interoperable, and under your control. Welcome to Llama Stack.</p>]]></content>
        <author>
            <name>Francisco Javier Arceo</name>
            <uri>https://github.com/franciscojavierarceo</uri>
        </author>
        <author>
            <name>Charlie Doern</name>
            <uri>https://github.com/cdoern</uri>
        </author>
        <category label="openai-compatibility" term="openai-compatibility"/>
        <category label="open-responses" term="open-responses"/>
        <category label="enterprise" term="enterprise"/>
        <category label="mcp" term="mcp"/>
        <category label="connectors" term="connectors"/>
    </entry>
    <entry>
        <title type="html"><![CDATA[Your Agent, Your Rules: Building Powerful Agents with the Responses API in Llama Stack]]></title>
        <id>https://llamastack.github.io/blog/responses-api</id>
        <link href="https://llamastack.github.io/blog/responses-api"/>
        <updated>2026-03-18T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[The Responses API is rapidly emerging as one of the most influential interfaces for building AI agents. It handles multi-step reasoning, tool orchestration, and conversational state in a single interaction, which is a big improvement over the manual orchestration loops that developers had to build on top of chat completion APIs. Llama Stack's implementation of the Responses API brings these capabilities to the open source world, where you can choose your own models and run on your own infrastructure.]]></summary>
        <content type="html"><![CDATA[<p>The <a href="https://developers.openai.com/blog/responses-api" target="_blank" rel="noopener noreferrer" class="">Responses API</a> is rapidly emerging as one of the most influential interfaces for building AI agents. It handles multi-step reasoning, tool orchestration, and conversational state in a single interaction, which is a big improvement over the manual orchestration loops that developers had to build on top of chat completion APIs. Llama Stack's implementation of the Responses API brings these capabilities to the open source world, where you can choose your own models and run on your own infrastructure.</p>
<p>This post covers why the Responses API matters, what Llama Stack's implementation enables, and how it connects to the broader move toward open agent standards like Open Responses.</p>
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="why-the-responses-api">Why the Responses API?<a href="https://llamastack.github.io/blog/responses-api#why-the-responses-api" class="hash-link" aria-label="Direct link to Why the Responses API?" title="Direct link to Why the Responses API?" translate="no">​</a></h2>
<p>Before the Responses API, building an agent that could use tools was a multi-step exercise in client-side orchestration. Your application had to call the model with a list of available tools, inspect the response for tool call requests, execute those tools, send the results back, and repeat until the model produced a final answer. All of the state management, error handling, and retry logic lived in your code.</p>
<p>This approach put a real burden on application developers. The orchestration logic got duplicated across every application, and subtle mistakes in state management could lead to poor accuracy or unnecessary model calls.</p>
<p>The Responses API moves this orchestration to the server. The client sends a question along with a set of available tools and documents, and the server handles the planning, tool execution, and synthesis internally. Your client code gets much simpler, and the behavior is more consistent because the orchestration logic is shared rather than reimplemented by every application.</p>
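<p>To make this concrete, here is a minimal sketch of a single server-orchestrated call. It assumes a Llama Stack server at <code>http://localhost:8321/v1</code>; the model name and vector store ID are illustrative placeholders:</p>
<pre><code class="language-python">from openai import OpenAI

# Point the standard OpenAI client at a Llama Stack server
# (the base_url, model, and vector store ID below are placeholders)
client = OpenAI(base_url="http://localhost:8321/v1", api_key="none")

# One call: the server plans, runs tools, and synthesizes the final answer
response = client.responses.create(
    model="llama3.1:8b",
    input="Summarize the key risks in our deployment checklist.",
    tools=[{"type": "file_search", "vector_store_ids": ["vs_abc123"]}],
)
print(response.output_text)
</code></pre>
<p>Compare that to the chat-completions pattern, where the loop of inspecting tool calls, executing them, and resending results would all live in your application.</p>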
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="what-llama-stack-brings-to-the-table">What Llama Stack brings to the table<a href="https://llamastack.github.io/blog/responses-api#what-llama-stack-brings-to-the-table" class="hash-link" aria-label="Direct link to What Llama Stack brings to the table" title="Direct link to What Llama Stack brings to the table" translate="no">​</a></h2>
<p>Llama Stack is an open source server for building AI applications. It provides a unified set of APIs for inference, RAG, tool calling, safety, evaluation, and more, backed by a pluggable provider architecture that lets you swap components without changing application code.</p>
<p>Llama Stack implements the Responses API with support for built-in RAG through <code>file_search</code>, automated multi-tool orchestration through the Model Context Protocol (MCP), conversation state management, and compatibility with the OpenAI client ecosystem.</p>
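<p>Conversation state management, for example, means the server can carry a multi-turn exchange for you. A minimal sketch, reusing the client from above (the model name is a placeholder):</p>
<pre><code class="language-python">first = client.responses.create(
    model="llama3.1:8b",
    input="What does our on-call rotation policy say about handoffs?",
)

# The server recalls the prior turn; no manual history management needed
followup = client.responses.create(
    model="llama3.1:8b",
    input="And who is responsible for updating it?",
    previous_response_id=first.id,
)
print(followup.output_text)
</code></pre>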
<p>But the interesting part is what Llama Stack adds beyond the API surface itself.</p>
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="model-freedom">Model freedom<a href="https://llamastack.github.io/blog/responses-api#model-freedom" class="hash-link" aria-label="Direct link to Model freedom" title="Direct link to Model freedom" translate="no">​</a></h3>
<p>With a proprietary hosted service, the Responses API is tied to a specific set of models from a single provider. With Llama Stack, you can use any model accessible through its inference providers: open source models like the Llama family, fine-tuned models you've created yourself, or optimized models from the broader ecosystem. The same Responses API interface works regardless of which model backs it. You can start with a small model during development, scale up for production, or swap models entirely, and your application code stays the same.</p>
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="data-sovereignty">Data sovereignty<a href="https://llamastack.github.io/blog/responses-api#data-sovereignty" class="hash-link" aria-label="Direct link to Data sovereignty" title="Direct link to Data sovereignty" translate="no">​</a></h3>
<p>If you work in a regulated industry like finance, healthcare, or government, sending sensitive documents to a third-party cloud service is often a non-starter. Llama Stack lets you run the entire stack on your own infrastructure: the model, the vector store for RAG, and the tool execution environment. Documents stay within your security perimeter, and the agent's reasoning about those documents does too.</p>
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="open-extensible-architecture">Open, extensible architecture<a href="https://llamastack.github.io/blog/responses-api#open-extensible-architecture" class="hash-link" aria-label="Direct link to Open, extensible architecture" title="Direct link to Open, extensible architecture" translate="no">​</a></h3>
<p>Llama Stack's provider architecture means you are not locked into a single implementation for any component. Need FAISS for your vector store in development and Milvus in production? Change a configuration setting. Want to use Ollama locally and a cloud inference provider in production? Same application code, different distribution. This flexibility extends across the full Llama Stack API surface, not just inference.</p>
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="private-rag-with-file_search">Private RAG with <code>file_search</code><a href="https://llamastack.github.io/blog/responses-api#private-rag-with-file_search" class="hash-link" aria-label="Direct link to private-rag-with-file_search" title="Direct link to private-rag-with-file_search" translate="no">​</a></h2>
<p>Retrieval-augmented generation (RAG) grounds a model's responses in authoritative documents, which reduces hallucination and enables accurate answers from private knowledge bases.</p>
<p>The Responses API formalizes RAG with the <code>file_search</code> tool. You create a vector store, upload documents to it, and then include <code>file_search</code> as an available tool when calling the Responses API. The model generates search queries, retrieves relevant passages, and synthesizes them into a grounded answer, all in a single API call.</p>
<p>With Llama Stack, this entire pipeline runs on your infrastructure. Document ingestion, embedding, storage, retrieval, and synthesis all happen locally. The response includes references to the source passages, so your application can provide citations for verification.</p>
<p>This makes it practical to build RAG applications over sensitive internal documents like compliance policies, medical records, or proprietary research, with confidence that the data never leaves your environment.</p>
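<p>A minimal end-to-end sketch of that pipeline, using the same OpenAI-compatible client as before (the file name and model are placeholders):</p>
<pre><code class="language-python"># Create a vector store and index a document (embedding and storage are server-side)
vs = client.vector_stores.create(name="compliance-docs")
with open("retention_policy.md", "rb") as f:
    uploaded = client.files.create(file=f, purpose="assistants")
client.vector_stores.files.create(vector_store_id=vs.id, file_id=uploaded.id)

# Ask a grounded question; retrieval and synthesis happen in one call
response = client.responses.create(
    model="llama3.1:8b",
    input="What does the retention policy say about backups?",
    tools=[{"type": "file_search", "vector_store_ids": [vs.id]}],
)
print(response.output_text)
</code></pre>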
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="multi-tool-orchestration-with-mcp">Multi-tool orchestration with MCP<a href="https://llamastack.github.io/blog/responses-api#multi-tool-orchestration-with-mcp" class="hash-link" aria-label="Direct link to Multi-tool orchestration with MCP" title="Direct link to Multi-tool orchestration with MCP" translate="no">​</a></h2>
<p>The Responses API gets especially interesting when an agent needs to coordinate multiple tools to answer a complex question. Consider a question like: "What parks are in Rhode Island, and are there any upcoming events at them?" Answering this requires discovering available tools, searching for parks, querying events for each park found, and synthesizing all the results.</p>
<p>With Llama Stack's Responses API and MCP integration, this entire workflow happens within a single API call. The model discovers available tools from a connected MCP server, plans and executes a sequence of tool calls, and produces a consolidated answer. The client application doesn't need to write any orchestration logic.</p>
<p>MCP is an open standard for tool integration, so the ecosystem of available tools is broad and growing. Any MCP server can be connected to Llama Stack and used by the Responses API, whether it provides access to databases, internal services, or external data sources.</p>
<p>Llama Stack also provides fine-grained control over tool access. You can restrict which tools are available for a given request, pass per-request authentication headers to MCP servers so that an agent can only access data for the current user, and configure tool behavior without modifying the agent's prompt. This matters a lot in production deployments where security and access control are real concerns.</p>
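<p>A hedged sketch of such a request follows; the MCP server URL, tool names, and auth header are placeholders for whatever your deployment exposes, and exact field support may vary by version:</p>
<pre><code class="language-python">response = client.responses.create(
    model="llama3.1:8b",
    input="What parks are in Rhode Island, and are there any upcoming events at them?",
    tools=[{
        "type": "mcp",
        "server_label": "parks",
        "server_url": "http://localhost:3005/sse",          # illustrative MCP server
        "allowed_tools": ["search_parks", "list_events"],   # restrict the tool surface
        "headers": {"Authorization": "Bearer user-token"},  # per-request auth passthrough
    }],
)
print(response.output_text)
</code></pre>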
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="framework-compatibility">Framework compatibility<a href="https://llamastack.github.io/blog/responses-api#framework-compatibility" class="hash-link" aria-label="Direct link to Framework compatibility" title="Direct link to Framework compatibility" translate="no">​</a></h2>
<p>Llama Stack exposes OpenAI-compatible endpoints at <code>/v1</code>, so you can use the official OpenAI Python client, the Llama Stack client, or any other client that speaks the OpenAI API. They all work the same way.</p>
<p>If you have existing code built with the OpenAI client, migrating to Llama Stack means pointing your client at your Llama Stack server. That's it. This also applies to frameworks like LangChain that build on top of OpenAI's API. Switching the inference backend to Llama Stack requires changing a constructor parameter, not rewriting your agent logic.</p>
<p>This drop-in compatibility has practical implications beyond convenience. You can develop and test against a local Llama Stack server, deploy against a production Llama Stack distribution, or switch between Llama Stack and other OpenAI-compatible providers, all with the same application code.</p>
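<p>In code, the migration is a constructor change. The sketch below shows the OpenAI client and a LangChain chat model pointed at a Llama Stack server; the URL and model name are illustrative:</p>
<pre><code class="language-python">from openai import OpenAI
from langchain_openai import ChatOpenAI

# Same client classes, different base_url; no agent logic changes
client = OpenAI(base_url="http://localhost:8321/v1", api_key="none")
llm = ChatOpenAI(model="llama3.1:8b", base_url="http://localhost:8321/v1", api_key="none")
</code></pre>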
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="toward-an-open-standard-open-responses">Toward an open standard: Open Responses<a href="https://llamastack.github.io/blog/responses-api#toward-an-open-standard-open-responses" class="hash-link" aria-label="Direct link to Toward an open standard: Open Responses" title="Direct link to Toward an open standard: Open Responses" translate="no">​</a></h2>
<p>When Llama Stack first implemented the Responses API, the specification was proprietary. Llama Stack had to track a moving target, and there was always a gap between when OpenAI added a feature and when Llama Stack could implement it.</p>
<p>The <a href="https://www.openresponses.org/" target="_blank" rel="noopener noreferrer" class="">Open Responses specification</a> changes this. Open Responses is an open source specification backed by a broad community including OpenAI, Hugging Face, and providers like Ollama, vLLM, and LM Studio. It formalizes the core concepts of the Responses API into an open standard: items as the atomic unit of context, semantic streaming events, and the agentic loop of reasoning and tool invocation.</p>
<p>For Llama Stack, Open Responses provides a stable, community-governed specification to build against rather than a proprietary one. It also means that Llama Stack's Responses API implementation is part of a broader ecosystem of interoperable providers. Applications built against the Open Responses specification can run on Llama Stack, on OpenAI, on Hugging Face's infrastructure, or on local providers like Ollama, without code changes.</p>
<p>The Open Responses specification also introduces concepts that matter for production deployments:</p>
<ul>
<li class=""><strong>Reasoning visibility:</strong> The specification formalizes how models expose their reasoning process, which enables audit trails and governance workflows.</li>
<li class=""><strong>Internal vs. external tools:</strong> A clear distinction between tools executed within the provider's infrastructure (like <code>file_search</code>) and tools executed by the client, so developers know exactly where computation happens.</li>
<li class=""><strong>Extensibility without fragmentation:</strong> Providers can add custom capabilities while maintaining a stable, interoperable core.</li>
</ul>
<p>For the Llama Stack community, this means that investing in the Responses API is about more than compatibility with one vendor. It's about building on an open standard that the industry is starting to converge around.</p>
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="getting-started">Getting started<a href="https://llamastack.github.io/blog/responses-api#getting-started" class="hash-link" aria-label="Direct link to Getting started" title="Direct link to Getting started" translate="no">​</a></h2>
<p>If you're new to Llama Stack, the <a href="https://llamastack.github.io/docs/getting_started/" target="_blank" rel="noopener noreferrer" class="">Getting Started guide</a> will walk you through setting up a server with your preferred inference provider. From there, the <a href="https://llamastack.github.io/docs/providers/openai" target="_blank" rel="noopener noreferrer" class="">OpenAI Implementation Guide</a> has examples of using the Responses API for everything from simple text generation to multi-tool agentic workflows.</p>
<p>The Responses API is still evolving, both in Llama Stack and in the Open Responses specification, and contributions are welcome. Whether it's implementing new features, improving test coverage, or reporting issues, the project benefits from developers who are building real applications and sharing what they learn.</p>]]></content>
        <author>
            <name>Bill Murdock</name>
            <uri>https://github.com/jwm4</uri>
        </author>
        <category label="responses-api" term="responses-api"/>
        <category label="agents" term="agents"/>
        <category label="rag" term="rag"/>
        <category label="mcp" term="mcp"/>
        <category label="open-responses" term="open-responses"/>
    </entry>
    <entry>
        <title type="html"><![CDATA[Building a Self-Improving Agent with Llama Stack]]></title>
        <id>https://llamastack.github.io/blog/building-agentic-flows-with-conversations-and-responses</id>
        <link href="https://llamastack.github.io/blog/building-agentic-flows-with-conversations-and-responses"/>
        <updated>2026-03-01T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[What if your AI agent could improve itself? Most agent tutorials show a single loop — user asks a question, the agent calls some tools, returns an answer. But what happens when you need to systematically improve your agent's behavior over time?]]></summary>
        <content type="html"><![CDATA[<p>What if your AI agent could improve itself? Most agent tutorials show a single loop — user asks a question, the agent calls some tools, returns an answer. But what happens when you need to systematically improve your agent's behavior over time?</p>
<p>In this post, we build a <strong>ResearchAgent</strong> that answers questions from an internal engineering knowledge base — and gets better at it automatically. The agent uses the Responses API agentic loop with <code>file_search</code> and client-side tools to research questions, and it owns its own system prompt. Every N calls, it benchmarks itself by using a different model to judge the results, and rewrites its own prompt via the Prompts API.</p>
<p>This is literally self-referential: <strong>a Llama Stack agent evaluating and improving itself</strong> using the Responses API, Prompts API, and Vector Stores as its toolkit.</p>
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="what-were-building">What We're Building<a href="https://llamastack.github.io/blog/building-agentic-flows-with-conversations-and-responses#what-were-building" class="hash-link" aria-label="Direct link to What We're Building" title="Direct link to What We're Building" translate="no">​</a></h2>
<p>A single <code>ResearchAgent</code> class that does two things:</p>
<ol>
<li class=""><strong>Research</strong> (agentic): Uses the Responses API <code>while True</code> loop with server-side <code>file_search</code> and client-side function tools (<code>read_local_file</code>, <code>index_document</code>, <code>list_local_files</code>). The agent decides what to search, discovers unindexed local files, reads them, indexes the relevant ones, and searches again with the enriched knowledge base.</li>
<li class=""><strong>Self-improvement</strong> (deterministic): Every N calls to <code>research()</code>, the agent runs <code>evaluate_self()</code> to benchmark against test cases and <code>improve_self()</code> to rewrite its own system prompt. This is a fixed sequence — no LLM-driven tool selection, just the agent measuring and improving its own performance.</li>
</ol>
<div class="language-text codeBlockContainer_Ckt0 theme-code-block" style="--prism-background-color:hsl(220, 13%, 18%);--prism-color:hsl(220, 14%, 71%)"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-text codeBlock_bY9V thin-scrollbar" style="background-color:hsl(220, 13%, 18%);color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">┌──────────────────────────────────────────────────────────┐</span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">│  ResearchAgent                                           │</span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">│                                                          │</span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">│  research(question)                                      │</span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">│    Responses API agentic loop (while True):              │</span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">│      Server-side: file_search → Vector Store             │</span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">│      Client-side: read_local_file, index_document,       │</span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">│                   list_local_files                       │</span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">│    Increments call counter; triggers self-improvement    │</span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">│    every N calls                                         │</span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">│                                                          │</span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">│  evaluate_self()                                         │</span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">│    Run all test cases → judge answers (Responses API)    │</span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">│    → log scores (SQLite ledger)                          │</span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">│                                                          │</span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">│  improve_self()                       
                   │</span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">│    Read feedback → propose new prompt (Responses API)    │</span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">│    → save new version (Prompts API)                      │</span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">└──────────────────────────────────────────────────────────┘</span><br></span></code></pre></div></div>
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="prerequisites">Prerequisites<a href="https://llamastack.github.io/blog/building-agentic-flows-with-conversations-and-responses#prerequisites" class="hash-link" aria-label="Direct link to Prerequisites" title="Direct link to Prerequisites" translate="no">​</a></h2>
<ul>
<li class=""><a href="https://ollama.com/" target="_blank" rel="noopener noreferrer" class="">Ollama</a> running locally with two models pulled: <code>llama3.1:8b</code> for the research agent and <code>gpt-oss:20b</code> as the judge</li>
<li class="">A Llama Stack server using the starter distribution, pointed at Ollama via the <code>OLLAMA_URL</code> environment variable</li>
<li class="">Python SDK: <code>uv pip install llama-stack-client</code></li>
</ul>
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="the-research-loop">The Research Loop<a href="https://llamastack.github.io/blog/building-agentic-flows-with-conversations-and-responses#the-research-loop" class="hash-link" aria-label="Direct link to The Research Loop" title="Direct link to The Research Loop" translate="no">​</a></h2>
<p>The research agent is the heart of the system — and the showcase for the Responses API agentic pattern. Unlike a simple single-call RAG agent, it has real decisions to make: the vector store might not have enough context, so the agent can discover local files, read them, index the relevant ones, and search again.</p>
<p>It has one server-side tool and three client-side function tools:</p>
<ul>
<li class=""><strong><code>file_search</code></strong> (server-side): Searches the vector store for relevant documents. The Responses API executes this automatically — no client code needed.</li>
<li class=""><strong><code>read_local_file(path)</code></strong>: Reads an unindexed local file (e.g., a newly written postmortem not yet in the knowledge base).</li>
<li class=""><strong><code>index_document(file_path)</code></strong>: Uploads a file to the vector store via the Files API and <code>vector_stores.files.create()</code>. This is the key insight: the agent actively curates the knowledge base.</li>
<li class=""><strong><code>list_local_files(directory)</code></strong>: Discovers available <code>.md</code> and <code>.txt</code> files in a directory.</li>
</ul>
<p>The internal <code>_run_query()</code> method is the standard Responses API agentic loop — keep calling <code>responses.create()</code> until the model stops emitting tool calls:</p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-background-color:hsl(220, 13%, 18%);--prism-color:hsl(220, 14%, 71%)"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="background-color:hsl(220, 13%, 18%);color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token keyword" style="color:hsl(286, 60%, 67%)">class</span><span class="token plain"> </span><span class="token class-name" style="color:hsl(29, 54%, 61%)">ResearchAgent</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">    </span><span class="token keyword" style="color:hsl(286, 60%, 67%)">def</span><span class="token plain"> </span><span class="token function" style="color:hsl(207, 82%, 66%)">__init__</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">(</span><span class="token plain">self</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"> client</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"> model</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"> vector_store_id</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"> prompt_id</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"> </span><span class="token operator" style="color:hsl(207, 82%, 66%)">**</span><span class="token plain">kwargs</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">)</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">        self</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">client </span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token plain"> client</span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">        self</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">model </span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token plain"> model</span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">        self</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">vector_store_id </span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token plain"> vector_store_id</span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">        self</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">prompt_id </span><span class="token operator" style="color:hsl(207, 
82%, 66%)">=</span><span class="token plain"> prompt_id  </span><span class="token comment" style="color:hsl(220, 10%, 40%)"># The agent owns its prompt</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">        self</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">_call_count </span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token plain"> </span><span class="token number" style="color:hsl(29, 54%, 61%)">0</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">        self</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">_tools </span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token plain"> </span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">{</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">            </span><span class="token string" style="color:hsl(95, 38%, 62%)">"read_local_file"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">:</span><span class="token plain"> self</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">_read_local_file</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">            </span><span class="token string" style="color:hsl(95, 38%, 62%)">"index_document"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">:</span><span class="token plain"> self</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">_index_document</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">            </span><span class="token string" style="color:hsl(95, 38%, 62%)">"list_local_files"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">:</span><span class="token plain"> self</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">_list_local_files</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">        </span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">}</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">        </span><span class="token comment" style="color:hsl(220, 10%, 40%)"># Also accepts: judge_model, ledger, test_cases, optimize_every</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain" 
style="display:inline-block"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">    </span><span class="token keyword" style="color:hsl(286, 60%, 67%)">def</span><span class="token plain"> </span><span class="token function" style="color:hsl(207, 82%, 66%)">_run_query</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">(</span><span class="token plain">self</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"> question</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"> system_prompt</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">)</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">        </span><span class="token triple-quoted-string string" style="color:hsl(95, 38%, 62%)">"""Agentic loop: search, read local files, index, repeat."""</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">        inputs </span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token plain"> question</span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">        tools </span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token plain"> self</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">_tool_schemas</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">(</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">        </span><span class="token keyword" style="color:hsl(286, 60%, 67%)">while</span><span class="token plain"> </span><span class="token boolean" style="color:hsl(29, 54%, 61%)">True</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">            response </span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token plain"> self</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">client</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">responses</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">create</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">(</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">                model</span><span class="token operator" 
style="color:hsl(207, 82%, 66%)">=</span><span class="token plain">self</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">model</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">                </span><span class="token builtin" style="color:hsl(95, 38%, 62%)">input</span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token plain">inputs</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">                instructions</span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token plain">system_prompt</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">                tools</span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token plain">tools</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">                stream</span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token boolean" style="color:hsl(29, 54%, 61%)">False</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">            </span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">            </span><span class="token comment" style="color:hsl(220, 10%, 40%)"># file_search is handled server-side; collect client-side calls</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">            function_calls </span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token plain"> </span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">[</span><span class="token plain">o </span><span class="token keyword" style="color:hsl(286, 60%, 67%)">for</span><span class="token plain"> o </span><span class="token keyword" style="color:hsl(286, 60%, 67%)">in</span><span class="token plain"> response</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">output </span><span class="token keyword" style="color:hsl(286, 60%, 67%)">if</span><span class="token plain"> o</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span 
class="token builtin" style="color:hsl(95, 38%, 62%)">type</span><span class="token plain"> </span><span class="token operator" style="color:hsl(207, 82%, 66%)">==</span><span class="token plain"> </span><span class="token string" style="color:hsl(95, 38%, 62%)">"function_call"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">]</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">            </span><span class="token keyword" style="color:hsl(286, 60%, 67%)">if</span><span class="token plain"> </span><span class="token keyword" style="color:hsl(286, 60%, 67%)">not</span><span class="token plain"> function_calls</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">                </span><span class="token keyword" style="color:hsl(286, 60%, 67%)">return</span><span class="token plain"> response</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">output_text  </span><span class="token comment" style="color:hsl(220, 10%, 40%)"># Done — no more tool calls</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">            </span><span class="token comment" style="color:hsl(220, 10%, 40%)"># Execute each function call and feed results back</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">            inputs </span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token plain"> </span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">[</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">]</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">            </span><span class="token keyword" style="color:hsl(286, 60%, 67%)">for</span><span class="token plain"> fc </span><span class="token keyword" style="color:hsl(286, 60%, 67%)">in</span><span class="token plain"> function_calls</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">                result </span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token plain"> self</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">_tools</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">[</span><span class="token plain">fc</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">name</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">]</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">(</span><span class="token 
operator" style="color:hsl(207, 82%, 66%)">**</span><span class="token plain">json</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">loads</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">(</span><span class="token plain">fc</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">arguments</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">)</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">                inputs</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">append</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">(</span><span class="token plain">fc</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">                inputs</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">append</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">(</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">                    </span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">{</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">                        </span><span class="token string" style="color:hsl(95, 38%, 62%)">"type"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">:</span><span class="token plain"> </span><span class="token string" style="color:hsl(95, 38%, 62%)">"function_call_output"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">                        </span><span class="token string" style="color:hsl(95, 38%, 62%)">"call_id"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">:</span><span class="token plain"> fc</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">call_id</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">                        </span><span class="token string" style="color:hsl(95, 38%, 62%)">"output"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">:</span><span class="token plain"> result</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">                    </span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">}</span><span class="token 
plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">                </span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">)</span><br></span></code></pre></div></div>
<p>The public <code>research()</code> method reads the agent's current prompt, runs the agentic loop, and increments a counter. Every N calls, it triggers self-improvement:</p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-background-color:hsl(220, 13%, 18%);--prism-color:hsl(220, 14%, 71%)"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="background-color:hsl(220, 13%, 18%);color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token keyword" style="color:hsl(286, 60%, 67%)">class</span><span class="token plain"> </span><span class="token class-name" style="color:hsl(29, 54%, 61%)">ResearchAgent</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">    </span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">    </span><span class="token keyword" style="color:hsl(286, 60%, 67%)">def</span><span class="token plain"> </span><span class="token function" style="color:hsl(207, 82%, 66%)">research</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">(</span><span class="token plain">self</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"> question</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">)</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">        </span><span class="token triple-quoted-string string" style="color:hsl(95, 38%, 62%)">"""Answer a question.  
Automatically self-improves every N calls."""</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">        current </span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token plain"> self</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">client</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">prompts</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">retrieve</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">(</span><span class="token plain">self</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">prompt_id</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">        answer </span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token plain"> self</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">_run_query</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">(</span><span class="token plain">question</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"> current</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">prompt</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">        self</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">_call_count </span><span class="token operator" style="color:hsl(207, 82%, 66%)">+=</span><span class="token plain"> </span><span class="token number" style="color:hsl(29, 54%, 61%)">1</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">        </span><span class="token keyword" style="color:hsl(286, 60%, 67%)">if</span><span class="token plain"> self</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">test_cases </span><span class="token keyword" style="color:hsl(286, 60%, 67%)">and</span><span class="token plain"> self</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">_call_count </span><span class="token operator" style="color:hsl(207, 82%, 66%)">%</span><span class="token plain"> self</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">optimize_every </span><span class="token operator" style="color:hsl(207, 82%, 66%)">==</span><span class="token plain"> </span><span class="token number" style="color:hsl(29, 54%, 61%)">0</span><span class="token punctuation" style="color:hsl(220, 14%, 
71%)">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">            self</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">evaluate_self</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">(</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">            self</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">improve_self</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">(</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">        </span><span class="token keyword" style="color:hsl(286, 60%, 67%)">return</span><span class="token plain"> answer</span><br></span></code></pre></div></div>
<p>In a typical call, the agent searches the vector store via <code>file_search</code> (handled server-side). If the retrieved context is insufficient — say, a question about a recent outage whose postmortem hasn't been indexed yet — the agent calls <code>list_local_files</code> to discover available documents, <code>read_local_file</code> to inspect the relevant one, and <code>index_document</code> to add it to the vector store. Then it searches again with the enriched store and writes its final answer.</p>
<p>The <code>index_document</code> tool is worth highlighting — it's the agent actively curating its own knowledge base:</p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-background-color:hsl(220, 13%, 18%);--prism-color:hsl(220, 14%, 71%)"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="background-color:hsl(220, 13%, 18%);color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token keyword" style="color:hsl(286, 60%, 67%)">class</span><span class="token plain"> </span><span class="token class-name" style="color:hsl(29, 54%, 61%)">ResearchAgent</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">    </span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">    </span><span class="token keyword" style="color:hsl(286, 60%, 67%)">def</span><span class="token plain"> </span><span class="token function" style="color:hsl(207, 82%, 66%)">_index_document</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">(</span><span class="token plain">self</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"> file_path</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">)</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">        </span><span class="token triple-quoted-string string" style="color:hsl(95, 38%, 62%)">"""Upload a local file to the vector store so it becomes searchable."""</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">        </span><span class="token builtin" style="color:hsl(95, 38%, 62%)">file</span><span class="token plain"> </span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token plain"> self</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">client</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">files</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">create</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">(</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">            </span><span class="token builtin" style="color:hsl(95, 38%, 62%)">file</span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token 
builtin" style="color:hsl(95, 38%, 62%)">open</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">(</span><span class="token plain">file_path</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"> </span><span class="token string" style="color:hsl(95, 38%, 62%)">"rb"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">)</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"> purpose</span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token string" style="color:hsl(95, 38%, 62%)">"assistants"</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">        </span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">        attach </span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token plain"> self</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">client</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">vector_stores</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">files</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">create</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">(</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">            vector_store_id</span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token plain">self</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">vector_store_id</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"> file_id</span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token builtin" style="color:hsl(95, 38%, 62%)">file</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token builtin" style="color:hsl(95, 38%, 62%)">id</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">        </span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">        </span><span class="token keyword" style="color:hsl(286, 60%, 67%)">while</span><span class="token plain"> attach</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">status </span><span class="token operator" style="color:hsl(207, 82%, 66%)">==</span><span class="token plain"> </span><span class="token string" style="color:hsl(95, 38%, 62%)">"in_progress"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">:</span><span class="token plain"></span><br></span><span 
class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">            time</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">sleep</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">(</span><span class="token number" style="color:hsl(29, 54%, 61%)">0.5</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">            attach </span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token plain"> self</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">client</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">vector_stores</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">files</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">retrieve</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">(</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">                vector_store_id</span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token plain">self</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">vector_store_id</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"> file_id</span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token builtin" style="color:hsl(95, 38%, 62%)">file</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token builtin" style="color:hsl(95, 38%, 62%)">id</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">            </span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">        </span><span class="token keyword" style="color:hsl(286, 60%, 67%)">return</span><span class="token plain"> </span><span class="token string-interpolation string" style="color:hsl(95, 38%, 62%)">f"Indexed </span><span class="token string-interpolation interpolation punctuation" style="color:hsl(220, 14%, 71%)">{</span><span class="token string-interpolation interpolation">file_path</span><span class="token string-interpolation interpolation punctuation" style="color:hsl(220, 14%, 71%)">}</span><span class="token string-interpolation string" style="color:hsl(95, 38%, 62%)"> (file_id=</span><span class="token string-interpolation interpolation punctuation" style="color:hsl(220, 14%, 71%)">{</span><span class="token string-interpolation interpolation builtin" style="color:hsl(95, 38%, 62%)">file</span><span class="token string-interpolation interpolation punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token string-interpolation interpolation builtin" style="color:hsl(95, 
38%, 62%)">id</span><span class="token string-interpolation interpolation punctuation" style="color:hsl(220, 14%, 71%)">}</span><span class="token string-interpolation string" style="color:hsl(95, 38%, 62%)">, status=</span><span class="token string-interpolation interpolation punctuation" style="color:hsl(220, 14%, 71%)">{</span><span class="token string-interpolation interpolation">attach</span><span class="token string-interpolation interpolation punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token string-interpolation interpolation">status</span><span class="token string-interpolation interpolation punctuation" style="color:hsl(220, 14%, 71%)">}</span><span class="token string-interpolation string" style="color:hsl(95, 38%, 62%)">)"</span><br></span></code></pre></div></div>
<p>This uses the Files API to upload the document and <code>vector_stores.files.create()</code> to attach it to the store. The method then polls until indexing completes; once it does, the file is searchable by <code>file_search</code> in subsequent turns of the same query, or in any future query.</p>
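<p>As a reminder of how a method like this reaches the model: it has to be declared as a function tool on the Responses call. The sketch below is a hedged guess at that declaration, following the OpenAI-compatible function-tool shape; the schema the post's agent actually registers may differ:</p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar"><code class="codeBlockLines_e6Vv"># Illustrative declaration only; the post's actual tool schemas are
# defined elsewhere and may differ.
INDEX_DOCUMENT_TOOL = {
    "type": "function",
    "name": "index_document",
    "description": "Upload a local file to the vector store so it becomes searchable.",
    "parameters": {
        "type": "object",
        "properties": {
            "file_path": {
                "type": "string",
                "description": "Path of the local document to index.",
            }
        },
        "required": ["file_path"],
    },
}
</code></pre></div></div>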
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="self-improvement">Self-Improvement<a href="https://llamastack.github.io/blog/building-agentic-flows-with-conversations-and-responses#self-improvement" class="hash-link" aria-label="Direct link to Self-Improvement" title="Direct link to Self-Improvement" translate="no">​</a></h2>
<p>The self-improvement cycle is where the agent benchmarks itself, then rewrites its own prompt based on the feedback.</p>
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="evaluate_self">evaluate_self<a href="https://llamastack.github.io/blog/building-agentic-flows-with-conversations-and-responses#evaluate_self" class="hash-link" aria-label="Direct link to evaluate_self" title="Direct link to evaluate_self" translate="no">​</a></h3>
<p><code>evaluate_self</code> runs the agent on every test case using its current system prompt, judges each answer with the judge model, and logs the scores and the judge's reasoning to the ledger:</p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-background-color:hsl(220, 13%, 18%);--prism-color:hsl(220, 14%, 71%)"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="background-color:hsl(220, 13%, 18%);color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token keyword" style="color:hsl(286, 60%, 67%)">class</span><span class="token plain"> </span><span class="token class-name" style="color:hsl(29, 54%, 61%)">ResearchAgent</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">    </span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">    </span><span class="token keyword" style="color:hsl(286, 60%, 67%)">def</span><span class="token plain"> </span><span class="token function" style="color:hsl(207, 82%, 66%)">evaluate_self</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">(</span><span class="token plain">self</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">)</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">        </span><span class="token triple-quoted-string string" style="color:hsl(95, 38%, 62%)">"""Benchmark against test cases and log scores."""</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">        current </span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token plain"> self</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">client</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">prompts</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">retrieve</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">(</span><span class="token plain">self</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">prompt_id</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">        results </span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token plain"> </span><span class="token punctuation" style="color:hsl(220, 14%, 
71%)">[</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">]</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">        </span><span class="token keyword" style="color:hsl(286, 60%, 67%)">for</span><span class="token plain"> tc </span><span class="token keyword" style="color:hsl(286, 60%, 67%)">in</span><span class="token plain"> self</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">test_cases</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">            answer </span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token plain"> self</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">_run_query</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">(</span><span class="token plain">tc</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">[</span><span class="token string" style="color:hsl(95, 38%, 62%)">"question"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">]</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"> current</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">prompt</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">            judgment </span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token plain"> self</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">client</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">responses</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">create</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">(</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">                model</span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token plain">self</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">judge_model</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">                </span><span class="token builtin" style="color:hsl(95, 38%, 62%)">input</span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">(</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">                    </span><span class="token string-interpolation 
string" style="color:hsl(95, 38%, 62%)">f"Score the following answer on a scale of 0.0 to 1.0.\n\n"</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">                    </span><span class="token string-interpolation string" style="color:hsl(95, 38%, 62%)">f"Question: </span><span class="token string-interpolation interpolation punctuation" style="color:hsl(220, 14%, 71%)">{</span><span class="token string-interpolation interpolation">tc</span><span class="token string-interpolation interpolation punctuation" style="color:hsl(220, 14%, 71%)">[</span><span class="token string-interpolation interpolation string" style="color:hsl(95, 38%, 62%)">'question'</span><span class="token string-interpolation interpolation punctuation" style="color:hsl(220, 14%, 71%)">]</span><span class="token string-interpolation interpolation punctuation" style="color:hsl(220, 14%, 71%)">}</span><span class="token string-interpolation string" style="color:hsl(95, 38%, 62%)">\n"</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">                    </span><span class="token string-interpolation string" style="color:hsl(95, 38%, 62%)">f"Expected: </span><span class="token string-interpolation interpolation punctuation" style="color:hsl(220, 14%, 71%)">{</span><span class="token string-interpolation interpolation">tc</span><span class="token string-interpolation interpolation punctuation" style="color:hsl(220, 14%, 71%)">[</span><span class="token string-interpolation interpolation string" style="color:hsl(95, 38%, 62%)">'expected'</span><span class="token string-interpolation interpolation punctuation" style="color:hsl(220, 14%, 71%)">]</span><span class="token string-interpolation interpolation punctuation" style="color:hsl(220, 14%, 71%)">}</span><span class="token string-interpolation string" style="color:hsl(95, 38%, 62%)">\nActual: </span><span class="token string-interpolation interpolation punctuation" style="color:hsl(220, 14%, 71%)">{</span><span class="token string-interpolation interpolation">answer</span><span class="token string-interpolation interpolation punctuation" style="color:hsl(220, 14%, 71%)">}</span><span class="token string-interpolation string" style="color:hsl(95, 38%, 62%)">\n\n"</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">                    </span><span class="token string-interpolation string" style="color:hsl(95, 38%, 62%)">f'Respond with JSON: {{"score": &lt;float&gt;, "reasoning": "..."}}'</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">                </span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">)</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">                stream</span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token boolean" style="color:hsl(29, 54%, 61%)">False</span><span class="token punctuation" style="color:hsl(220, 14%, 
71%)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">            </span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">            score_data </span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token plain"> json</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">loads</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">(</span><span class="token plain">judgment</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">output_text</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">            results</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">append</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">(</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">{</span><span class="token operator" style="color:hsl(207, 82%, 66%)">**</span><span class="token plain">tc</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"> </span><span class="token string" style="color:hsl(95, 38%, 62%)">"actual"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">:</span><span class="token plain"> answer</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"> </span><span class="token operator" style="color:hsl(207, 82%, 66%)">**</span><span class="token plain">score_data</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">}</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">        avg_score </span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token plain"> </span><span class="token builtin" style="color:hsl(95, 38%, 62%)">sum</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">(</span><span class="token plain">r</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">[</span><span class="token string" style="color:hsl(95, 38%, 62%)">"score"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">]</span><span class="token plain"> </span><span class="token keyword" style="color:hsl(286, 60%, 67%)">for</span><span class="token plain"> r </span><span class="token keyword" style="color:hsl(286, 60%, 67%)">in</span><span class="token plain"> results</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">)</span><span class="token plain"> </span><span class="token operator" style="color:hsl(207, 82%, 66%)">/</span><span class="token plain"> </span><span class="token builtin" 
style="color:hsl(95, 38%, 62%)">len</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">(</span><span class="token plain">results</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">        self</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">ledger</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">log</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">(</span><span class="token plain">self</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">prompt_id</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"> current</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">version</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"> avg_score</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"> feedback</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">        </span><span class="token keyword" style="color:hsl(286, 60%, 67%)">return</span><span class="token plain"> </span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">{</span><span class="token string" style="color:hsl(95, 38%, 62%)">"results"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">:</span><span class="token plain"> results</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"> </span><span class="token string" style="color:hsl(95, 38%, 62%)">"average_score"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">:</span><span class="token plain"> avg_score</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"> </span><span class="token string" style="color:hsl(95, 38%, 62%)">"feedback"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">:</span><span class="token plain"> feedback</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">}</span><br></span></code></pre></div></div>
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="improve_self">improve_self<a href="https://llamastack.github.io/blog/building-agentic-flows-with-conversations-and-responses#improve_self" class="hash-link" aria-label="Direct link to improve_self" title="Direct link to improve_self" translate="no">​</a></h3>
<p><code>improve_self</code> reads the latest evaluation feedback from the ledger and uses the judge model to generate an improved system prompt, then saves it via the Prompts API:</p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-background-color:hsl(220, 13%, 18%);--prism-color:hsl(220, 14%, 71%)"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="background-color:hsl(220, 13%, 18%);color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token keyword" style="color:hsl(286, 60%, 67%)">class</span><span class="token plain"> </span><span class="token class-name" style="color:hsl(29, 54%, 61%)">ResearchAgent</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">    </span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">    </span><span class="token keyword" style="color:hsl(286, 60%, 67%)">def</span><span class="token plain"> </span><span class="token function" style="color:hsl(207, 82%, 66%)">improve_self</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">(</span><span class="token plain">self</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">)</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">        </span><span class="token triple-quoted-string string" style="color:hsl(95, 38%, 62%)">"""Propose and save an improved system prompt."""</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">        history </span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token plain"> self</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">ledger</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">history</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">(</span><span class="token plain">self</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">prompt_id</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">        latest </span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token plain"> history</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">[</span><span class="token operator" style="color:hsl(207, 82%, 66%)">-</span><span class="token number" 
style="color:hsl(29, 54%, 61%)">1</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">]</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">        current </span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token plain"> self</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">client</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">prompts</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">retrieve</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">(</span><span class="token plain">self</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">prompt_id</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">        response </span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token plain"> self</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">client</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">responses</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">create</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">(</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">            model</span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token plain">self</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">judge_model</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">            </span><span class="token builtin" style="color:hsl(95, 38%, 62%)">input</span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">(</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">                </span><span class="token string-interpolation string" style="color:hsl(95, 38%, 62%)">f"Improve this research agent's system prompt based on feedback.\n\n"</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">                </span><span class="token string-interpolation string" style="color:hsl(95, 38%, 62%)">f"Current prompt:\n</span><span class="token string-interpolation interpolation punctuation" style="color:hsl(220, 14%, 
71%)">{</span><span class="token string-interpolation interpolation">current</span><span class="token string-interpolation interpolation punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token string-interpolation interpolation">prompt</span><span class="token string-interpolation interpolation punctuation" style="color:hsl(220, 14%, 71%)">}</span><span class="token string-interpolation string" style="color:hsl(95, 38%, 62%)">\n\n"</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">                </span><span class="token string-interpolation string" style="color:hsl(95, 38%, 62%)">f"Feedback:\n</span><span class="token string-interpolation interpolation punctuation" style="color:hsl(220, 14%, 71%)">{</span><span class="token string-interpolation interpolation">latest</span><span class="token string-interpolation interpolation punctuation" style="color:hsl(220, 14%, 71%)">[</span><span class="token string-interpolation interpolation string" style="color:hsl(95, 38%, 62%)">'reasoning'</span><span class="token string-interpolation interpolation punctuation" style="color:hsl(220, 14%, 71%)">]</span><span class="token string-interpolation interpolation punctuation" style="color:hsl(220, 14%, 71%)">}</span><span class="token string-interpolation string" style="color:hsl(95, 38%, 62%)">\n\n"</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">                </span><span class="token string-interpolation string" style="color:hsl(95, 38%, 62%)">f"Return ONLY the improved prompt text."</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">            </span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">)</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">            stream</span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token boolean" style="color:hsl(29, 54%, 61%)">False</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">        </span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">        new_prompt </span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token plain"> response</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">output_text</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">strip</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">(</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px 
rgba(0, 0, 0, 0.3)"><span class="token plain">        self</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">client</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">prompts</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">update</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">(</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">            self</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">prompt_id</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"> prompt</span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token plain">new_prompt</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"> version</span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token plain">current</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">version</span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">        </span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">)</span><br></span></code></pre></div></div>
<p>The judge model does double duty — scoring answers <em>and</em> proposing improvements based on its own feedback. The Prompts API auto-increments versions on each <code>update()</code>, and the <code>version</code> parameter provides optimistic locking so concurrent experiments don't silently overwrite each other.</p>
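<p>Concretely, that means an update carrying a stale version should fail rather than win. A hedged illustration, reusing the <code>client</code> and <code>prompt_id</code> names from this post (the concrete error raised on a version conflict depends on the client library):</p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar"><code class="codeBlockLines_e6Vv"># Illustration of the optimistic-locking behavior described above.
current = client.prompts.retrieve(prompt_id)  # say this is version 3

# If another experiment updates the prompt first (bumping it to version 4),
# sending our stale version should be rejected instead of silently
# overwriting the newer prompt. improved_text is a placeholder here.
try:
    client.prompts.update(prompt_id, prompt=improved_text, version=current.version)
except Exception:  # the exact conflict error depends on the client library
    current = client.prompts.retrieve(prompt_id)  # re-read the latest, then retry
</code></pre></div></div>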
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="optimize">optimize<a href="https://llamastack.github.io/blog/building-agentic-flows-with-conversations-and-responses#optimize" class="hash-link" aria-label="Direct link to optimize" title="Direct link to optimize" translate="no">​</a></h3>
<p>For initial tuning (before the agent starts serving real queries), <code>optimize</code> runs the evaluate/improve cycle in a <code>for</code> loop:</p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-background-color:hsl(220, 13%, 18%);--prism-color:hsl(220, 14%, 71%)"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="background-color:hsl(220, 13%, 18%);color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token keyword" style="color:hsl(286, 60%, 67%)">class</span><span class="token plain"> </span><span class="token class-name" style="color:hsl(29, 54%, 61%)">ResearchAgent</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">    </span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">    </span><span class="token keyword" style="color:hsl(286, 60%, 67%)">def</span><span class="token plain"> </span><span class="token function" style="color:hsl(207, 82%, 66%)">optimize</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">(</span><span class="token plain">self</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"> max_iterations</span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token number" style="color:hsl(29, 54%, 61%)">5</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">)</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">        </span><span class="token triple-quoted-string string" style="color:hsl(95, 38%, 62%)">"""Run the evaluate/improve cycle for N iterations."""</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">        </span><span class="token keyword" style="color:hsl(286, 60%, 67%)">for</span><span class="token plain"> iteration </span><span class="token keyword" style="color:hsl(286, 60%, 67%)">in</span><span class="token plain"> </span><span class="token builtin" style="color:hsl(95, 38%, 62%)">range</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">(</span><span class="token plain">max_iterations</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">)</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">            self</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token 
plain">evaluate_self</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">(</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">            self</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">improve_self</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">(</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">)</span><br></span></code></pre></div></div>
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="running-it">Running It<a href="https://llamastack.github.io/blog/building-agentic-flows-with-conversations-and-responses#running-it" class="hash-link" aria-label="Direct link to Running It" title="Direct link to Running It" translate="no">​</a></h2>
<p>First, pull the models and start Ollama, then run the Llama Stack starter distribution pointing at it:</p>
<div class="language-bash codeBlockContainer_Ckt0 theme-code-block" style="--prism-background-color:hsl(220, 13%, 18%);--prism-color:hsl(220, 14%, 71%)"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-bash codeBlock_bY9V thin-scrollbar" style="background-color:hsl(220, 13%, 18%);color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">ollama pull llama3.1:8b</span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">ollama pull gpt-oss:20b</span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain"></span><span class="token assign-left variable" style="color:hsl(207, 82%, 66%)">OLLAMA_URL</span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token plain">http://localhost:11434/v1 uv run </span><span class="token parameter variable" style="color:hsl(207, 82%, 66%)">--with</span><span class="token plain"> llama-stack llama stack run starter</span><br></span></code></pre></div></div>
<p>The <code>OLLAMA_URL</code> environment variable tells the starter distribution to use Ollama as its inference provider. The server starts on <code>http://localhost:8321</code> by default.</p>
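<p>Before creating the agent, a quick smoke test confirms the server is reachable and the Ollama models are registered (exact identifiers may vary by distribution):</p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar"><code class="codeBlockLines_e6Vv">from llama_stack_client import LlamaStackClient

client = LlamaStackClient(base_url="http://localhost:8321")

# List the models the stack has registered; the two models pulled above
# should appear among them.
for model in client.models.list():
    print(model.identifier)
</code></pre></div></div>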
<p>Then create the agent with some engineering documents. Some docs are indexed in the vector store up front; others live in a local directory for the agent to discover and index on demand:</p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-background-color:hsl(220, 13%, 18%);--prism-color:hsl(220, 14%, 71%)"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="background-color:hsl(220, 13%, 18%);color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token keyword" style="color:hsl(286, 60%, 67%)">from</span><span class="token plain"> llama_stack_client </span><span class="token keyword" style="color:hsl(286, 60%, 67%)">import</span><span class="token plain"> LlamaStackClient</span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">client </span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token plain"> LlamaStackClient</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">(</span><span class="token plain">base_url</span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token string" style="color:hsl(95, 38%, 62%)">"http://localhost:8321"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain"></span><span class="token comment" style="color:hsl(220, 10%, 40%)"># Create the initial system prompt</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">initial </span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token plain"> client</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">prompts</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">create</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">(</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">    prompt</span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token string" style="color:hsl(95, 38%, 62%)">"You are a helpful assistant. 
Answer questions based on the provided context."</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain"></span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain"></span><span class="token comment" style="color:hsl(220, 10%, 40%)"># Create the self-improving research agent</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">agent </span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token plain"> ResearchAgent</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">from_files</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">(</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">    client</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">    model</span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token string" style="color:hsl(95, 38%, 62%)">"ollama/llama3.1:8b"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">    name</span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token string" style="color:hsl(95, 38%, 62%)">"engineering-kb"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">    file_paths</span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">[</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">        </span><span class="token string" style="color:hsl(95, 38%, 62%)">"docs/blog/building-agentic-flows/design/user_service_v2.md"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">        </span><span class="token string" style="color:hsl(95, 38%, 62%)">"docs/blog/building-agentic-flows/runbooks/deployment_rollback.md"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"></span><br></span><span class="token-line" 
style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">    </span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">]</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">    prompt_id</span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token plain">initial</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">prompt_id</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">    local_docs_dir</span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token string" style="color:hsl(95, 38%, 62%)">"docs/blog/building-agentic-flows/postmortems"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">    judge_model</span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token string" style="color:hsl(95, 38%, 62%)">"ollama/gpt-oss:20b"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">    ledger</span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token plain">ScoreLedger</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">(</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">)</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">    test_cases</span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">[</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">        </span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">{</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">            </span><span class="token string" style="color:hsl(95, 38%, 62%)">"question"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">:</span><span class="token plain"> </span><span class="token string" style="color:hsl(95, 38%, 62%)">"What is the deployment rollback procedure?"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">            </span><span class="token string" style="color:hsl(95, 38%, 62%)">"expected"</span><span class="token punctuation" 
style="color:hsl(220, 14%, 71%)">:</span><span class="token plain"> </span><span class="token string" style="color:hsl(95, 38%, 62%)">"Revert the Kubernetes deployment to the previous revision using kubectl rollout undo"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">        </span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">}</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">        </span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">{</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">            </span><span class="token string" style="color:hsl(95, 38%, 62%)">"question"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">:</span><span class="token plain"> </span><span class="token string" style="color:hsl(95, 38%, 62%)">"What authentication method does the user service use?"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">            </span><span class="token string" style="color:hsl(95, 38%, 62%)">"expected"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">:</span><span class="token plain"> </span><span class="token string" style="color:hsl(95, 38%, 62%)">"JWT tokens issued by the auth gateway with RS256 signing"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">        </span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">}</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">        </span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">{</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">            </span><span class="token string" style="color:hsl(95, 38%, 62%)">"question"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">:</span><span class="token plain"> </span><span class="token string" style="color:hsl(95, 38%, 62%)">"What was the root cause of the 2025-02 checkout outage?"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">            </span><span class="token string" style="color:hsl(95, 38%, 62%)">"expected"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">:</span><span class="token plain"> </span><span class="token string" 
style="color:hsl(95, 38%, 62%)">"Connection pool exhaustion in the payments service due to missing timeout configuration"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">        </span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">}</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">    </span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">]</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">    optimize_every</span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token number" style="color:hsl(29, 54%, 61%)">10</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain"></span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain"></span><span class="token comment" style="color:hsl(220, 10%, 40%)"># Run an initial optimization pass</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">agent</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">optimize</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">(</span><span class="token plain">max_iterations</span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token number" style="color:hsl(29, 54%, 61%)">5</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain"></span><span class="token comment" style="color:hsl(220, 10%, 40%)"># Show the best prompt</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">result </span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token plain"> agent</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">best_prompt</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">(</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">)</span><span 
class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain"></span><span class="token keyword" style="color:hsl(286, 60%, 67%)">print</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">(</span><span class="token string-interpolation string" style="color:hsl(95, 38%, 62%)">f"Best prompt (v</span><span class="token string-interpolation interpolation punctuation" style="color:hsl(220, 14%, 71%)">{</span><span class="token string-interpolation interpolation">result</span><span class="token string-interpolation interpolation punctuation" style="color:hsl(220, 14%, 71%)">[</span><span class="token string-interpolation interpolation string" style="color:hsl(95, 38%, 62%)">'version'</span><span class="token string-interpolation interpolation punctuation" style="color:hsl(220, 14%, 71%)">]</span><span class="token string-interpolation interpolation punctuation" style="color:hsl(220, 14%, 71%)">}</span><span class="token string-interpolation string" style="color:hsl(95, 38%, 62%)">, score=</span><span class="token string-interpolation interpolation punctuation" style="color:hsl(220, 14%, 71%)">{</span><span class="token string-interpolation interpolation">result</span><span class="token string-interpolation interpolation punctuation" style="color:hsl(220, 14%, 71%)">[</span><span class="token string-interpolation interpolation string" style="color:hsl(95, 38%, 62%)">'score'</span><span class="token string-interpolation interpolation punctuation" style="color:hsl(220, 14%, 71%)">]</span><span class="token string-interpolation interpolation punctuation" style="color:hsl(220, 14%, 71%)">:</span><span class="token string-interpolation interpolation format-spec">.2f</span><span class="token string-interpolation interpolation punctuation" style="color:hsl(220, 14%, 71%)">}</span><span class="token string-interpolation string" style="color:hsl(95, 38%, 62%)">):"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain"></span><span class="token keyword" style="color:hsl(286, 60%, 67%)">print</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">(</span><span class="token string-interpolation string" style="color:hsl(95, 38%, 62%)">f"  </span><span class="token string-interpolation interpolation punctuation" style="color:hsl(220, 14%, 71%)">{</span><span class="token string-interpolation interpolation">result</span><span class="token string-interpolation interpolation punctuation" style="color:hsl(220, 14%, 71%)">[</span><span class="token string-interpolation interpolation string" style="color:hsl(95, 38%, 62%)">'prompt'</span><span class="token string-interpolation interpolation punctuation" style="color:hsl(220, 14%, 71%)">]</span><span class="token string-interpolation interpolation punctuation" style="color:hsl(220, 14%, 71%)">}</span><span class="token string-interpolation string" style="color:hsl(95, 38%, 62%)">"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 
0, 0.3)"><span class="token plain"></span><span class="token comment" style="color:hsl(220, 10%, 40%)"># Normal usage — the agent self-improves every 10 research() calls</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">answer </span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token plain"> agent</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">research</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">(</span><span class="token string" style="color:hsl(95, 38%, 62%)">"What is the deployment rollback procedure?"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain"></span><span class="token keyword" style="color:hsl(286, 60%, 67%)">print</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">(</span><span class="token string-interpolation string" style="color:hsl(95, 38%, 62%)">f"Agent says: </span><span class="token string-interpolation interpolation punctuation" style="color:hsl(220, 14%, 71%)">{</span><span class="token string-interpolation interpolation">answer</span><span class="token string-interpolation interpolation punctuation" style="color:hsl(220, 14%, 71%)">}</span><span class="token string-interpolation string" style="color:hsl(95, 38%, 62%)">"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">)</span><br></span></code></pre></div></div>
<p>The full implementation with tool schema generation and all supporting code is available at <a href="https://llamastack.github.io/assets/files/self_improving_agent-fba87d1a8b252b7b662cae8f19ec48fe.py" target="_blank" class="">self_improving_agent.py</a>.</p>
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="how-it-works-under-the-hood">How It Works Under the Hood<a href="https://llamastack.github.io/blog/building-agentic-flows-with-conversations-and-responses#how-it-works-under-the-hood" class="hash-link" aria-label="Direct link to How It Works Under the Hood" title="Direct link to How It Works Under the Hood" translate="no">​</a></h2>
<p>The agent uses <em>both</em> kinds of Responses API tools for research:</p>
<ul>
<li class=""><strong>Server-side tools</strong> like <code>file_search</code> are executed automatically — the Responses API searches the vector store, retrieves relevant chunks, and feeds them to the model without any client code. This is what makes knowledge base search a single API call.</li>
<li class=""><strong>Client-side function tools</strong> (<code>read_local_file</code>, <code>index_document</code>, <code>list_local_files</code>) return tool call objects for you to execute. The <code>while True</code> loop dispatches these, and the results feed back into the next <code>responses.create()</code> call. This is what lets the agent actively curate its knowledge base.</li>
</ul>
<p>The agent combines both in a single loop: <code>file_search</code> results come back automatically within the response, while function calls need client-side execution. The model sees both sources of information and decides what to do next.</p>
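<p>Condensed into a sketch, that loop looks roughly like the method below. The names (<code>_run_agent_loop</code>, <code>execute_function_call</code>, <code>self.tools</code>) are illustrative stand-ins for the full implementation linked above:</p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar"><code class="codeBlockLines_e6Vv">def _run_agent_loop(self, question: str) -&gt; str:
    """Sketch of the combined tool loop; names are illustrative."""
    response = self.client.responses.create(
        model=self.model,
        instructions=self.prompt_text,
        input=question,
        tools=self.tools,  # file_search (server-side) + function tools (client-side)
    )
    while True:
        # file_search results are already folded into the response by the
        # server; only client-side function calls need dispatching here.
        calls = [item for item in response.output if item.type == "function_call"]
        if not calls:
            break
        results = [
            {
                "type": "function_call_output",
                "call_id": call.call_id,
                "output": self.execute_function_call(call),  # illustrative dispatcher
            }
            for call in calls
        ]
        # Feed the tool results back in, chaining off the previous response.
        response = self.client.responses.create(
            model=self.model,
            previous_response_id=response.id,
            input=results,
            tools=self.tools,
        )
    return response.output_text
</code></pre></div></div>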
<p>The self-improvement methods don't need any of this machinery. They call <code>responses.create()</code> directly for judging and prompt generation — no tool calling, no agentic loop. The Prompts API stores versioned prompt text with optimistic locking, and the SQLite ledger tracks how well each version performed. The <code>research()</code> counter ties it all together: the agent serves queries normally, and every N calls it pauses to evaluate and improve itself.</p>
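<p>The trigger itself can be as small as a counter check inside <code>research()</code>. A minimal sketch, again with illustrative attribute names:</p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar"><code class="codeBlockLines_e6Vv">def research(self, question: str) -&gt; str:
    """Answer a question, pausing to self-optimize every N calls (sketch)."""
    answer = self._run_agent_loop(question)  # the tool loop sketched above
    self._call_count += 1
    if self._call_count % self.optimize_every == 0:
        # Benchmark the current prompt against the test cases, have the
        # judge model propose a rewrite, and keep whichever version scores best.
        self.optimize(max_iterations=1)
    return answer
</code></pre></div></div>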
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="whats-next">What's Next<a href="https://llamastack.github.io/blog/building-agentic-flows-with-conversations-and-responses#whats-next" class="hash-link" aria-label="Direct link to What's Next" title="Direct link to What's Next" translate="no">​</a></h2>
<p>The pattern here — a self-improving agent that benchmarks and rewrites its own prompt — generalizes well beyond research assistants:</p>
<ul>
<li class=""><strong>MCP tools</strong> for connecting to external services (databases, APIs, code execution sandboxes) — the research agent could pull in live data alongside static documents</li>
<li class=""><strong>Web search</strong> alongside <code>file_search</code> for agents that combine local knowledge with live web results</li>
<li class=""><strong>Multiple research agents</strong> with different vector stores, each self-improving independently and specializing in a different knowledge domain</li>
</ul>
<p>To learn more:</p>
<ul>
<li class=""><a class="" href="https://llamastack.github.io/docs/building_applications/responses_vs_agents">Responses API documentation</a></li>
<li class=""><a class="" href="https://llamastack.github.io/docs/api-openai/conformance#conversations">Conversations API documentation</a></li>
<li class=""><a class="" href="https://llamastack.github.io/docs/api-openai">OpenAI API compatibility</a></li>
<li class=""><a class="" href="https://llamastack.github.io/docs/building_applications/rag">Vector Stores documentation</a></li>
<li class=""><a href="https://discord.gg/llama-stack" target="_blank" rel="noopener noreferrer" class="">Join our Discord</a></li>
</ul>]]></content>
        <author>
            <name>Raghotham Murthy</name>
            <uri>https://github.com/raghotham</uri>
        </author>
        <category label="agents" term="agents"/>
        <category label="responses-api" term="responses-api"/>
        <category label="conversations" term="conversations"/>
        <category label="prompts" term="prompts"/>
        <category label="tutorial" term="tutorial"/>
    </entry>
    <entry>
        <title type="html"><![CDATA[How to Get Started with Llama Stack]]></title>
        <id>https://llamastack.github.io/blog/how-to-get-started-with-llama-stack</id>
        <link href="https://llamastack.github.io/blog/how-to-get-started-with-llama-stack"/>
        <updated>2026-01-30T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[There is no shortage of GenAI hosted services like OpenAI, Gemini, and Bedrock.]]></summary>
        <content type="html"><![CDATA[<p>There is no shortage of GenAI hosted services like OpenAI, Gemini, and Bedrock.</p>
<p>Often, these services require tailoring your GenAI application directly to them, forcing developers to deal with concerns that have nothing to do with their applications. Llama Stack is an open source project that aims to standardize and offer a set of APIs for AI applications that stay the same regardless of which backend services sit behind them.</p>
<p>Llama Stack’s APIs support a variety of use cases, from running inference with Ollama on your laptop, to a self-managed GPU system running inference with vLLM, to a pure SaaS solution like Vertex AI. Each standardized API has providers that implement the same REST interface. An admin of the stack can specify which provider to use for each API and expose the REST API to users, who get the same frontend experience regardless of the provider. This lets you run a single API surface layer using whatever Inference, Vector IO, or other solutions you want while keeping your GenAI applications simple.</p>
<p>A Llama Stack is defined by its <code>config.yaml</code> file, which holds key information such as which APIs to expose, which providers to initialize for those APIs, and their configuration. Llama Stack also features a CLI for launching and managing servers, either locally on your machine or in a container!</p>
<p>Here is a sample portion of a <code>config.yaml</code>:</p>
<div class="language-yaml codeBlockContainer_Ckt0 theme-code-block" style="--prism-background-color:hsl(220, 13%, 18%);--prism-color:hsl(220, 14%, 71%)"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-yaml codeBlock_bY9V thin-scrollbar" style="background-color:hsl(220, 13%, 18%);color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token key atrule" style="color:hsl(29, 54%, 61%)">version</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">:</span><span class="token plain"> </span><span class="token number" style="color:hsl(29, 54%, 61%)">2</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain"></span><span class="token key atrule" style="color:hsl(29, 54%, 61%)">distro_name</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">:</span><span class="token plain"> starter</span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain"></span><span class="token key atrule" style="color:hsl(29, 54%, 61%)">apis</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain"></span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">-</span><span class="token plain"> agents</span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain"></span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">-</span><span class="token plain"> batches</span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain"></span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">-</span><span class="token plain"> datasetio</span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain"></span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">-</span><span class="token plain"> eval</span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain"></span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">-</span><span class="token plain"> files</span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain"></span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">-</span><span class="token plain"> inference</span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain"></span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">-</span><span class="token plain"> post_training</span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain"></span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">-</span><span class="token plain"> safety</span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px 
rgba(0, 0, 0, 0.3)"><span class="token plain"></span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">-</span><span class="token plain"> scoring</span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain"></span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">-</span><span class="token plain"> tool_runtime</span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain"></span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">-</span><span class="token plain"> vector_io</span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain"></span><span class="token key atrule" style="color:hsl(29, 54%, 61%)">providers</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain"></span><span class="token key atrule" style="color:hsl(29, 54%, 61%)">inference</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain"></span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">-</span><span class="token plain"> </span><span class="token key atrule" style="color:hsl(29, 54%, 61%)">provider_id</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">:</span><span class="token plain"> $</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">{</span><span class="token plain">env.OLLAMA_URL</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">:</span><span class="token plain">+ollama</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">}</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain"></span><span class="token key atrule" style="color:hsl(29, 54%, 61%)">provider_type</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">:</span><span class="token plain"> remote</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">:</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">:</span><span class="token plain">ollama</span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain"></span><span class="token key atrule" style="color:hsl(29, 54%, 61%)">config</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain"></span><span class="token key atrule" style="color:hsl(29, 54%, 61%)">base_url</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">:</span><span class="token plain"> $</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">{</span><span class="token plain">env.OLLAMA_URL</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">:</span><span class="token plain">=http</span><span class="token punctuation" style="color:hsl(220, 
14%, 71%)">:</span><span class="token plain">//localhost</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">:</span><span class="token plain">11434/v1</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">}</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain"></span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">...</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain" style="display:inline-block"></span><br></span></code></pre></div></div>
<p>The current set of Llama Stack APIs can be found here: <a href="https://llamastack.github.io/docs/api-overview" target="_blank" rel="noopener noreferrer" class="">https://llamastack.github.io/docs/api-overview</a></p>
<p>All of the APIs listed in the <code>config.yaml</code> that defines the stack will be available via a REST API, backed by their initialized providers. Each API can have one or more providers, and the <code>provider_id</code> can be specified at request time.</p>
<p>To get started quickly, all you need is Llama Stack, Ollama, and your favorite inference model! For this example, we are using <code>gpt-oss:20b</code>.</p>
<p>If you already have Ollama installed as a service, you can simply pull the model:</p>
<div class="language-bash codeBlockContainer_Ckt0 theme-code-block" style="--prism-background-color:hsl(220, 13%, 18%);--prism-color:hsl(220, 14%, 71%)"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-bash codeBlock_bY9V thin-scrollbar" style="background-color:hsl(220, 13%, 18%);color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">ollama pull gpt-oss:20b</span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">uv run </span><span class="token parameter variable" style="color:hsl(207, 82%, 66%)">--with</span><span class="token plain"> llama-stack llama stack list-deps </span><span class="token parameter variable" style="color:hsl(207, 82%, 66%)">--providers</span><span class="token plain"> </span><span class="token assign-left variable" style="color:hsl(207, 82%, 66%)">inference</span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token plain">remote::ollama </span><span class="token parameter variable" style="color:hsl(207, 82%, 66%)">--format</span><span class="token plain"> uv </span><span class="token operator" style="color:hsl(207, 82%, 66%)">|</span><span class="token plain"> </span><span class="token function" style="color:hsl(207, 82%, 66%)">sh</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">uv run </span><span class="token parameter variable" style="color:hsl(207, 82%, 66%)">--with</span><span class="token plain"> llama-stack llama stack run </span><span class="token parameter variable" style="color:hsl(207, 82%, 66%)">--providers</span><span class="token plain"> </span><span class="token assign-left variable" style="color:hsl(207, 82%, 66%)">inference</span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token plain">remote::ollama</span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain" style="display:inline-block"></span><br></span></code></pre></div></div>
<p>If you don't have Ollama running as a service, you can start it manually:</p>
<div class="language-bash codeBlockContainer_Ckt0 theme-code-block" style="--prism-background-color:hsl(220, 13%, 18%);--prism-color:hsl(220, 14%, 71%)"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-bash codeBlock_bY9V thin-scrollbar" style="background-color:hsl(220, 13%, 18%);color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">ollama serve </span><span class="token operator" style="color:hsl(207, 82%, 66%)">&gt;</span><span class="token plain"> /dev/null </span><span class="token operator file-descriptor important" style="color:hsl(220, 14%, 71%);font-weight:bold">2</span><span class="token operator" style="color:hsl(207, 82%, 66%)">&gt;</span><span class="token file-descriptor important" style="color:hsl(220, 14%, 71%);font-weight:bold">&amp;1</span><span class="token plain"> </span><span class="token operator" style="color:hsl(207, 82%, 66%)">&amp;</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">ollama run gpt-oss:20b </span><span class="token parameter variable" style="color:hsl(207, 82%, 66%)">--keepalive</span><span class="token plain"> 60m </span><span class="token comment" style="color:hsl(220, 10%, 40%)"># you can exit this once the model is running due to --keepalive</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">uv run </span><span class="token parameter variable" style="color:hsl(207, 82%, 66%)">--with</span><span class="token plain"> llama-stack llama stack </span><span class="token parameter variable" style="color:hsl(207, 82%, 66%)">--providers</span><span class="token plain"> </span><span class="token assign-left variable" style="color:hsl(207, 82%, 66%)">inference</span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token plain">remote::ollama </span><span class="token parameter variable" style="color:hsl(207, 82%, 66%)">--format</span><span class="token plain"> uv </span><span class="token operator" style="color:hsl(207, 82%, 66%)">|</span><span class="token plain"> </span><span class="token function" style="color:hsl(207, 82%, 66%)">sh</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">uv run </span><span class="token parameter variable" style="color:hsl(207, 82%, 66%)">--with</span><span class="token plain"> llama-stack llama stack run </span><span class="token parameter variable" style="color:hsl(207, 82%, 66%)">--providers</span><span class="token plain"> </span><span class="token assign-left variable" style="color:hsl(207, 82%, 66%)">inference</span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token plain">remote::ollama</span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain" style="display:inline-block"></span><br></span></code></pre></div></div>
<p>Now you have Ollama running with <code>gpt-oss:20b</code> and a Llama Stack server pointing to Ollama as the inference provider. This minimal setup is sufficient to connect to local Ollama and respond to <code>/v1/chat/completions</code> requests.</p>
<p>For a more feature-rich setup, you can use the starter distribution which gives you a full stack with additional APIs and providers:</p>
<div class="language-bash codeBlockContainer_Ckt0 theme-code-block" style="--prism-background-color:hsl(220, 13%, 18%);--prism-color:hsl(220, 14%, 71%)"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-bash codeBlock_bY9V thin-scrollbar" style="background-color:hsl(220, 13%, 18%);color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">ollama serve </span><span class="token operator" style="color:hsl(207, 82%, 66%)">&gt;</span><span class="token plain"> /dev/null </span><span class="token operator file-descriptor important" style="color:hsl(220, 14%, 71%);font-weight:bold">2</span><span class="token operator" style="color:hsl(207, 82%, 66%)">&gt;</span><span class="token file-descriptor important" style="color:hsl(220, 14%, 71%);font-weight:bold">&amp;1</span><span class="token plain"> </span><span class="token operator" style="color:hsl(207, 82%, 66%)">&amp;</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">ollama run gpt-oss:20b </span><span class="token parameter variable" style="color:hsl(207, 82%, 66%)">--keepalive</span><span class="token plain"> 60m </span><span class="token comment" style="color:hsl(220, 10%, 40%)"># you can exit this once the model is running due to --keepalive</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">uv run </span><span class="token parameter variable" style="color:hsl(207, 82%, 66%)">--with</span><span class="token plain"> llama-stack llama stack list-deps starter </span><span class="token parameter variable" style="color:hsl(207, 82%, 66%)">--format</span><span class="token plain"> uv </span><span class="token operator" style="color:hsl(207, 82%, 66%)">|</span><span class="token plain"> </span><span class="token function" style="color:hsl(207, 82%, 66%)">sh</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain"></span><span class="token builtin class-name" style="color:hsl(29, 54%, 61%)">export</span><span class="token plain"> </span><span class="token assign-left variable" style="color:hsl(207, 82%, 66%)">OLLAMA_URL</span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token plain">http://localhost:11434/v1</span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">uv run </span><span class="token parameter variable" style="color:hsl(207, 82%, 66%)">--with</span><span class="token plain"> llama-stack llama stack run starter</span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain" style="display:inline-block"></span><br></span></code></pre></div></div>
<p>A sample chat completion request would look like this:</p>
<div class="language-bash codeBlockContainer_Ckt0 theme-code-block" style="--prism-background-color:hsl(220, 13%, 18%);--prism-color:hsl(220, 14%, 71%)"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-bash codeBlock_bY9V thin-scrollbar" style="background-color:hsl(220, 13%, 18%);color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token function" style="color:hsl(207, 82%, 66%)">curl</span><span class="token plain"> </span><span class="token parameter variable" style="color:hsl(207, 82%, 66%)">-X</span><span class="token plain"> POST http://localhost:8321/v1/chat/completions </span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">\</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain"></span><span class="token parameter variable" style="color:hsl(207, 82%, 66%)">-H</span><span class="token plain"> </span><span class="token string" style="color:hsl(95, 38%, 62%)">"Content-Type: application/json"</span><span class="token plain"> </span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">\</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain"></span><span class="token parameter variable" style="color:hsl(207, 82%, 66%)">-d</span><span class="token plain"> </span><span class="token string" style="color:hsl(95, 38%, 62%)">'{</span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token string" style="color:hsl(95, 38%, 62%)">"model": "ollama/gpt-oss:20b",</span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token string" style="color:hsl(95, 38%, 62%)">"messages": [{"role": "user", "content": "Hello!"}]</span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token string" style="color:hsl(95, 38%, 62%)">}'</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain" style="display:inline-block"></span><br></span></code></pre></div></div>
<p>Notice that the model name must be prefixed with the <code>provider_id</code> in order for the request to route properly! In this example, we are only using the <code>/chat/completions</code> route of the Inference API. The starter distribution has a large number of APIs and ready-to-use providers baked in. Example API requests, similar to the one above, for the other APIs can be found in the <a href="https://llamastack.github.io/docs/api/llama-stack-specification" target="_blank" rel="noopener noreferrer" class="">Llama Stack API specification</a>. Take it for a spin and see what you can do with Llama Stack!</p>
        <author>
            <name>Charlie Doern</name>
            <uri>https://github.com/cdoern</uri>
        </author>
        <author>
            <name>Nathan Weinberg</name>
            <uri>https://github.com/nathan-weinberg</uri>
        </author>
        <author>
            <name>Llama Stack Team</name>
            <uri>https://github.com/llamastack</uri>
        </author>
        <category label="introduction" term="introduction"/>
        <category label="how-to" term="how-to"/>
    </entry>
    <entry>
        <title type="html"><![CDATA[Introducing Llama Stack - The Open-Source Platform for Building AI Applications]]></title>
        <id>https://llamastack.github.io/blog/introducing-llama-stack</id>
        <link href="https://llamastack.github.io/blog/introducing-llama-stack"/>
        <updated>2026-01-22T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[Welcome to our blog!]]></summary>
        <content type="html"><![CDATA[<p>Welcome to our blog!</p>
<p>We're excited to introduce you to <strong>Llama Stack</strong> - the open-source platform that simplifies building production-ready generative AI applications.</p>
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="what-is-llama-stack">What is Llama Stack?<a href="https://llamastack.github.io/blog/introducing-llama-stack#what-is-llama-stack" class="hash-link" aria-label="Direct link to What is Llama Stack?" title="Direct link to What is Llama Stack?" translate="no">​</a></h2>
<p>Llama Stack defines and standardizes the core building blocks needed to bring generative AI applications to market, centered on the <a href="https://www.openresponses.org/" target="_blank" rel="noopener noreferrer" class="">Open Responses specification</a>. By aligning with OpenAI’s open-sourced Responses API, Llama Stack provides a consistent, interoperable foundation for building agentic and generative systems. It offers a growing suite of open-source APIs—including prompts, conversations, files, models, embeddings, fine-tuning, and MCP—enabling seamless transitions from local development to production across providers and environments.</p>
<p>Think of Llama Stack as a universal interface that abstracts away the complexity of working with different AI tools and providers (e.g., vector databases, model inference providers, and deployment environments). Whether you're building locally, deploying on-premises, or scaling in the cloud, Llama Stack provides a consistent developer experience.</p>
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="key-features">Key Features<a href="https://llamastack.github.io/blog/introducing-llama-stack#key-features" class="hash-link" aria-label="Direct link to Key Features" title="Direct link to Key Features" translate="no">​</a></h2>
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="unified-api-layer">Unified API Layer<a href="https://llamastack.github.io/blog/introducing-llama-stack#unified-api-layer" class="hash-link" aria-label="Direct link to Unified API Layer" title="Direct link to Unified API Layer" translate="no">​</a></h3>
<p>Llama Stack provides standardized APIs across five core capabilities:</p>
<ul>
<li class=""><strong>Inference</strong>: Run models locally or in the cloud with a consistent interface</li>
<li class=""><strong>Vector Stores</strong>: Build knowledge and agentic retrieval systems</li>
<li class=""><strong>Agents</strong>: Create intelligent agent flows with responses/conversations</li>
<li class=""><strong>Tools and MCP</strong>: Integrate with external tools and services directly or via MCP</li>
<li class=""><strong>Moderations</strong>: Built-in safety guardrails and content filtering via moderations api</li>
</ul>
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="plugin-architecture">Plugin Architecture<a href="https://llamastack.github.io/blog/introducing-llama-stack#plugin-architecture" class="hash-link" aria-label="Direct link to Plugin Architecture" title="Direct link to Plugin Architecture" translate="no">​</a></h3>
<p>The plugin architecture supports a rich ecosystem of API implementations across different environments:</p>
<ul>
<li class=""><strong>Local Development</strong>: Start with CPU-only setups for rapid iteration</li>
<li class=""><strong>On-Premises</strong>: Deploy in your own infrastructure</li>
<li class=""><strong>Cloud</strong>: Scale with hosted providers</li>
</ul>
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="prepackaged-distributions">Prepackaged Distributions<a href="https://llamastack.github.io/blog/introducing-llama-stack#prepackaged-distributions" class="hash-link" aria-label="Direct link to Prepackaged Distributions" title="Direct link to Prepackaged Distributions" translate="no">​</a></h3>
<p>Distributions are pre-configured bundles of provider implementations that make it easy to get started. You can begin with a local setup using Ollama and seamlessly transition to production with vLLM - all without changing your application code.</p>
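<p>Concretely, the application code stays identical across environments; only the stack's <code>config.yaml</code> changes. A hypothetical sketch, assuming the <code>openai</code> package pointed at a stack on the default port (model ids are illustrative):</p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar"><code class="codeBlockLines_e6Vv">from openai import OpenAI

# The client targets the stack, never a specific backend, so swapping
# Ollama for vLLM is a server-side configuration change.
client = OpenAI(base_url="http://localhost:8321/v1", api_key="none")

response = client.chat.completions.create(
    model="ollama/llama3.1:8b",   # dev distribution (Ollama provider)
    # model="vllm/llama3.1:8b",   # prod distribution (vLLM provider, illustrative id)
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
</code></pre></div></div>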
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="multiple-developer-interfaces">Multiple Developer Interfaces<a href="https://llamastack.github.io/blog/introducing-llama-stack#multiple-developer-interfaces" class="hash-link" aria-label="Direct link to Multiple Developer Interfaces" title="Direct link to Multiple Developer Interfaces" translate="no">​</a></h3>
<p>Llama Stack supports various developer interfaces:</p>
<ul>
<li class=""><strong>CLI</strong>: Command-line tools for server management</li>
<li class=""><strong>Python SDK</strong>: <a href="https://github.com/meta-llama/llama-stack-client-python" target="_blank" rel="noopener noreferrer" class=""><code>llama-stack-client-python</code></a></li>
<li class=""><strong>TypeScript SDK</strong>: <a href="https://github.com/meta-llama/llama-stack-client-typescript" target="_blank" rel="noopener noreferrer" class=""><code>llama-stack-client-typescript</code></a></li>
</ul>
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="why-llama-stack">Why Llama Stack?<a href="https://llamastack.github.io/blog/introducing-llama-stack#why-llama-stack" class="hash-link" aria-label="Direct link to Why Llama Stack?" title="Direct link to Why Llama Stack?" translate="no">​</a></h2>
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="flexibility-without-compromise">Flexibility Without Compromise<a href="https://llamastack.github.io/blog/introducing-llama-stack#flexibility-without-compromise" class="hash-link" aria-label="Direct link to Flexibility Without Compromise" title="Direct link to Flexibility Without Compromise" translate="no">​</a></h3>
<p>Developers can choose their preferred infrastructure without changing APIs. This means you can:</p>
<ul>
<li class="">Start locally for development</li>
<li class="">Test with different providers</li>
<li class="">Deploy to production with your chosen infrastructure</li>
<li class="">Switch providers as your needs evolve</li>
</ul>
<p>All while maintaining the same codebase and APIs.</p>
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="consistent-experience">Consistent Experience<a href="https://llamastack.github.io/blog/introducing-llama-stack#consistent-experience" class="hash-link" aria-label="Direct link to Consistent Experience" title="Direct link to Consistent Experience" translate="no">​</a></h3>
<p>With unified APIs, Llama Stack makes it easier to:</p>
<ul>
<li class="">Build applications with consistent behavior</li>
<li class="">Test across different environments</li>
<li class="">Deploy with confidence</li>
<li class="">Maintain and update your codebase</li>
</ul>
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="robust-ecosystem">Robust Ecosystem<a href="https://llamastack.github.io/blog/introducing-llama-stack#robust-ecosystem" class="hash-link" aria-label="Direct link to Robust Ecosystem" title="Direct link to Robust Ecosystem" translate="no">​</a></h3>
<p>Llama Stack integrates with distribution partners including:</p>
<ul>
<li class=""><strong>Cloud Providers</strong>: AWS Bedrock, Together, Fireworks, and more</li>
<li class=""><strong>Hardware Vendors</strong>: NVIDIA, Cerebras, SambaNova</li>
<li class=""><strong>Vector Databases</strong>: ChromaDB, Milvus, Qdrant, Weaviate, PostgreSQL, ElasticSearch</li>
<li class=""><strong>AI Companies</strong>: OpenAI, Anthropic, Google Gemini</li>
</ul>
<p>For a complete list, check out our <a class="" href="https://llamastack.github.io/docs/providers">Providers Documentation</a>.</p>
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="how-it-works">How It Works<a href="https://llamastack.github.io/blog/introducing-llama-stack#how-it-works" class="hash-link" aria-label="Direct link to How It Works" title="Direct link to How It Works" translate="no">​</a></h2>
<p>Llama Stack consists of two main components:</p>
<ol>
<li class=""><strong>Server</strong>: A server with pluggable API providers that can run in various environments</li>
<li class=""><strong>Client SDKs</strong>: Libraries for your applications to interact with the server</li>
</ol>
<p>The server handles all the complexity of managing different providers, while the client SDKs provide a simple, consistent interface for your application code.</p>
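<p>As a minimal sketch of the client side, assuming the <code>llama-stack-client</code> Python SDK and a server on the default port:</p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar"><code class="codeBlockLines_e6Vv">from llama_stack_client import LlamaStackClient

# Point the SDK at a running Llama Stack server.
client = LlamaStackClient(base_url="http://localhost:8321")

# List the models exposed by the configured providers.
for model in client.models.list():
    print(model.identifier)
</code></pre></div></div>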
<p>Refer to the <a href="https://llamastack.github.io/docs/getting_started/quickstart" target="_blank" rel="noopener noreferrer" class="">Quick Start Guide</a> to get started building your first AI application with Llama Stack.</p>
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="whats-next">What's Next?<a href="https://llamastack.github.io/blog/introducing-llama-stack#whats-next" class="hash-link" aria-label="Direct link to What's Next?" title="Direct link to What's Next?" translate="no">​</a></h2>
<p>See the <a href="https://docs.google.com/document/d/1it-OsGFgAIwAUctQRQ-j1CBxFHhvSm530YR67eYGW1I/edit?tab=t.4uf22mux1a94" target="_blank" rel="noopener noreferrer" class="">Llama Stack Office Hours Content Calendar</a> for upcoming topics and the blog roadmap.</p>
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="join-the-community">Join the Community<a href="https://llamastack.github.io/blog/introducing-llama-stack#join-the-community" class="hash-link" aria-label="Direct link to Join the Community" title="Direct link to Join the Community" translate="no">​</a></h2>
<p>We'd love to have you join our growing community:</p>
<ul>
<li class=""><a href="https://github.com/llamastack/llama-stack" target="_blank" rel="noopener noreferrer" class="">Star us on GitHub</a></li>
<li class=""><a href="https://discord.gg/llama-stack" target="_blank" rel="noopener noreferrer" class="">Join our Discord</a></li>
<li class=""><a class="" href="https://llamastack.github.io/docs">Read the Documentation</a></li>
<li class=""><a href="https://github.com/llamastack/llama-stack/issues" target="_blank" rel="noopener noreferrer" class="">Report Issues</a></li>
</ul>
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="conclusion">Conclusion<a href="https://llamastack.github.io/blog/introducing-llama-stack#conclusion" class="hash-link" aria-label="Direct link to Conclusion" title="Direct link to Conclusion" translate="no">​</a></h2>
<p>Llama Stack is designed to make building AI applications simpler, more flexible, and more maintainable. By providing unified APIs and a rich ecosystem of providers, we're enabling developers to focus on what matters most - building great applications.</p>
<p>Whether you're just getting started with AI or building production systems at scale, Llama Stack has something to offer. We're excited to see what you'll build!</p>]]></content>
        <author>
            <name>Llama Stack Team</name>
            <uri>https://github.com/llamastack</uri>
        </author>
        <category label="announcement" term="announcement"/>
        <category label="introduction" term="introduction"/>
        <category label="getting-started" term="getting-started"/>
    </entry>
</feed>