<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
    <id>https://llamastack.github.io/blog</id>
    <title>Llama Stack Blog</title>
    <updated>2026-04-06T00:00:00.000Z</updated>
    <generator>https://github.com/jpmonette/feed</generator>
    <link rel="alternate" href="https://llamastack.github.io/blog"/>
    <subtitle>Blog posts about Llama Stack</subtitle>
    <icon>https://llamastack.github.io/img/favicon.ico</icon>
    <rights>Copyright © 2026 Meta Platforms, Inc.</rights>
    <entry>
        <title type="html"><![CDATA[Tracing LlamaStack Applications with MLflow: SDK vs OTel Collector]]></title>
        <id>https://llamastack.github.io/blog/mlflow-observability</id>
        <link href="https://llamastack.github.io/blog/mlflow-observability"/>
        <updated>2026-04-06T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[As LLM-powered applications grow in complexity, observability becomes essential. You need to understand what your application is doing — what prompts are being sent, what responses come back, how long each call takes, and how many tokens are consumed. MLflow provides a powerful tracing framework that captures all of this, which can be integrated with llamastack for observability.]]></summary>
        <content type="html"><![CDATA[<p>As LLM-powered applications grow in complexity, observability becomes essential. You need to understand what your application is doing — what prompts are being sent, what responses come back, how long each call takes, and how many tokens are consumed. <a href="https://mlflow.org/" target="_blank" rel="noopener noreferrer" class="">MLflow</a> provides a powerful tracing framework that captures all of this, which can be integrated with <a href="https://github.com/llamastack/llama-stack" target="_blank" rel="noopener noreferrer" class="">llamastack</a> for observability.</p>
<p>In this post, we'll walk through two approaches for exporting LlamaStack traces into MLflow:</p>
<ol>
<li class=""><strong>MLflow SDK</strong> — Direct instrumentation using MLflow's built-in tracing and autologging</li>
<li class=""><strong>OTel Collector</strong> — Decoupled telemetry pipeline using OpenTelemetry auto-instrumentation and an OTel Collector as the intermediary</li>
</ol>
<p>By the end, you'll understand when to use each approach and how to set them up.</p>
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="mlflow-tracing-a-quick-overview">MLflow Tracing: A Quick Overview<a href="https://llamastack.github.io/blog/mlflow-observability#mlflow-tracing-a-quick-overview" class="hash-link" aria-label="Direct link to MLflow Tracing: A Quick Overview" title="Direct link to MLflow Tracing: A Quick Overview" translate="no">​</a></h2>
<p>MLflow is an open-source platform for managing the ML lifecycle. Starting with version 2.14, MLflow introduced <strong>GenAI tracing</strong> — a first-class feature for capturing LLM interactions including:</p>
<ul>
<li class=""><strong>Traces and Spans</strong>: Hierarchical representation of operations (API calls, tool invocations, chain steps)</li>
<li class=""><strong>Token Usage &amp; Latency</strong>: Automatic capture of input/output tokens and response times</li>
<li class=""><strong>Input/Output Logging</strong>: Full request and response payloads for debugging</li>
<li class=""><strong>Web UI</strong>: A built-in dashboard for exploring, filtering, and analyzing traces</li>
</ul>
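<p>As a quick taste of the tracing API (separate from the autologging used later in this post), here's a minimal sketch using the <code>@mlflow.trace</code> decorator — the function names and return values are illustrative, not part of any real pipeline:</p>
<pre><code class="language-python">import mlflow

mlflow.set_tracking_uri("http://localhost:5000")
mlflow.set_experiment("LlamaStack Demo")

# Each decorated function becomes a span; nested calls form the trace tree.
@mlflow.trace
def retrieve(query: str) -> list[str]:
    return ["LlamaStack provides an OpenAI-compatible API."]

@mlflow.trace
def answer(query: str) -> str:
    docs = retrieve(query)  # recorded as a child span
    return f"Found {len(docs)} document(s): {docs[0]}"

answer("What is LlamaStack?")
</code></pre>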
<p>MLflow also supports ingesting traces via the <strong>OpenTelemetry (OTLP) protocol</strong> at its <code>/v1/traces</code> endpoint, which opens the door to vendor-neutral instrumentation — more on that in the OTel Collector section.</p>
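<p>To see that endpoint in action without a collector, here's a hedged sketch that pushes one hand-made span straight to MLflow over OTLP. It assumes an MLflow 3.x server at <code>localhost:5000</code> and an existing experiment with ID <code>1</code>:</p>
<pre><code class="language-python"># Sketch: direct OTLP export to MLflow's /v1/traces endpoint (no collector).
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter

exporter = OTLPSpanExporter(
    endpoint="http://localhost:5000/v1/traces",
    headers={"x-mlflow-experiment-id": "1"},  # routes the trace to experiment 1
)
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(exporter))
trace.set_tracer_provider(provider)

with trace.get_tracer("demo").start_as_current_span("hello-mlflow"):
    pass  # the span appears in MLflow's Traces tab after flush

provider.force_flush()
</code></pre>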
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="llamastack-and-its-openai-compatible-api">LlamaStack and Its OpenAI-Compatible API<a href="https://llamastack.github.io/blog/mlflow-observability#llamastack-and-its-openai-compatible-api" class="hash-link" aria-label="Direct link to LlamaStack and Its OpenAI-Compatible API" title="Direct link to LlamaStack and Its OpenAI-Compatible API" translate="no">​</a></h2>
<p>LlamaStack provides an OpenAI-compatible API, meaning any tooling that works with OpenAI's chat completions API or responses API also works with LlamaStack. This is key for tracing: we can leverage existing OpenAI instrumentation libraries (both MLflow's <code>openai.autolog()</code> and OpenTelemetry's <code>opentelemetry-instrumentation-openai-v2</code>) to capture traces without writing custom code.</p>
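<p>Concretely, the stock OpenAI client works unchanged against LlamaStack — only the base URL (and a placeholder API key) differ. The model name below is an example; use whatever your server has registered:</p>
<pre><code class="language-python">from openai import OpenAI

# Same client library you'd point at api.openai.com; only base_url changes.
client = OpenAI(base_url="http://localhost:8321/v1", api_key="none")

resp = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Say hello."}],
)
print(resp.choices[0].message.content)
</code></pre>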
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="architecture-overview">Architecture Overview<a href="https://llamastack.github.io/blog/mlflow-observability#architecture-overview" class="hash-link" aria-label="Direct link to Architecture Overview" title="Direct link to Architecture Overview" translate="no">​</a></h2>
<p>Before diving into the details, here's a high-level view of both approaches:</p>
<!-- -->
<!-- -->
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="prerequisites">Prerequisites<a href="https://llamastack.github.io/blog/mlflow-observability#prerequisites" class="hash-link" aria-label="Direct link to Prerequisites" title="Direct link to Prerequisites" translate="no">​</a></h2>
<p>For both approaches, you'll need:</p>
<ul>
<li class=""><strong>Python 3.10+</strong></li>
<li class=""><strong>A running LlamaStack server</strong> (e.g., at <code>http://localhost:8321</code>)</li>
<li class=""><strong>MLflow &gt;= 3.10</strong> with GenAI extras:</li>
</ul>
<div class="language-bash codeBlockContainer_Ckt0 theme-code-block" style="--prism-background-color:hsl(220, 13%, 18%);--prism-color:hsl(220, 14%, 71%)"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-bash codeBlock_bY9V thin-scrollbar" style="background-color:hsl(220, 13%, 18%);color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">pip </span><span class="token function" style="color:hsl(207, 82%, 66%)">install</span><span class="token plain"> </span><span class="token string" style="color:hsl(95, 38%, 62%)">"mlflow[genai]&gt;=3.10"</span><span class="token plain"> </span><span class="token string" style="color:hsl(95, 38%, 62%)">"openai&gt;=2.20.0"</span><br></span></code></pre></div></div>
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="start-the-mlflow-tracking-server">Start the MLflow Tracking Server<a href="https://llamastack.github.io/blog/mlflow-observability#start-the-mlflow-tracking-server" class="hash-link" aria-label="Direct link to Start the MLflow Tracking Server" title="Direct link to Start the MLflow Tracking Server" translate="no">​</a></h3>
<p>Launch a local MLflow server with SQLite as the backend store:</p>
<div class="language-bash codeBlockContainer_Ckt0 theme-code-block" style="--prism-background-color:hsl(220, 13%, 18%);--prism-color:hsl(220, 14%, 71%)"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-bash codeBlock_bY9V thin-scrollbar" style="background-color:hsl(220, 13%, 18%);color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">mlflow server </span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">\</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">  --backend-store-uri sqlite:///mlflow.db </span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">\</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">  --default-artifact-root ./mlruns </span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">\</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">  </span><span class="token parameter variable" style="color:hsl(207, 82%, 66%)">--host</span><span class="token plain"> </span><span class="token number" style="color:hsl(29, 54%, 61%)">0.0</span><span class="token plain">.0.0 </span><span class="token parameter variable" style="color:hsl(207, 82%, 66%)">--port</span><span class="token plain"> </span><span class="token number" style="color:hsl(29, 54%, 61%)">5000</span><br></span></code></pre></div></div>
<p>The MLflow UI will be available at <a href="http://localhost:5000/" target="_blank" rel="noopener noreferrer" class="">http://localhost:5000</a>.</p>
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="approach-1-tracing-via-mlflow-sdk">Approach 1: Tracing via MLflow SDK<a href="https://llamastack.github.io/blog/mlflow-observability#approach-1-tracing-via-mlflow-sdk" class="hash-link" aria-label="Direct link to Approach 1: Tracing via MLflow SDK" title="Direct link to Approach 1: Tracing via MLflow SDK" translate="no">​</a></h2>
<p>This approach uses MLflow's native tracing SDK to capture and export traces directly to the MLflow server. It's the simplest way to get started.</p>
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="step-1-instrument-your-code">Step 1: Instrument Your Code<a href="https://llamastack.github.io/blog/mlflow-observability#step-1-instrument-your-code" class="hash-link" aria-label="Direct link to Step 1: Instrument Your Code" title="Direct link to Step 1: Instrument Your Code" translate="no">​</a></h3>
<p>Add MLflow tracing to your LlamaStack client code. The example below uses the <a href="https://platform.openai.com/docs/api-reference/responses" target="_blank" rel="noopener noreferrer" class="">Responses API</a> (<code>client.responses.create</code>), which is the recommended way to interact with LlamaStack:</p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-background-color:hsl(220, 13%, 18%);--prism-color:hsl(220, 14%, 71%)"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="background-color:hsl(220, 13%, 18%);color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token keyword" style="color:hsl(286, 60%, 67%)">import</span><span class="token plain"> mlflow</span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain"></span><span class="token keyword" style="color:hsl(286, 60%, 67%)">import</span><span class="token plain"> mlflow</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">tracing </span><span class="token keyword" style="color:hsl(286, 60%, 67%)">as</span><span class="token plain"> mlflow_tracing</span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain"></span><span class="token keyword" style="color:hsl(286, 60%, 67%)">from</span><span class="token plain"> openai </span><span class="token keyword" style="color:hsl(286, 60%, 67%)">import</span><span class="token plain"> OpenAI</span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain"></span><span class="token comment" style="color:hsl(220, 10%, 40%)"># Configure MLflow</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">mlflow</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">set_tracking_uri</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">(</span><span class="token string" style="color:hsl(95, 38%, 62%)">"http://localhost:5000"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">mlflow</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">set_experiment</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">(</span><span class="token string" style="color:hsl(95, 38%, 62%)">"LlamaStack Demo"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain"></span><span class="token comment" style="color:hsl(220, 10%, 40%)"># Enable tracing and OpenAI autologging</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token 
plain">mlflow_tracing</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">enable</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">(</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">mlflow</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">openai</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">autolog</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">(</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain"></span><span class="token comment" style="color:hsl(220, 10%, 40%)"># Create an OpenAI-compatible client pointing to LlamaStack</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">client </span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token plain"> OpenAI</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">(</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">    base_url</span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token string" style="color:hsl(95, 38%, 62%)">"http://localhost:8321/v1"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">    api_key</span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token string" style="color:hsl(95, 38%, 62%)">"fake"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain"></span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">response </span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token plain"> client</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">responses</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">create</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">(</span><span class="token plain"></span><br></span><span class="token-line" 
style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">    model</span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token string" style="color:hsl(95, 38%, 62%)">"meta-llama/Llama-3.1-8B-Instruct"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">    </span><span class="token builtin" style="color:hsl(95, 38%, 62%)">input</span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token string" style="color:hsl(95, 38%, 62%)">"Give a one-sentence description of LlamaStack."</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain"></span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain"></span><span class="token keyword" style="color:hsl(286, 60%, 67%)">print</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">(</span><span class="token plain">response</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">output_text</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">)</span><br></span></code></pre></div></div>
<p>MLflow's <code>openai.autolog()</code> automatically captures every <code>client.responses.create()</code> call as a trace, including inputs, outputs, token usage, and latency — with zero additional instrumentation code.</p>
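<p>Autologged calls can also be grouped under a custom parent span — useful when one user request triggers several LLM calls. A small sketch building on the client above:</p>
<pre><code class="language-python">import mlflow

@mlflow.trace(name="qa_pipeline")
def ask(question: str) -> str:
    # The autologged responses.create span nests under this parent span.
    r = client.responses.create(
        model="meta-llama/Llama-3.1-8B-Instruct",
        input=question,
    )
    return r.output_text

print(ask("What is LlamaStack?"))
</code></pre>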
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="step-2-run-the-application">Step 2: Run the Application<a href="https://llamastack.github.io/blog/mlflow-observability#step-2-run-the-application" class="hash-link" aria-label="Direct link to Step 2: Run the Application" title="Direct link to Step 2: Run the Application" translate="no">​</a></h3>
<div class="language-bash codeBlockContainer_Ckt0 theme-code-block" style="--prism-background-color:hsl(220, 13%, 18%);--prism-color:hsl(220, 14%, 71%)"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-bash codeBlock_bY9V thin-scrollbar" style="background-color:hsl(220, 13%, 18%);color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token assign-left variable" style="color:hsl(207, 82%, 66%)">MLFLOW_TRACKING_URI</span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token plain">http://localhost:5000 </span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">\</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">python your_app.py</span><br></span></code></pre></div></div>
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="step-3-view-traces-in-mlflow">Step 3: View Traces in MLflow<a href="https://llamastack.github.io/blog/mlflow-observability#step-3-view-traces-in-mlflow" class="hash-link" aria-label="Direct link to Step 3: View Traces in MLflow" title="Direct link to Step 3: View Traces in MLflow" translate="no">​</a></h3>
<p>Open <a href="http://localhost:5000/" target="_blank" rel="noopener noreferrer" class="">http://localhost:5000</a>, navigate to your experiment ("LlamaStack Demo"), and click the <strong>Traces</strong> tab. You'll see each request with:</p>
<ul>
<li class="">Full input/output payloads</li>
<li class="">Token usage (input, output, total)</li>
<li class="">Latency breakdown</li>
<li class="">Span hierarchy</li>
</ul>
<p><img decoding="async" loading="lazy" alt="MLflow distributed traces for LlamaStack via SDK" src="https://llamastack.github.io/assets/images/mlflow-sdk-11700f9a7b21fd5010f6feaf34faf951.png" width="3016" height="1798" class="img_ev3q"></p>
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="pros-and-cons">Pros and Cons<a href="https://llamastack.github.io/blog/mlflow-observability#pros-and-cons" class="hash-link" aria-label="Direct link to Pros and Cons" title="Direct link to Pros and Cons" translate="no">​</a></h3>
<table><thead><tr><th>Aspect</th><th>Details</th></tr></thead><tbody><tr><td><strong>Simplicity</strong></td><td>Minimal setup — just a few lines of Python</td></tr><tr><td><strong>Rich data</strong></td><td>MLflow autolog captures detailed OpenAI-specific metadata</td></tr><tr><td><strong>Coupling</strong></td><td>Application code depends on <code>mlflow</code> package</td></tr><tr><td><strong>Flexibility</strong></td><td>Traces go directly to MLflow — no intermediary routing or fan-out</td></tr></tbody></table>
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="approach-2-tracing-via-otel-collector">Approach 2: Tracing via OTel Collector<a href="https://llamastack.github.io/blog/mlflow-observability#approach-2-tracing-via-otel-collector" class="hash-link" aria-label="Direct link to Approach 2: Tracing via OTel Collector" title="Direct link to Approach 2: Tracing via OTel Collector" translate="no">​</a></h2>
<p>This approach decouples instrumentation from the trace backend. The application uses <strong>OpenTelemetry auto-instrumentation</strong> to emit spans, which flow through an <strong>OTel Collector</strong> before being forwarded to MLflow's OTLP endpoint.</p>
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="step-1-install-opentelemetry-dependencies">Step 1: Install OpenTelemetry Dependencies<a href="https://llamastack.github.io/blog/mlflow-observability#step-1-install-opentelemetry-dependencies" class="hash-link" aria-label="Direct link to Step 1: Install OpenTelemetry Dependencies" title="Direct link to Step 1: Install OpenTelemetry Dependencies" translate="no">​</a></h3>
<div class="language-bash codeBlockContainer_Ckt0 theme-code-block" style="--prism-background-color:hsl(220, 13%, 18%);--prism-color:hsl(220, 14%, 71%)"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-bash codeBlock_bY9V thin-scrollbar" style="background-color:hsl(220, 13%, 18%);color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">pip </span><span class="token function" style="color:hsl(207, 82%, 66%)">install</span><span class="token plain"> opentelemetry-api </span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">\</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">            opentelemetry-sdk </span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">\</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">            opentelemetry-exporter-otlp </span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">\</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">            opentelemetry-instrumentation-openai</span><br></span></code></pre></div></div>
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="step-2-configure-the-otel-collector">Step 2: Configure the OTel Collector<a href="https://llamastack.github.io/blog/mlflow-observability#step-2-configure-the-otel-collector" class="hash-link" aria-label="Direct link to Step 2: Configure the OTel Collector" title="Direct link to Step 2: Configure the OTel Collector" translate="no">​</a></h3>
<p>Create an <code>otel-collector-config.yaml</code>:</p>
<div class="language-yaml codeBlockContainer_Ckt0 theme-code-block" style="--prism-background-color:hsl(220, 13%, 18%);--prism-color:hsl(220, 14%, 71%)"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-yaml codeBlock_bY9V thin-scrollbar" style="background-color:hsl(220, 13%, 18%);color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token key atrule" style="color:hsl(29, 54%, 61%)">receivers</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">  </span><span class="token key atrule" style="color:hsl(29, 54%, 61%)">otlp</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">    </span><span class="token key atrule" style="color:hsl(29, 54%, 61%)">protocols</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">      </span><span class="token key atrule" style="color:hsl(29, 54%, 61%)">http</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">        </span><span class="token key atrule" style="color:hsl(29, 54%, 61%)">endpoint</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">:</span><span class="token plain"> 0.0.0.0</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">:</span><span class="token number" style="color:hsl(29, 54%, 61%)">4318</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain"></span><span class="token key atrule" style="color:hsl(29, 54%, 61%)">exporters</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">  </span><span class="token key atrule" style="color:hsl(29, 54%, 61%)">otlphttp/mlflow</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">    </span><span class="token key atrule" style="color:hsl(29, 54%, 61%)">endpoint</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">:</span><span class="token plain"> http</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">:</span><span class="token plain">//host.docker.internal</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">:</span><span class="token 
number" style="color:hsl(29, 54%, 61%)">5000</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">    </span><span class="token key atrule" style="color:hsl(29, 54%, 61%)">traces_endpoint</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">:</span><span class="token plain"> /v1/traces</span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">    </span><span class="token key atrule" style="color:hsl(29, 54%, 61%)">tls</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">      </span><span class="token key atrule" style="color:hsl(29, 54%, 61%)">insecure</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">:</span><span class="token plain"> </span><span class="token boolean important" style="color:hsl(220, 14%, 71%);font-weight:bold">true</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">    </span><span class="token key atrule" style="color:hsl(29, 54%, 61%)">headers</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">      </span><span class="token key atrule" style="color:hsl(29, 54%, 61%)">x-mlflow-experiment-id</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">:</span><span class="token plain"> </span><span class="token string" style="color:hsl(95, 38%, 62%)">"1"</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain"></span><span class="token key atrule" style="color:hsl(29, 54%, 61%)">processors</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">  </span><span class="token key atrule" style="color:hsl(29, 54%, 61%)">batch</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">    </span><span class="token key atrule" style="color:hsl(29, 54%, 61%)">timeout</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">:</span><span class="token plain"> 5s</span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain"></span><span class="token key atrule" style="color:hsl(29, 54%, 61%)">service</span><span class="token punctuation" 
style="color:hsl(220, 14%, 71%)">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">  </span><span class="token key atrule" style="color:hsl(29, 54%, 61%)">pipelines</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">    </span><span class="token key atrule" style="color:hsl(29, 54%, 61%)">traces</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">      </span><span class="token key atrule" style="color:hsl(29, 54%, 61%)">receivers</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">:</span><span class="token plain"> </span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">[</span><span class="token plain">otlp</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">]</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">      </span><span class="token key atrule" style="color:hsl(29, 54%, 61%)">processors</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">:</span><span class="token plain"> </span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">[</span><span class="token plain">batch</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">]</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">      </span><span class="token key atrule" style="color:hsl(29, 54%, 61%)">exporters</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">:</span><span class="token plain"> </span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">[</span><span class="token plain">otlphttp/mlflow</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">]</span><br></span></code></pre></div></div>
<p>Key configuration points:</p>
<ul>
<li class=""><strong><code>receivers.otlp</code></strong>: Accepts OTLP data on port 4318 (HTTP)</li>
<li class=""><strong><code>exporters.otlphttp/mlflow</code></strong>: Forwards traces to MLflow's <code>/v1/traces</code> OTLP endpoint</li>
<li class=""><strong><code>x-mlflow-experiment-id</code></strong>: Determines which MLflow experiment receives the traces</li>
<li class=""><strong><code>host.docker.internal</code></strong>: Allows the containerized collector to reach the host machine's MLflow server</li>
</ul>
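<p>Before starting the collector, you can sanity-check the file — the collector binary ships a <code>validate</code> subcommand (same image as the run command in the next step):</p>
<pre><code class="language-bash">docker run --rm \
  -v $(pwd)/otel-collector-config.yaml:/etc/otelcol-contrib/config.yaml \
  otel/opentelemetry-collector-contrib:0.143.1 \
  validate --config /etc/otelcol-contrib/config.yaml
</code></pre>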
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="step-3-run-the-otel-collector">Step 3: Run the OTel Collector<a href="https://llamastack.github.io/blog/mlflow-observability#step-3-run-the-otel-collector" class="hash-link" aria-label="Direct link to Step 3: Run the OTel Collector" title="Direct link to Step 3: Run the OTel Collector" translate="no">​</a></h3>
<div class="language-bash codeBlockContainer_Ckt0 theme-code-block" style="--prism-background-color:hsl(220, 13%, 18%);--prism-color:hsl(220, 14%, 71%)"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-bash codeBlock_bY9V thin-scrollbar" style="background-color:hsl(220, 13%, 18%);color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token function" style="color:hsl(207, 82%, 66%)">docker</span><span class="token plain"> run </span><span class="token parameter variable" style="color:hsl(207, 82%, 66%)">-d</span><span class="token plain"> </span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">\</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">  </span><span class="token parameter variable" style="color:hsl(207, 82%, 66%)">--name</span><span class="token plain"> otel-collector </span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">\</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">  </span><span class="token parameter variable" style="color:hsl(207, 82%, 66%)">-p</span><span class="token plain"> </span><span class="token number" style="color:hsl(29, 54%, 61%)">4318</span><span class="token plain">:4318 </span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">\</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">  </span><span class="token parameter variable" style="color:hsl(207, 82%, 66%)">-v</span><span class="token plain"> </span><span class="token variable" style="color:hsl(207, 82%, 66%)">$(</span><span class="token variable builtin class-name" style="color:hsl(29, 54%, 61%)">pwd</span><span class="token variable" style="color:hsl(207, 82%, 66%)">)</span><span class="token plain">/otel-collector-config.yaml:/etc/otelcol-contrib/config.yaml </span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">\</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">  otel/opentelemetry-collector-contrib:0.143.1</span><br></span></code></pre></div></div>
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="step-4-run-the-application-with-otel-instrumentation">Step 4: Run the Application with OTel Instrumentation<a href="https://llamastack.github.io/blog/mlflow-observability#step-4-run-the-application-with-otel-instrumentation" class="hash-link" aria-label="Direct link to Step 4: Run the Application with OTel Instrumentation" title="Direct link to Step 4: Run the Application with OTel Instrumentation" translate="no">​</a></h3>
<p>The key difference here: we use <code>opentelemetry-instrument</code> to wrap the application, and we <strong>disable</strong> MLflow's built-in tracing to avoid double-writing:</p>
<div class="language-bash codeBlockContainer_Ckt0 theme-code-block" style="--prism-background-color:hsl(220, 13%, 18%);--prism-color:hsl(220, 14%, 71%)"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-bash codeBlock_bY9V thin-scrollbar" style="background-color:hsl(220, 13%, 18%);color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token assign-left variable" style="color:hsl(207, 82%, 66%)">MLFLOW_ENABLE_TRACING</span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token number" style="color:hsl(29, 54%, 61%)">0</span><span class="token plain"> </span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">\</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain"></span><span class="token assign-left variable" style="color:hsl(207, 82%, 66%)">OTEL_EXPORTER_OTLP_ENDPOINT</span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token plain">http://localhost:4318 </span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">\</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain"></span><span class="token assign-left variable" style="color:hsl(207, 82%, 66%)">OTEL_EXPORTER_OTLP_PROTOCOL</span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token plain">http/protobuf </span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">\</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain"></span><span class="token assign-left variable" style="color:hsl(207, 82%, 66%)">OTEL_SERVICE_NAME</span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token plain">llamastack-app </span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">\</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">opentelemetry-instrument python your_app.py</span><br></span></code></pre></div></div>
<p>Note that the application code itself does <strong>not</strong> need any MLflow imports for tracing. The OpenTelemetry auto-instrumentation handles span creation, and the collector handles routing.</p>
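<p>For reference, a minimal <code>your_app.py</code> under this approach can be plain OpenAI-client code — shown here with Chat Completions, since that's what the OTel OpenAI instrumentation currently covers (see the Responses API note below):</p>
<pre><code class="language-python"># your_app.py — no tracing imports; spans come from auto-instrumentation.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8321/v1", api_key="fake")

resp = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Give a one-sentence description of LlamaStack."}],
)
print(resp.choices[0].message.content)
</code></pre>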
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="step-5-verify-traces">Step 5: Verify Traces<a href="https://llamastack.github.io/blog/mlflow-observability#step-5-verify-traces" class="hash-link" aria-label="Direct link to Step 5: Verify Traces" title="Direct link to Step 5: Verify Traces" translate="no">​</a></h3>
<p>Check that traces are flowing into MLflow:</p>
<div class="language-bash codeBlockContainer_Ckt0 theme-code-block" style="--prism-background-color:hsl(220, 13%, 18%);--prism-color:hsl(220, 14%, 71%)"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-bash codeBlock_bY9V thin-scrollbar" style="background-color:hsl(220, 13%, 18%);color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token comment" style="color:hsl(220, 10%, 40%)"># Via CLI</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain"></span><span class="token assign-left variable" style="color:hsl(207, 82%, 66%)">MLFLOW_TRACKING_URI</span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token plain">http://localhost:5000 </span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">\</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">mlflow traces search --experiment-id </span><span class="token number" style="color:hsl(29, 54%, 61%)">1</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain"></span><span class="token comment" style="color:hsl(220, 10%, 40%)"># Or check the database directly</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">sqlite3 mlflow.db </span><span class="token string" style="color:hsl(95, 38%, 62%)">"SELECT count(*) FROM trace_info;"</span><br></span></code></pre></div></div>
<p>Then open the MLflow UI at <a href="http://localhost:5000/" target="_blank" rel="noopener noreferrer" class="">http://localhost:5000</a> to explore the traces visually.</p>
<p><img decoding="async" loading="lazy" alt="MLflow distributed traces for LlamaStack via OTEL" src="https://llamastack.github.io/assets/images/otel-mlflow-4a02b67705a534662a67b84beac56fdc.png" width="3006" height="1794" class="img_ev3q"></p>
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="pros-and-cons-1">Pros and Cons<a href="https://llamastack.github.io/blog/mlflow-observability#pros-and-cons-1" class="hash-link" aria-label="Direct link to Pros and Cons" title="Direct link to Pros and Cons" translate="no">​</a></h3>
<table><thead><tr><th>Aspect</th><th>Details</th></tr></thead><tbody><tr><td><strong>Decoupling</strong></td><td>App has no MLflow SDK dependency — only standard OTel</td></tr><tr><td><strong>Fan-out</strong></td><td>Collector can export to multiple backends simultaneously (e.g., Jaeger + MLflow)</td></tr><tr><td><strong>Production-ready</strong></td><td>OTel Collector provides buffering, retry, and batching</td></tr><tr><td><strong>Complexity</strong></td><td>Requires running and configuring an additional service (the collector)</td></tr><tr><td><strong>Data richness</strong></td><td>OTel OpenAI instrumentation may capture different fields than MLflow autolog</td></tr></tbody></table>
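<p>The fan-out row is worth a concrete illustration. Here is a hedged sketch of the same pipeline exporting to both MLflow and a Jaeger instance — the <code>otlp/jaeger</code> exporter name and the <code>jaeger:4317</code> address are assumptions for this example:</p>
<pre><code class="language-yaml">exporters:
  otlphttp/mlflow:
    traces_endpoint: http://host.docker.internal:5000/v1/traces
  otlp/jaeger:                 # hypothetical second backend
    endpoint: jaeger:4317      # OTLP gRPC port on a host named "jaeger"
    tls:
      insecure: true

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlphttp/mlflow, otlp/jaeger]
</code></pre>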
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="responses-api-support">Responses API Support<a href="https://llamastack.github.io/blog/mlflow-observability#responses-api-support" class="hash-link" aria-label="Direct link to Responses API Support" title="Direct link to Responses API Support" translate="no">​</a></h3>
<p><strong>Note:</strong> OpenTelemetry auto-instrumentation for the Responses API is not yet available upstream. Progress is tracked in <a href="https://github.com/llamastack/llama-stack/issues/5192" target="_blank" rel="noopener noreferrer" class="">llamastack/llama-stack#5192</a>. In the meantime, Approach 1 (MLflow SDK) fully supports tracing Responses API calls via <code>mlflow.openai.autolog()</code>.</p>
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="approach-comparison">Approach Comparison<a href="https://llamastack.github.io/blog/mlflow-observability#approach-comparison" class="hash-link" aria-label="Direct link to Approach Comparison" title="Direct link to Approach Comparison" translate="no">​</a></h2>
<!-- -->
<!-- -->
<table><thead><tr><th>Criteria</th><th>MLflow SDK</th><th>OTel Collector</th></tr></thead><tbody><tr><td><strong>Setup complexity</strong></td><td>Low — a few lines of code</td><td>Medium — collector config + container</td></tr><tr><td><strong>Code coupling</strong></td><td>Coupled to <code>mlflow</code> package</td><td>No MLflow dependency in app code</td></tr><tr><td><strong>Multi-backend support</strong></td><td>MLflow only</td><td>Fan-out to any OTLP-compatible backend</td></tr><tr><td><strong>Buffering &amp; retry</strong></td><td>Basic (in-process)</td><td>Production-grade (collector handles it)</td></tr><tr><td><strong>Best for</strong></td><td>Development, prototyping, quick experiments</td><td>Production deployments, multi-tool observability stacks</td></tr><tr><td><strong>Instrumentation</strong></td><td><code>mlflow.openai.autolog()</code></td><td><code>opentelemetry-instrument</code> + OTel OpenAI plugin</td></tr></tbody></table>
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="bonus-tracing-the-llamastack-server-itself">Bonus: Tracing the LlamaStack Server Itself<a href="https://llamastack.github.io/blog/mlflow-observability#bonus-tracing-the-llamastack-server-itself" class="hash-link" aria-label="Direct link to Bonus: Tracing the LlamaStack Server Itself" title="Direct link to Bonus: Tracing the LlamaStack Server Itself" translate="no">​</a></h2>
<p>Both approaches above trace the <strong>client side</strong> — the application making calls to LlamaStack. But you can also trace the <strong>LlamaStack server</strong> using the OTel approach:</p>
<div class="language-bash codeBlockContainer_Ckt0 theme-code-block" style="--prism-background-color:hsl(220, 13%, 18%);--prism-color:hsl(220, 14%, 71%)"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-bash codeBlock_bY9V thin-scrollbar" style="background-color:hsl(220, 13%, 18%);color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token assign-left variable" style="color:hsl(207, 82%, 66%)">OTEL_EXPORTER_OTLP_TRACES_ENDPOINT</span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token plain">http://localhost:5000/v1/traces </span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">\</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain"></span><span class="token assign-left variable" style="color:hsl(207, 82%, 66%)">OTEL_EXPORTER_OTLP_TRACES_PROTOCOL</span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token plain">http/protobuf </span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">\</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain"></span><span class="token assign-left variable" style="color:hsl(207, 82%, 66%)">OTEL_EXPORTER_OTLP_TRACES_HEADERS</span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token string" style="color:hsl(95, 38%, 62%)">"x-mlflow-experiment-id=1"</span><span class="token plain"> </span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">\</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain"></span><span class="token assign-left variable" style="color:hsl(207, 82%, 66%)">OTEL_SERVICE_NAME</span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token plain">llama-stack-server </span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">\</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">opentelemetry-instrument llama stack run starter</span><br></span></code></pre></div></div>
<p>This gives you end-to-end visibility: client-side spans showing the request lifecycle, and server-side spans showing internal LlamaStack processing.</p>
<blockquote>
<p><strong>Important:</strong> MLflow SDK tracing only instruments the <strong>client side</strong>. The LlamaStack server itself is not instrumented by MLflow, so server-side spans (inference routing, tool execution, etc.) are only visible through OpenTelemetry auto-instrumentation (Approach 2).</p>
</blockquote>
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="common-gotchas">Common Gotchas<a href="https://llamastack.github.io/blog/mlflow-observability#common-gotchas" class="hash-link" aria-label="Direct link to Common Gotchas" title="Direct link to Common Gotchas" translate="no">​</a></h2>
<ol>
<li>
<p><strong>MLflow OTLP endpoint path</strong>: Use <code>/v1/traces</code>, not <code>/api/2.0/otlp/v1/traces</code> (the latter returns 404 in MLflow 3.10+).</p>
</li>
<li>
<p><strong>Double-writing</strong>: If you enable both MLflow autolog and OTel instrumentation, traces may be written twice. Set <code>MLFLOW_ENABLE_TRACING=0</code> when using the OTel path.</p>
</li>
<li>
<p><strong>Missing spans with OTel</strong>: The <code>opentelemetry-instrument</code> wrapper is required — simply setting <code>OTEL_*</code> environment variables without it won't produce any spans because no instrumentation is active. (A programmatic alternative is sketched after this list.)</p>
</li>
<li>
<p><strong>Docker networking</strong>: When running the OTel Collector in Docker, use <code>host.docker.internal</code> to reach services on the host machine.</p>
</li>
<li>
<p><strong>Time range in UI</strong>: If the MLflow UI looks empty, check the time range filter — it may default to a narrow window that excludes your traces.</p>
</li>
</ol>
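<p>On gotcha 3: if you'd rather not use the wrapper, the instrumentation can be activated programmatically at startup — a sketch assuming the <code>opentelemetry-instrumentation-openai-v2</code> package:</p>
<pre><code class="language-python"># Programmatic alternative to the opentelemetry-instrument wrapper.
# You still need an SDK pipeline (TracerProvider + OTLP exporter), e.g. as in
# the direct-export sketch earlier in this post.
from opentelemetry.instrumentation.openai_v2 import OpenAIInstrumentor

OpenAIInstrumentor().instrument()  # call once, before creating OpenAI clients
</code></pre>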
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="conclusion">Conclusion<a href="https://llamastack.github.io/blog/mlflow-observability#conclusion" class="hash-link" aria-label="Direct link to Conclusion" title="Direct link to Conclusion" translate="no">​</a></h2>
<p>Both approaches get your LlamaStack traces into MLflow, but they serve different needs:</p>
<ul>
<li class=""><strong>Start with the MLflow SDK</strong> when you want quick, low-friction observability during development. A few lines of code and you're tracing.</li>
<li class=""><strong>Move to the OTel Collector</strong> when you need production-grade telemetry infrastructure — decoupled from your application, with the ability to fan out to multiple observability backends.</li>
</ul>
<p>The good news: since LlamaStack exposes an OpenAI-compatible API, both paths leverage existing, well-maintained instrumentation libraries. You're not writing custom tracing code — you're plugging into an ecosystem.</p>
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="quick-start-with-containers">Quick Start With Containers<a href="https://llamastack.github.io/blog/mlflow-observability#quick-start-with-containers" class="hash-link" aria-label="Direct link to Quick Start With Containers" title="Direct link to Quick Start With Containers" translate="no">​</a></h2>
<p>A pending PR, <a href="https://github.com/llamastack/llama-stack/pull/5409" target="_blank" rel="noopener noreferrer" class="">feat: add MLflow support for LlamaStack</a>, will let you run LlamaStack alongside MLflow, Grafana, and Prometheus in containers with a single command. Once that PR lands, use the <a href="https://github.com/llamastack/llama-stack/tree/main/scripts/telemetry" target="_blank" rel="noopener noreferrer" class="">telemetry scripts</a> in the LlamaStack repository for the full walkthrough.</p>
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="references">References<a href="https://llamastack.github.io/blog/mlflow-observability#references" class="hash-link" aria-label="Direct link to References" title="Direct link to References" translate="no">​</a></h2>
<ul>
<li class=""><a href="https://mlflow.org/docs/latest/llms/tracing/index.html" target="_blank" rel="noopener noreferrer" class="">MLflow Tracing Documentation</a></li>
<li class=""><a href="https://opentelemetry.io/docs/collector/" target="_blank" rel="noopener noreferrer" class="">OpenTelemetry Collector Documentation</a></li>
<li class=""><a href="https://github.com/open-telemetry/opentelemetry-python-contrib" target="_blank" rel="noopener noreferrer" class="">OpenTelemetry OpenAI Instrumentation</a></li>
</ul>]]></content>
        <author>
            <name>Guangya Liu</name>
            <uri>https://github.com/gyliu513</uri>
        </author>
        <category label="mlflow" term="mlflow"/>
        <category label="observability" term="observability"/>
        <category label="opentelemetry" term="opentelemetry"/>
        <category label="metrics" term="metrics"/>
        <category label="tracing" term="tracing"/>
        <category label="monitoring" term="monitoring"/>
    </entry>
    <entry>
        <title type="html"><![CDATA[Llama Stack Observability: Metrics, Traces, and Dashboards with OpenTelemetry]]></title>
        <id>https://llamastack.github.io/blog/observability-for-llama-stack</id>
        <link href="https://llamastack.github.io/blog/observability-for-llama-stack"/>
        <updated>2026-03-30T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[Running an LLM application in production is nothing like running a traditional web service. Responses are non-deterministic. Latency swings wildly with model size and token count. And failures are often silent — a tool call that returns garbage still comes back as a 200 OK. You can stare at your HTTP dashboard all day and have no idea that half your users are getting bad answers.]]></summary>
        <content type="html"><![CDATA[<p>Running an LLM application in production is nothing like running a traditional web service. Responses are non-deterministic. Latency swings wildly with model size and token count. And failures are often silent — a tool call that returns garbage still comes back as a 200 OK. You can stare at your HTTP dashboard all day and have no idea that half your users are getting bad answers.</p>
<p>We recently shipped built-in observability for Llama Stack, powered by <a href="https://opentelemetry.io/" target="_blank" rel="noopener noreferrer" class="">OpenTelemetry</a>. Three environment variables, zero code changes, and you get metrics and traces from every layer — HTTP requests, inference calls, tool invocations, vector store operations, all the way down.</p>
<p>This post explains the architecture behind it, walks through a hands-on tutorial, and shows what you can actually see once it's running.</p>
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="why-observability-matters-for-llm-applications">Why Observability Matters for LLM Applications<a href="https://llamastack.github.io/blog/observability-for-llama-stack#why-observability-matters-for-llm-applications" class="hash-link" aria-label="Direct link to Why Observability Matters for LLM Applications" title="Direct link to Why Observability Matters for LLM Applications" translate="no">​</a></h2>
<p>If you've operated traditional services, you know the drill: uptime checks, error rates, latency percentiles. LLM applications need all of that, plus a whole category of signals that don't exist in conventional backends.</p>
<p><strong>Latency is multi-dimensional.</strong> A single <code>/v1/responses</code> call might fan out to an inference provider, three tool calls, and two vector store queries. Knowing the overall P99 is 4 seconds doesn't help you — you need to know which leg is slow.</p>
<p><strong>Token economics drive cost.</strong> Without tracking tokens-per-second and usage patterns across models and providers, capacity planning is guesswork.</p>
<p><strong>Time-to-first-token (TTFT) defines user experience.</strong> A streaming response with a 5-second TTFT feels broken to the user, even if total latency is fine.</p>
<p><strong>Silent failures are common.</strong> A tool invocation that times out, a vector search that returns zero results, a safety shield that blocks unexpectedly — none of these produce HTTP errors, but all degrade quality. You won't find them in your access logs.</p>
<p><strong>Provider comparison requires data.</strong> When you run multiple inference backends (vLLM, Ollama, OpenAI), you need apples-to-apples latency and reliability numbers, not vibes.</p>
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="how-we-instrumented-llama-stack">How We Instrumented Llama Stack<a href="https://llamastack.github.io/blog/observability-for-llama-stack#how-we-instrumented-llama-stack" class="hash-link" aria-label="Direct link to How We Instrumented Llama Stack" title="Direct link to How We Instrumented Llama Stack" translate="no">​</a></h2>
<p>We chose <a href="https://opentelemetry.io/" target="_blank" rel="noopener noreferrer" class="">OpenTelemetry</a> (OTel) — the CNCF's vendor-neutral standard for metrics, traces, and logs. The practical upside: you export to Prometheus, Grafana, Jaeger, Datadog, or any OTLP-compatible backend, and you can switch without touching application code.</p>
<p>The instrumentation has two layers that work together. Both feed into the OpenTelemetry SDK, which batches and exports signals to the Collector via OTLP.</p>
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="auto-instrumentation-the-infrastructure-view">Auto instrumentation: the infrastructure view<a href="https://llamastack.github.io/blog/observability-for-llama-stack#auto-instrumentation-the-infrastructure-view" class="hash-link" aria-label="Direct link to Auto instrumentation: the infrastructure view" title="Direct link to Auto instrumentation: the infrastructure view" translate="no">​</a></h3>
<p>Launch Llama Stack with the <code>opentelemetry-instrument</code> CLI wrapper and you get — with zero code changes:</p>
<ul>
<li class="">Inbound HTTP spans and metrics from FastAPI (every API request)</li>
<li class="">Outbound HTTP spans and metrics from httpx (calls to inference providers)</li>
<li class="">Database query spans from SQLAlchemy and asyncpg</li>
<li class="">GenAI call spans from OTel ecosystem packages for OpenAI, Bedrock, Vertex AI, etc. — model name, token counts, finish reasons, all captured at the SDK level</li>
</ul>
<p>This covers the "infrastructure view": request flow, provider latency, GenAI call details, database performance.</p>
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="manual-instrumentation-the-application-view">Manual instrumentation: the application view<a href="https://llamastack.github.io/blog/observability-for-llama-stack#manual-instrumentation-the-application-view" class="hash-link" aria-label="Direct link to Manual instrumentation: the application view" title="Direct link to Manual instrumentation: the application view" translate="no">​</a></h3>
<p>Auto instrumentation doesn't know about Llama Stack's domain concepts. So we added manual instrumentation directly in the routers and middleware to capture:</p>
<ul>
<li class=""><strong>API request metrics</strong> — total count, duration histogram, concurrent request gauge</li>
<li class=""><strong>Inference metrics</strong> — end-to-end duration, TTFT, tokens-per-second</li>
<li class=""><strong>Vector IO metrics</strong> — insert, query, and delete counts with duration</li>
<li class=""><strong>Tool runtime metrics</strong> — invocation count and duration by tool name and status</li>
<li class=""><strong>Safety spans</strong> — shield evaluation traces with attribute context</li>
</ul>
<p>The two layers are complementary. Auto instrumentation tells you what's happening at the network and SDK level; manual instrumentation tells you what's happening at the application level. As a user, you don't need to care about any of this — just launch with <code>opentelemetry-instrument</code> and everything lights up.</p>
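<p>To make the pattern concrete, here is a simplified sketch of what the manual layer looks like using the OpenTelemetry metrics API. This is illustrative rather than the actual Llama Stack source; the metric and attribute names are assumptions modeled on the list above:</p>
<pre class="prism-code language-python"><code>import time

from opentelemetry import metrics

# Instruments are created once at startup and reused for every request.
meter = metrics.get_meter("llama_stack")
tool_invocations = meter.create_counter(
    "llama_stack.tool_runtime.invocations",
    description="Tool invocations by name and status",
)
tool_duration = meter.create_histogram(
    "llama_stack.tool_runtime.duration",
    unit="s",
    description="Tool invocation duration",
)

def invoke_tool(tool, **kwargs):
    """Run a tool and record count and duration with status attributes."""
    start = time.perf_counter()
    status = "success"
    try:
        return tool(**kwargs)
    except Exception:
        status = "error"
        raise
    finally:
        attrs = {"tool_name": getattr(tool, "__name__", "unknown"), "status": status}
        tool_invocations.add(1, attrs)
        tool_duration.record(time.perf_counter() - start, attrs)
</code></pre>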
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="the-opentelemetry-collector">The OpenTelemetry Collector<a href="https://llamastack.github.io/blog/observability-for-llama-stack#the-opentelemetry-collector" class="hash-link" aria-label="Direct link to The OpenTelemetry Collector" title="Direct link to The OpenTelemetry Collector" translate="no">​</a></h2>
<p>Between Llama Stack and your observability backends sits the <a href="https://opentelemetry.io/docs/collector/" target="_blank" rel="noopener noreferrer" class="">OpenTelemetry Collector</a>. It receives OTLP data, processes it, and fans out to one or more destinations.</p>
<p>The pipeline has three stages:</p>
<p><strong>Receivers</strong> define how data enters. Llama Stack pushes to the OTLP receiver on port 4318 (HTTP) or 4317 (gRPC). You can run additional receivers in parallel — for example, a Prometheus scrape receiver for other services in your infrastructure.</p>
<p><strong>Processors</strong> transform data in flight. The ones that matter for production: <code>batch</code> (groups telemetry for efficient network transfer), <code>memory_limiter</code> (drops data under memory pressure instead of OOM-ing), <code>attributes</code> (injects labels like <code>environment=production</code>), and <code>filter</code> (drops noise like health check spans). They run in the order you define them — a typical chain is <code>memory_limiter → batch → attributes</code>.</p>
<p><strong>Exporters</strong> send data to backends. <code>prometheusremotewrite</code> for Prometheus-compatible stores, <code>otlp</code> for Jaeger/Tempo/Datadog, <code>debug</code> for stdout during development. A single Collector can export metrics to Prometheus for dashboarding AND to Datadog for alerting simultaneously.</p>
<p>The key benefit: Llama Stack only speaks OTLP. The Collector handles format conversion, retries, and routing. Swap backends without changing a line of application code.</p>
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="end-to-end-data-flow">End-to-End Data Flow<a href="https://llamastack.github.io/blog/observability-for-llama-stack#end-to-end-data-flow" class="hash-link" aria-label="Direct link to End-to-End Data Flow" title="Direct link to End-to-End Data Flow" translate="no">​</a></h2>
<p>Here's what happens when a request comes in: the instrumentation layers record spans and metrics, the OpenTelemetry SDK buffers and batches them, and the Collector receives the OTLP export and fans it out to your backends.</p>
<p>Two things worth noting. First, recording is non-blocking — metrics write to an in-memory buffer, so they add negligible latency to the request path. Second, export is batched — the SDK flushes every 60 seconds by default, which means dashboards have up to a minute of delay, but request handling is never blocked by network I/O to the Collector.</p>
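<p>The same batching behavior can be reproduced if you wire up the SDK by hand instead of using <code>opentelemetry-instrument</code>. A hedged sketch, assuming a local Collector on port 4318 (the interval mirrors the 60-second default described above):</p>
<pre class="prism-code language-python"><code>from opentelemetry import metrics
from opentelemetry.exporter.otlp.proto.http.metric_exporter import OTLPMetricExporter
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader

# Metrics accumulate in memory and are flushed on a timer, so recording
# a data point never blocks the request path on network I/O.
reader = PeriodicExportingMetricReader(
    OTLPMetricExporter(endpoint="http://localhost:4318/v1/metrics"),
    export_interval_millis=60_000,  # same knob as OTEL_METRIC_EXPORT_INTERVAL
)
metrics.set_meter_provider(MeterProvider(metric_readers=[reader]))
</code></pre>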
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="hands-on-tutorial">Hands-On Tutorial<a href="https://llamastack.github.io/blog/observability-for-llama-stack#hands-on-tutorial" class="hash-link" aria-label="Direct link to Hands-On Tutorial" title="Direct link to Hands-On Tutorial" translate="no">​</a></h2>
<p>Let's set everything up. By the end you'll have distributed tracing in Jaeger, metrics in Prometheus, and pre-built dashboards in Grafana.</p>
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="step-0-prerequisites">Step 0: Prerequisites<a href="https://llamastack.github.io/blog/observability-for-llama-stack#step-0-prerequisites" class="hash-link" aria-label="Direct link to Step 0: Prerequisites" title="Direct link to Step 0: Prerequisites" translate="no">​</a></h3>
<p>You'll need:</p>
<ul>
<li class=""><strong>Docker</strong> or <strong>Podman</strong> for running the observability stack</li>
<li class="">A working <strong>Llama Stack</strong> installation with <code>uv</code></li>
</ul>
<p>Clone the repo if you haven't:</p>
<div class="language-bash codeBlockContainer_Ckt0 theme-code-block" style="--prism-background-color:hsl(220, 13%, 18%);--prism-color:hsl(220, 14%, 71%)"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-bash codeBlock_bY9V thin-scrollbar" style="background-color:hsl(220, 13%, 18%);color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token function" style="color:hsl(207, 82%, 66%)">git</span><span class="token plain"> clone https://github.com/llamastack/llama-stack.git</span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain"></span><span class="token builtin class-name" style="color:hsl(29, 54%, 61%)">cd</span><span class="token plain"> llama-stack</span><br></span></code></pre></div></div>
<p>The telemetry configs live in <a href="https://github.com/llamastack/llama-stack/tree/main/scripts/telemetry" target="_blank" rel="noopener noreferrer" class=""><code>scripts/telemetry/</code></a>:</p>
<table><thead><tr><th>File</th><th>What it does</th></tr></thead><tbody><tr><td><code>setup_telemetry.sh</code></td><td>Starts all telemetry services</td></tr><tr><td><code>otel-collector-config.yaml</code></td><td>Collector pipeline config</td></tr><tr><td><code>prometheus.yml</code></td><td>Prometheus scrape config</td></tr><tr><td><code>grafana-datasources.yaml</code></td><td>Grafana datasource provisioning</td></tr><tr><td><code>grafana-dashboards.yaml</code></td><td>Grafana dashboard provisioning</td></tr><tr><td><code>llama-stack-dashboard.json</code></td><td>Pre-built Grafana dashboard</td></tr></tbody></table>
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="step-1-deploy-the-observability-stack">Step 1: Deploy the Observability Stack<a href="https://llamastack.github.io/blog/observability-for-llama-stack#step-1-deploy-the-observability-stack" class="hash-link" aria-label="Direct link to Step 1: Deploy the Observability Stack" title="Direct link to Step 1: Deploy the Observability Stack" translate="no">​</a></h3>
<p>One script brings up Jaeger, the OTel Collector, Prometheus, and Grafana:</p>
<div class="language-bash codeBlockContainer_Ckt0 theme-code-block" style="--prism-background-color:hsl(220, 13%, 18%);--prism-color:hsl(220, 14%, 71%)"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-bash codeBlock_bY9V thin-scrollbar" style="background-color:hsl(220, 13%, 18%);color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token comment" style="color:hsl(220, 10%, 40%)"># Auto-detect container runtime (podman or docker)</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">./scripts/telemetry/setup_telemetry.sh</span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain"></span><span class="token comment" style="color:hsl(220, 10%, 40%)"># Or specify explicitly</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">./scripts/telemetry/setup_telemetry.sh </span><span class="token parameter variable" style="color:hsl(207, 82%, 66%)">--container</span><span class="token plain"> </span><span class="token function" style="color:hsl(207, 82%, 66%)">docker</span><br></span></code></pre></div></div>
<p>This creates a <code>llama-telemetry</code> container network, starts all four services, and provisions Grafana with a pre-built dashboard.</p>
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="step-2-install-opentelemetry-packages">Step 2: Install OpenTelemetry Packages<a href="https://llamastack.github.io/blog/observability-for-llama-stack#step-2-install-opentelemetry-packages" class="hash-link" aria-label="Direct link to Step 2: Install OpenTelemetry Packages" title="Direct link to Step 2: Install OpenTelemetry Packages" translate="no">​</a></h3>
<div class="language-bash codeBlockContainer_Ckt0 theme-code-block" style="--prism-background-color:hsl(220, 13%, 18%);--prism-color:hsl(220, 14%, 71%)"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-bash codeBlock_bY9V thin-scrollbar" style="background-color:hsl(220, 13%, 18%);color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">uv pip </span><span class="token function" style="color:hsl(207, 82%, 66%)">install</span><span class="token plain"> opentelemetry-distro opentelemetry-exporter-otlp</span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">uv run opentelemetry-bootstrap </span><span class="token parameter variable" style="color:hsl(207, 82%, 66%)">-a</span><span class="token plain"> requirements </span><span class="token operator" style="color:hsl(207, 82%, 66%)">|</span><span class="token plain"> uv pip </span><span class="token function" style="color:hsl(207, 82%, 66%)">install</span><span class="token plain"> </span><span class="token parameter variable" style="color:hsl(207, 82%, 66%)">--requirement</span><span class="token plain"> -</span><br></span></code></pre></div></div>
<p><code>opentelemetry-bootstrap</code> detects your installed libraries (FastAPI, httpx, SQLAlchemy, OpenAI SDK, etc.) and installs the matching instrumentation packages automatically.</p>
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="step-3-launch-the-server">Step 3: Launch the Server<a href="https://llamastack.github.io/blog/observability-for-llama-stack#step-3-launch-the-server" class="hash-link" aria-label="Direct link to Step 3: Launch the Server" title="Direct link to Step 3: Launch the Server" translate="no">​</a></h3>
<p>Set three environment variables and wrap the launch command with <code>opentelemetry-instrument</code>:</p>
<div class="language-bash codeBlockContainer_Ckt0 theme-code-block" style="--prism-background-color:hsl(220, 13%, 18%);--prism-color:hsl(220, 14%, 71%)"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-bash codeBlock_bY9V thin-scrollbar" style="background-color:hsl(220, 13%, 18%);color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token builtin class-name" style="color:hsl(29, 54%, 61%)">export</span><span class="token plain"> </span><span class="token assign-left variable" style="color:hsl(207, 82%, 66%)">OTEL_EXPORTER_OTLP_ENDPOINT</span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token plain">http://localhost:4318</span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain"></span><span class="token builtin class-name" style="color:hsl(29, 54%, 61%)">export</span><span class="token plain"> </span><span class="token assign-left variable" style="color:hsl(207, 82%, 66%)">OTEL_EXPORTER_OTLP_PROTOCOL</span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token plain">http/protobuf</span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain"></span><span class="token builtin class-name" style="color:hsl(29, 54%, 61%)">export</span><span class="token plain"> </span><span class="token assign-left variable" style="color:hsl(207, 82%, 66%)">OTEL_SERVICE_NAME</span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token plain">llama-stack-server</span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">uv run opentelemetry-instrument llama stack run starter</span><br></span></code></pre></div></div>
<p>That's it. When <code>OTEL_EXPORTER_OTLP_ENDPOINT</code> is set, both auto and manual instrumentation activate. When it's not set, metrics are recorded in memory but never exported — no overhead, no errors.</p>
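<p>The activation logic is nothing magical; conceptually it is just a guard on the endpoint variable. An illustrative sketch of the behavior (not the actual Llama Stack code):</p>
<pre class="prism-code language-python"><code>import os

def telemetry_export_enabled() -> bool:
    """Export only happens when a collector endpoint is configured."""
    return bool(os.environ.get("OTEL_EXPORTER_OTLP_ENDPOINT"))

# Instruments always record into the in-memory SDK; without an endpoint
# the data is simply never shipped anywhere.
if telemetry_export_enabled():
    print("Exporting OTLP to", os.environ["OTEL_EXPORTER_OTLP_ENDPOINT"])
else:
    print("No OTLP endpoint set; metrics stay in memory")
</code></pre>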
<table><thead><tr><th>Variable</th><th>Purpose</th><th>Example</th></tr></thead><tbody><tr><td><code>OTEL_EXPORTER_OTLP_ENDPOINT</code></td><td>Collector endpoint</td><td><code>http://localhost:4318</code></td></tr><tr><td><code>OTEL_EXPORTER_OTLP_PROTOCOL</code></td><td>Transport protocol</td><td><code>http/protobuf</code></td></tr><tr><td><code>OTEL_SERVICE_NAME</code></td><td>Service name in telemetry</td><td><code>llama-stack-server</code></td></tr><tr><td><code>OTEL_METRIC_EXPORT_INTERVAL</code></td><td>Export interval (ms)</td><td><code>60000</code> (default)</td></tr></tbody></table>
<blockquote>
<p><strong>Tip</strong>: If you see duplicate database traces, set <code>OTEL_PYTHON_DISABLED_INSTRUMENTATIONS="sqlite3,asyncpg"</code> to disable overlapping instrumentors.</p>
</blockquote>
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="step-4-launch-your-client">Step 4: Launch Your Client<a href="https://llamastack.github.io/blog/observability-for-llama-stack#step-4-launch-your-client" class="hash-link" aria-label="Direct link to Step 4: Launch Your Client" title="Direct link to Step 4: Launch Your Client" translate="no">​</a></h3>
<p>To get end-to-end distributed tracing, launch your client the same way:</p>
<div class="language-bash codeBlockContainer_Ckt0 theme-code-block" style="--prism-background-color:hsl(220, 13%, 18%);--prism-color:hsl(220, 14%, 71%)"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-bash codeBlock_bY9V thin-scrollbar" style="background-color:hsl(220, 13%, 18%);color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token builtin class-name" style="color:hsl(29, 54%, 61%)">export</span><span class="token plain"> </span><span class="token assign-left variable" style="color:hsl(207, 82%, 66%)">OTEL_EXPORTER_OTLP_ENDPOINT</span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token plain">http://localhost:4318</span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain"></span><span class="token builtin class-name" style="color:hsl(29, 54%, 61%)">export</span><span class="token plain"> </span><span class="token assign-left variable" style="color:hsl(207, 82%, 66%)">OTEL_EXPORTER_OTLP_PROTOCOL</span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token plain">http/protobuf</span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain"></span><span class="token builtin class-name" style="color:hsl(29, 54%, 61%)">export</span><span class="token plain"> </span><span class="token assign-left variable" style="color:hsl(207, 82%, 66%)">OTEL_SERVICE_NAME</span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token plain">my-llama-stack-app</span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">opentelemetry-instrument python my_app.py</span><br></span></code></pre></div></div>
<p>A minimal example:</p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-background-color:hsl(220, 13%, 18%);--prism-color:hsl(220, 14%, 71%)"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="background-color:hsl(220, 13%, 18%);color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token keyword" style="color:hsl(286, 60%, 67%)">from</span><span class="token plain"> openai </span><span class="token keyword" style="color:hsl(286, 60%, 67%)">import</span><span class="token plain"> OpenAI</span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">client </span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token plain"> OpenAI</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">(</span><span class="token plain">api_key</span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token string" style="color:hsl(95, 38%, 62%)">"fake"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"> base_url</span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token string" style="color:hsl(95, 38%, 62%)">"http://localhost:8321/v1/"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">response </span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token plain"> client</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">chat</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">completions</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">create</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">(</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">    model</span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token string" style="color:hsl(95, 38%, 62%)">"openai/gpt-4o-mini"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">    messages</span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">[</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">{</span><span class="token string" style="color:hsl(95, 38%, 62%)">"role"</span><span class="token punctuation" 
style="color:hsl(220, 14%, 71%)">:</span><span class="token plain"> </span><span class="token string" style="color:hsl(95, 38%, 62%)">"user"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"> </span><span class="token string" style="color:hsl(95, 38%, 62%)">"content"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">:</span><span class="token plain"> </span><span class="token string" style="color:hsl(95, 38%, 62%)">"Hello, how are you?"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">}</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">]</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain"></span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain"></span><span class="token keyword" style="color:hsl(286, 60%, 67%)">print</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">(</span><span class="token plain">response</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">choices</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">[</span><span class="token number" style="color:hsl(29, 54%, 61%)">0</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">]</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">message</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">content</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">)</span><br></span></code></pre></div></div>
<p>With <code>opentelemetry-instrument</code>, this client automatically generates GenAI spans (model, token counts, finish reasons) and HTTP spans, all correlated with server-side traces via W3C trace context propagation.</p>
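<p>The correlation relies on the W3C <code>traceparent</code> header, which the instrumented HTTP client injects automatically. If you ever need to propagate context across a custom transport, the propagation API looks like this (a sketch; with <code>opentelemetry-instrument</code> you do not need to do this yourself):</p>
<pre class="prism-code language-python"><code>from opentelemetry import trace
from opentelemetry.propagate import inject

tracer = trace.get_tracer("my-llama-stack-app")

with tracer.start_as_current_span("call-llama-stack"):
    headers = {}
    inject(headers)  # adds the W3C 'traceparent' header for the active span
    # Attach these headers to your outbound request so the server-side
    # spans join the same trace.
    print(headers)
</code></pre>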
<p>By default, message content (prompts, outputs, tool arguments) is <strong>not</strong> captured for privacy. To enable content capture for debugging:</p>
<div class="language-bash codeBlockContainer_Ckt0 theme-code-block" style="--prism-background-color:hsl(220, 13%, 18%);--prism-color:hsl(220, 14%, 71%)"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-bash codeBlock_bY9V thin-scrollbar" style="background-color:hsl(220, 13%, 18%);color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token assign-left variable" style="color:hsl(207, 82%, 66%)">OTEL_INSTRUMENTATION_GENAI_CAPTURE_MESSAGE_CONTENT</span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token plain">true </span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">\</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">opentelemetry-instrument python my_app.py</span><br></span></code></pre></div></div>
<p>Captured content appears as log events (<code>gen_ai.user.message</code>, <code>gen_ai.choice</code>) correlated with trace spans via <code>trace_id</code>/<code>span_id</code>. The spans themselves carry structured metadata (model, token usage, latency) but not the raw text.</p>
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="step-5-explore-the-data">Step 5: Explore the Data<a href="https://llamastack.github.io/blog/observability-for-llama-stack#step-5-explore-the-data" class="hash-link" aria-label="Direct link to Step 5: Explore the Data" title="Direct link to Step 5: Explore the Data" translate="no">​</a></h3>
<p>Once traffic is flowing:</p>
<table><thead><tr><th>Service</th><th>URL</th><th>Credentials</th></tr></thead><tbody><tr><td><strong>Jaeger</strong> (traces)</td><td><a href="http://localhost:16686/" target="_blank" rel="noopener noreferrer" class="">http://localhost:16686</a></td><td>N/A</td></tr><tr><td><strong>Prometheus</strong> (metrics)</td><td><a href="http://localhost:9090/" target="_blank" rel="noopener noreferrer" class="">http://localhost:9090</a></td><td>N/A</td></tr><tr><td><strong>Grafana</strong> (dashboards)</td><td><a href="http://localhost:3000/" target="_blank" rel="noopener noreferrer" class="">http://localhost:3000</a></td><td>admin / admin</td></tr></tbody></table>
<h4 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="jaeger-distributed-traces">Jaeger: Distributed Traces<a href="https://llamastack.github.io/blog/observability-for-llama-stack#jaeger-distributed-traces" class="hash-link" aria-label="Direct link to Jaeger: Distributed Traces" title="Direct link to Jaeger: Distributed Traces" translate="no">​</a></h4>
<p>Select the <code>llama-stack-server</code> or <code>my-llama-stack-app</code> service to see request traces. Each trace shows the full request lifecycle — client HTTP call → FastAPI handler → inference provider call → database operations. You can pinpoint exactly where time is spent.</p>
<p><img decoding="async" loading="lazy" alt="Jaeger distributed traces for Llama Stack" src="https://llamastack.github.io/assets/images/jaeger-d1b593c5b0d53a3539eb1d97bbd454e9.png" width="3724" height="2180" class="img_ev3q"></p>
<h4 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="prometheus-metrics-queries">Prometheus: Metrics Queries<a href="https://llamastack.github.io/blog/observability-for-llama-stack#prometheus-metrics-queries" class="hash-link" aria-label="Direct link to Prometheus: Metrics Queries" title="Direct link to Prometheus: Metrics Queries" translate="no">​</a></h4>
<p>Some useful PromQL to get you started:</p>
<table><thead><tr><th>What you want to know</th><th>PromQL</th></tr></thead><tbody><tr><td>Input token usage by model</td><td><code>sum by(gen_ai_request_model) (llama_stack_gen_ai_client_token_usage_sum{gen_ai_token_type="input"})</code></td></tr><tr><td>Output token usage by model</td><td><code>sum by(gen_ai_request_model) (llama_stack_gen_ai_client_token_usage_sum{gen_ai_token_type="output"})</code></td></tr><tr><td>P95 HTTP server latency</td><td><code>histogram_quantile(0.95, rate(llama_stack_http_server_duration_milliseconds_bucket[5m]))</code></td></tr><tr><td>P99 inference duration</td><td><code>histogram_quantile(0.99, rate(llama_stack_inference_duration_seconds_bucket[5m]))</code></td></tr><tr><td>P95 TTFT by model</td><td><code>histogram_quantile(0.95, rate(llama_stack_inference_time_to_first_token_seconds_bucket[5m]))</code></td></tr><tr><td>Median tokens/sec by provider</td><td><code>histogram_quantile(0.5, rate(llama_stack_inference_tokens_per_second_bucket[5m]))</code></td></tr><tr><td>Tool invocation errors</td><td><code>rate(llama_stack_tool_runtime_invocations_total{status="error"}[5m])</code></td></tr></tbody></table>
<p><img decoding="async" loading="lazy" alt="Prometheus metrics for Llama Stack" src="https://llamastack.github.io/assets/images/prometheus-341a2f2c004dc89633a5df3543eef620.png" width="2990" height="1652" class="img_ev3q"></p>
<h4 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="grafana-pre-built-dashboard">Grafana: Pre-built Dashboard<a href="https://llamastack.github.io/blog/observability-for-llama-stack#grafana-pre-built-dashboard" class="hash-link" aria-label="Direct link to Grafana: Pre-built Dashboard" title="Direct link to Grafana: Pre-built Dashboard" translate="no">​</a></h4>
<p>A <strong>Llama Stack</strong> dashboard is automatically provisioned with panels for prompt tokens, completion tokens, P95/P99 HTTP duration, and request volume. It's a starting point — extend it with the PromQL queries above for inference-specific views.</p>
<p><img decoding="async" loading="lazy" alt="Grafana pre-built dashboard for Llama Stack" src="https://llamastack.github.io/assets/images/grafana-4066c802715b016a24f501d6e181c9c2.png" width="3256" height="2006" class="img_ev3q"></p>
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="step-6-set-up-alerts">Step 6: Set Up Alerts<a href="https://llamastack.github.io/blog/observability-for-llama-stack#step-6-set-up-alerts" class="hash-link" aria-label="Direct link to Step 6: Set Up Alerts" title="Direct link to Step 6: Set Up Alerts" translate="no">​</a></h3>
<p>With metrics in Prometheus, you can set up alerts for the things that actually page you at 3 AM; a sketch for sanity-checking these expressions follows the list:</p>
<ul>
<li class=""><strong>High latency</strong>: P99 inference duration &gt; 10s sustained for 5 minutes</li>
<li class=""><strong>Error rate spike</strong>: Error rate &gt; 5% over a 5-minute window</li>
<li class=""><strong>Provider down</strong>: Zero successful requests to a provider for 2 minutes</li>
<li class=""><strong>Capacity warning</strong>: Concurrent requests consistently above threshold</li>
</ul>
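<p>These conditions translate directly into PromQL. Before wiring them into Alertmanager, you can check an expression against the Prometheus HTTP API; a quick sketch using the P99 inference-duration query from the table above:</p>
<pre class="prism-code language-python"><code>import requests

# Evaluate an alert expression once via Prometheus' instant-query API.
query = (
    "histogram_quantile(0.99, "
    "rate(llama_stack_inference_duration_seconds_bucket[5m]))"
)
resp = requests.get(
    "http://localhost:9090/api/v1/query",
    params={"query": query},
    timeout=10,
)
for result in resp.json()["data"]["result"]:
    print(result["metric"], result["value"])
</code></pre>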
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="cleanup">Cleanup<a href="https://llamastack.github.io/blog/observability-for-llama-stack#cleanup" class="hash-link" aria-label="Direct link to Cleanup" title="Direct link to Cleanup" translate="no">​</a></h3>
<div class="language-bash codeBlockContainer_Ckt0 theme-code-block" style="--prism-background-color:hsl(220, 13%, 18%);--prism-color:hsl(220, 14%, 71%)"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-bash codeBlock_bY9V thin-scrollbar" style="background-color:hsl(220, 13%, 18%);color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token function" style="color:hsl(207, 82%, 66%)">docker</span><span class="token plain"> stop jaeger otel-collector prometheus grafana</span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain"></span><span class="token function" style="color:hsl(207, 82%, 66%)">docker</span><span class="token plain"> </span><span class="token function" style="color:hsl(207, 82%, 66%)">rm</span><span class="token plain"> jaeger otel-collector prometheus grafana</span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain"></span><span class="token function" style="color:hsl(207, 82%, 66%)">docker</span><span class="token plain"> network </span><span class="token function" style="color:hsl(207, 82%, 66%)">rm</span><span class="token plain"> llama-telemetry</span><br></span></code></pre></div></div>
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="whats-next">What's Next<a href="https://llamastack.github.io/blog/observability-for-llama-stack#whats-next" class="hash-link" aria-label="Direct link to What's Next" title="Direct link to What's Next" translate="no">​</a></h2>
<p>The instrumentation is in place, and we're planning to expand it. If you have ideas for metrics that would help you operate Llama Stack in production or if you've built interesting dashboards on top of what's there, we'd love to hear about it. Open an issue or check the <a href="https://github.com/llamastack/llama-stack/blob/main/CONTRIBUTING.md" target="_blank" rel="noopener noreferrer" class="">contributing guide</a>.</p>]]></content>
        <author>
            <name>Guangya Liu</name>
            <uri>https://github.com/gyliu513</uri>
        </author>
        <category label="observability" term="observability"/>
        <category label="opentelemetry" term="opentelemetry"/>
        <category label="metrics" term="metrics"/>
        <category label="tracing" term="tracing"/>
        <category label="monitoring" term="monitoring"/>
    </entry>
    <entry>
        <title type="html"><![CDATA[Llama Stack Achieves 100% Open Responses Compliance: Enterprise-Grade OpenAI Compatibility for Your Infrastructure]]></title>
        <id>https://llamastack.github.io/blog/open-responses-openai-compatibility</id>
        <link href="https://llamastack.github.io/blog/open-responses-openai-compatibility"/>
        <updated>2026-03-20T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[We're excited to share that Llama Stack has achieved 100% compliance with the Open Responses specification and been officially recognized as part of the Open Responses community. This milestone represents more than just compatibility: it's about bringing enterprise-grade AI capabilities to your own infrastructure with the familiarity of OpenAI APIs.]]></summary>
        <content type="html"><![CDATA[<p>We're excited to share that Llama Stack has achieved <strong>100% compliance with the Open Responses specification</strong> and been officially recognized as part of the <a href="https://github.com/openresponses/openresponses/pull/29" target="_blank" rel="noopener noreferrer" class="">Open Responses community</a>. This milestone represents more than just compatibility: it's about bringing enterprise-grade AI capabilities to your own infrastructure with the familiarity of OpenAI APIs.</p>
<p>With comprehensive support for Files, Vector Stores, Search, Conversations, Prompts, Chat Completions, the full Responses API, plus powerful extensions like MCP tool integration, Tool Calling, and Connectors, Llama Stack offers something unique in the AI infrastructure landscape: a SaaS-like experience that runs entirely on your terms.</p>
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="recognition-by-the-open-responses-community">Recognition by the Open Responses Community<a href="https://llamastack.github.io/blog/open-responses-openai-compatibility#recognition-by-the-open-responses-community" class="hash-link" aria-label="Direct link to Recognition by the Open Responses Community" title="Direct link to Recognition by the Open Responses Community" translate="no">​</a></h2>
<p>The <a href="https://www.openresponses.org/" target="_blank" rel="noopener noreferrer" class="">Open Responses initiative</a> represents a collaborative effort to standardize agentic AI interfaces across the industry, with backing from OpenAI, Hugging Face, and leading providers like Ollama, vLLM, and LM Studio. Our acceptance into this community validates Llama Stack's commitment to open standards and interoperability.</p>
<p>What makes this recognition particularly meaningful is our approach to compliance. We don't just aim for compatibility—<strong>we run the full Open Responses acceptance test suite on every pull request as a blocking requirement</strong>. This means our perfect 6/6 test pass rate isn't a one-time achievement; it's a maintained standard that ensures consistent, reliable behavior for developers building on open standards.</p>
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="comprehensive-openai-api-feature-support">Comprehensive OpenAI API Feature Support<a href="https://llamastack.github.io/blog/open-responses-openai-compatibility#comprehensive-openai-api-feature-support" class="hash-link" aria-label="Direct link to Comprehensive OpenAI API Feature Support" title="Direct link to Comprehensive OpenAI API Feature Support" translate="no">​</a></h2>
<p>Llama Stack delivers comprehensive feature parity across multiple API surfaces, giving you the full power of modern AI APIs.</p>
<blockquote>
<p><strong>A note on model IDs:</strong> The model ID you pass depends on the inference provider backing your Llama Stack server. For example, with Ollama you'd use <code>ollama/llama3.2:3b</code>, while with Fireworks or Together you'd use the HuggingFace-style <code>meta-llama/Llama-3.2-3B-Instruct</code>. The API calls are identical either way.</p>
</blockquote>
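<p>For instance, the same chat completion against an Ollama-backed server looks like this (base URL and model ID as in the note above; illustrative only):</p>
<pre class="prism-code language-python"><code>from openai import OpenAI

client = OpenAI(base_url="http://localhost:8321/v1/", api_key="none")

# Identical call shape; only the model ID reflects the backing provider.
response = client.chat.completions.create(
    model="ollama/llama3.2:3b",
    messages=[{"role": "user", "content": "Say hello."}],
)
print(response.choices[0].message.content)
</code></pre>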
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="files-api---openai-compatible-document-management"><strong>Files API</strong> - OpenAI-Compatible Document Management<a href="https://llamastack.github.io/blog/open-responses-openai-compatibility#files-api---openai-compatible-document-management" class="hash-link" aria-label="Direct link to files-api---openai-compatible-document-management" title="Direct link to files-api---openai-compatible-document-management" translate="no">​</a></h3>
<p>Upload, manage, and process documents with the same interface you'd use with OpenAI:</p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-background-color:hsl(220, 13%, 18%);--prism-color:hsl(220, 14%, 71%)"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="background-color:hsl(220, 13%, 18%);color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token comment" style="color:hsl(220, 10%, 40%)"># Works identically with OpenAI or Llama Stack clients</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain"></span><span class="token keyword" style="color:hsl(286, 60%, 67%)">from</span><span class="token plain"> openai </span><span class="token keyword" style="color:hsl(286, 60%, 67%)">import</span><span class="token plain"> OpenAI</span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">client </span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token plain"> OpenAI</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">(</span><span class="token plain">base_url</span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token string" style="color:hsl(95, 38%, 62%)">"http://localhost:8321/v1/"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"> api_key</span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token string" style="color:hsl(95, 38%, 62%)">"none"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain"></span><span class="token builtin" style="color:hsl(95, 38%, 62%)">file</span><span class="token plain"> </span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token plain"> client</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">files</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">create</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">(</span><span class="token builtin" style="color:hsl(95, 38%, 62%)">file</span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token builtin" style="color:hsl(95, 38%, 62%)">open</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">(</span><span class="token string" style="color:hsl(95, 38%, 62%)">"document.pdf"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"> </span><span class="token string" style="color:hsl(95, 38%, 62%)">"rb"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">)</span><span class="token 
punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"> purpose</span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token string" style="color:hsl(95, 38%, 62%)">"assistants"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain"></span><span class="token comment" style="color:hsl(220, 10%, 40%)"># List and manage files</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">files </span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token plain"> client</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">files</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token builtin" style="color:hsl(95, 38%, 62%)">list</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">(</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">content </span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token plain"> client</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">files</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">content</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">(</span><span class="token builtin" style="color:hsl(95, 38%, 62%)">file</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token builtin" style="color:hsl(95, 38%, 62%)">id</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">)</span><br></span></code></pre></div></div>
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="vector-stores-api---rag-without-vendor-lock-in"><strong>Vector Stores API</strong> - RAG Without Vendor Lock-in<a href="https://llamastack.github.io/blog/open-responses-openai-compatibility#vector-stores-api---rag-without-vendor-lock-in" class="hash-link" aria-label="Direct link to vector-stores-api---rag-without-vendor-lock-in" title="Direct link to vector-stores-api---rag-without-vendor-lock-in" translate="no">​</a></h3>
<p>Build retrieval-augmented generation applications using the full Vector Stores API:</p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-background-color:hsl(220, 13%, 18%);--prism-color:hsl(220, 14%, 71%)"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="background-color:hsl(220, 13%, 18%);color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token comment" style="color:hsl(220, 10%, 40%)"># Create vector stores with nested file management</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">vector_store </span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token plain"> client</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">vector_stores</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">create</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">(</span><span class="token plain">name</span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token string" style="color:hsl(95, 38%, 62%)">"knowledge_base"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain"></span><span class="token comment" style="color:hsl(220, 10%, 40%)"># Add files and manage vector store content</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">client</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">vector_stores</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">files</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">create</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">(</span><span class="token plain">vector_store_id</span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token plain">vector_store</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token builtin" style="color:hsl(95, 38%, 62%)">id</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"> file_id</span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token builtin" style="color:hsl(95, 38%, 62%)">file</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token builtin" style="color:hsl(95, 38%, 62%)">id</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" 
style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain"></span><span class="token comment" style="color:hsl(220, 10%, 40%)"># Search functionality built-in</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">results </span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token plain"> client</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">vector_stores</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">search</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">(</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">    vector_store_id</span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token plain">vector_store</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token builtin" style="color:hsl(95, 38%, 62%)">id</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"> query</span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token string" style="color:hsl(95, 38%, 62%)">"What is our refund policy?"</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain"></span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">)</span><br></span></code></pre></div></div>
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="conversations-api---persistent-context-management"><strong>Conversations API</strong> - Persistent Context Management<a href="https://llamastack.github.io/blog/open-responses-openai-compatibility#conversations-api---persistent-context-management" class="hash-link" aria-label="Direct link to conversations-api---persistent-context-management" title="Direct link to conversations-api---persistent-context-management" translate="no">​</a></h3>
<p>Manage conversation state and continuity across interactions:</p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-background-color:hsl(220, 13%, 18%);--prism-color:hsl(220, 14%, 71%)"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="background-color:hsl(220, 13%, 18%);color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token comment" style="color:hsl(220, 10%, 40%)"># Create a conversation</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">conversation </span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token plain"> client</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">conversations</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">create</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">(</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain"></span><span class="token comment" style="color:hsl(220, 10%, 40%)"># Add items to a conversation</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">client</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">conversations</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">items</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">create</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">(</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">    conversation_id</span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token plain">conversation</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token builtin" style="color:hsl(95, 38%, 62%)">id</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">    items</span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">[</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">        </span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">{</span><span class="token string" style="color:hsl(95, 38%, 62%)">"role"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">:</span><span 
class="token plain"> </span><span class="token string" style="color:hsl(95, 38%, 62%)">"user"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"> </span><span class="token string" style="color:hsl(95, 38%, 62%)">"content"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">:</span><span class="token plain"> </span><span class="token string" style="color:hsl(95, 38%, 62%)">"Tell me about our product features"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">}</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">        </span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">{</span><span class="token string" style="color:hsl(95, 38%, 62%)">"role"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">:</span><span class="token plain"> </span><span class="token string" style="color:hsl(95, 38%, 62%)">"assistant"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"> </span><span class="token string" style="color:hsl(95, 38%, 62%)">"content"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">:</span><span class="token plain"> </span><span class="token string" style="color:hsl(95, 38%, 62%)">"I'd be happy to explain..."</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">}</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">    </span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">]</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain"></span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain"></span><span class="token comment" style="color:hsl(220, 10%, 40%)"># Retrieve conversation history</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">items </span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token plain"> client</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">conversations</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">items</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token builtin" style="color:hsl(95, 38%, 62%)">list</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">(</span><span class="token plain">conversation_id</span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token 
plain">conversation</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token builtin" style="color:hsl(95, 38%, 62%)">id</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">)</span><br></span></code></pre></div></div>
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="chat-completions--responses---simple-chat-to-agentic-workflows"><strong>Chat Completions &amp; Responses</strong> - Simple Chat to Agentic Workflows<a href="https://llamastack.github.io/blog/open-responses-openai-compatibility#chat-completions--responses---simple-chat-to-agentic-workflows" class="hash-link" aria-label="Direct link to chat-completions--responses---simple-chat-to-agentic-workflows" title="Direct link to chat-completions--responses---simple-chat-to-agentic-workflows" translate="no">​</a></h3>
<p>From straightforward inference to multi-tool orchestration:</p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-background-color:hsl(220, 13%, 18%);--prism-color:hsl(220, 14%, 71%)"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="background-color:hsl(220, 13%, 18%);color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token comment" style="color:hsl(220, 10%, 40%)"># Standard chat completions (e.g., with Ollama)</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">completion </span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token plain"> client</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">chat</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">completions</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">create</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">(</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">    model</span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token string" style="color:hsl(95, 38%, 62%)">"ollama/gpt-oss:20b"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"> messages</span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">[</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">{</span><span class="token string" style="color:hsl(95, 38%, 62%)">"role"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">:</span><span class="token plain"> </span><span class="token string" style="color:hsl(95, 38%, 62%)">"user"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"> </span><span class="token string" style="color:hsl(95, 38%, 62%)">"content"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">:</span><span class="token plain"> </span><span class="token string" style="color:hsl(95, 38%, 62%)">"Explain RAG"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">}</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">]</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain"></span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain"></span><span class="token comment" style="color:hsl(220, 10%, 40%)"># Advanced responses with tool orchestration (e.g., with Fireworks)</span><span class="token plain"></span><br></span><span class="token-line" 
style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">response </span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token plain"> client</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">responses</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">create</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">(</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">    model</span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token string" style="color:hsl(95, 38%, 62%)">"ollama/gpt-oss:20b"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">    </span><span class="token builtin" style="color:hsl(95, 38%, 62%)">input</span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token string" style="color:hsl(95, 38%, 62%)">"What documents mention our pricing strategy?"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">    tools</span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">[</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">{</span><span class="token string" style="color:hsl(95, 38%, 62%)">"type"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">:</span><span class="token plain"> </span><span class="token string" style="color:hsl(95, 38%, 62%)">"file_search"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">}</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">]</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain"></span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">)</span><br></span></code></pre></div></div>
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="prompts-api---programmatic-prompt-management"><strong>Prompts API</strong> - Programmatic Prompt Management<a href="https://llamastack.github.io/blog/open-responses-openai-compatibility#prompts-api---programmatic-prompt-management" class="hash-link" aria-label="Direct link to prompts-api---programmatic-prompt-management" title="Direct link to prompts-api---programmatic-prompt-management" translate="no">​</a></h3>
<p>Llama Stack extends OpenAI compatibility with full programmatic prompt management. With OpenAI, prompts are created through their admin portal and referenced by ID in the Responses API. Llama Stack provides the same referencing pattern, plus a complete CRUD API for creating and managing prompts programmatically:</p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-background-color:hsl(220, 13%, 18%);--prism-color:hsl(220, 14%, 71%)"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="background-color:hsl(220, 13%, 18%);color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token keyword" style="color:hsl(286, 60%, 67%)">from</span><span class="token plain"> llama_stack_client </span><span class="token keyword" style="color:hsl(286, 60%, 67%)">import</span><span class="token plain"> LlamaStackClient</span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">ls_client </span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token plain"> LlamaStackClient</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">(</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain"></span><span class="token comment" style="color:hsl(220, 10%, 40%)"># Create reusable prompt templates with variables</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">prompt </span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token plain"> ls_client</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">prompts</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">create</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">(</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">    prompt</span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token string" style="color:hsl(95, 38%, 62%)">"You are a {{ role }} assistant. 
Analyze this: {{ content }}"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">    variables</span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">[</span><span class="token string" style="color:hsl(95, 38%, 62%)">"role"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"> </span><span class="token string" style="color:hsl(95, 38%, 62%)">"content"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">]</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain"></span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain"></span><span class="token comment" style="color:hsl(220, 10%, 40%)"># Reference prompts in responses — compatible with OpenAI's pattern</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">response </span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token plain"> client</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">responses</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">create</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">(</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">    model</span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token string" style="color:hsl(95, 38%, 62%)">"ollama/gpt-oss:20b"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">    </span><span class="token builtin" style="color:hsl(95, 38%, 62%)">input</span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">[</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">{</span><span class="token string" style="color:hsl(95, 38%, 62%)">"role"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">:</span><span class="token plain"> </span><span class="token string" style="color:hsl(95, 38%, 62%)">"user"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"> </span><span class="token string" style="color:hsl(95, 38%, 62%)">"content"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">:</span><span class="token plain"> 
</span><span class="token string" style="color:hsl(95, 38%, 62%)">"Review our Q1 report"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">}</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">]</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">    prompt</span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">{</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">        </span><span class="token string" style="color:hsl(95, 38%, 62%)">"id"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">:</span><span class="token plain"> prompt</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">prompt_id</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">        </span><span class="token string" style="color:hsl(95, 38%, 62%)">"variables"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">:</span><span class="token plain"> </span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">{</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">            </span><span class="token string" style="color:hsl(95, 38%, 62%)">"role"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">:</span><span class="token plain"> </span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">{</span><span class="token string" style="color:hsl(95, 38%, 62%)">"type"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">:</span><span class="token plain"> </span><span class="token string" style="color:hsl(95, 38%, 62%)">"input_text"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"> </span><span class="token string" style="color:hsl(95, 38%, 62%)">"text"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">:</span><span class="token plain"> </span><span class="token string" style="color:hsl(95, 38%, 62%)">"financial analyst"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">}</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">            </span><span class="token string" style="color:hsl(95, 38%, 62%)">"content"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">:</span><span class="token plain"> </span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">{</span><span class="token string" style="color:hsl(95, 38%, 62%)">"type"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">:</span><span class="token plain"> </span><span class="token string" style="color:hsl(95, 38%, 62%)">"input_text"</span><span class="token 
punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"> </span><span class="token string" style="color:hsl(95, 38%, 62%)">"text"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">:</span><span class="token plain"> </span><span class="token string" style="color:hsl(95, 38%, 62%)">"Q1 2026 earnings report"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">}</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">        </span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">}</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">    </span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">}</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain"></span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">)</span><br></span></code></pre></div></div>
<p>This gives you the best of both worlds: compatibility with OpenAI's prompt referencing pattern in the Responses API, plus the ability to manage prompts as code rather than through a web interface.</p>
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="mcp-integration---extensible-tool-ecosystem"><strong>MCP Integration</strong> - Extensible Tool Ecosystem<a href="https://llamastack.github.io/blog/open-responses-openai-compatibility#mcp-integration---extensible-tool-ecosystem" class="hash-link" aria-label="Direct link to mcp-integration---extensible-tool-ecosystem" title="Direct link to mcp-integration---extensible-tool-ecosystem" translate="no">​</a></h3>
<p>Leverage the Model Context Protocol to connect to any MCP server and dynamically discover tools:</p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-background-color:hsl(220, 13%, 18%);--prism-color:hsl(220, 14%, 71%)"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="background-color:hsl(220, 13%, 18%);color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token comment" style="color:hsl(220, 10%, 40%)"># Connect to MCP servers for dynamic tool discovery</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">response </span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token plain"> client</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">responses</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">create</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">(</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">    model</span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token string" style="color:hsl(95, 38%, 62%)">"ollama/gpt-oss:20b"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">    </span><span class="token builtin" style="color:hsl(95, 38%, 62%)">input</span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token string" style="color:hsl(95, 38%, 62%)">"What parks are in Rhode Island, and are there upcoming events?"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">    tools</span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">[</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">        </span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">{</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">            </span><span class="token string" style="color:hsl(95, 38%, 62%)">"type"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">:</span><span class="token plain"> </span><span class="token string" style="color:hsl(95, 38%, 62%)">"mcp"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">            </span><span class="token string" style="color:hsl(95, 38%, 62%)">"server_label"</span><span class="token punctuation" style="color:hsl(220, 
14%, 71%)">:</span><span class="token plain"> </span><span class="token string" style="color:hsl(95, 38%, 62%)">"parks-service"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">            </span><span class="token string" style="color:hsl(95, 38%, 62%)">"server_url"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">:</span><span class="token plain"> </span><span class="token string" style="color:hsl(95, 38%, 62%)">"http://parks-mcp-server:8000/sse"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">        </span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">}</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">    </span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">]</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain"></span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">)</span><br></span></code></pre></div></div>
<p>MCP tools support per-request authorization, allowed tool filtering, and automatic session management. Connect to databases, APIs, and internal services through the growing ecosystem of standard MCP servers—no custom integration work required.</p>
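<p>For instance, a single request can scope which tools the model may call and attach a per-request credential for the MCP server. The following is a minimal sketch, assuming the <code>allowed_tools</code> and <code>headers</code> fields of the Responses API MCP tool type; the tool name and token are placeholders:</p>
<pre><code class="language-python"># Sketch: restrict tool access and pass per-request credentials to an MCP server.
# "allowed_tools" and "headers" are assumed here to follow the Responses API
# MCP tool shape; "YOUR_TOKEN" is a placeholder, not a real credential.
response = client.responses.create(
    model="ollama/gpt-oss:20b",
    input="What parks are in Rhode Island?",
    tools=[
        {
            "type": "mcp",
            "server_label": "parks-service",
            "server_url": "http://parks-mcp-server:8000/sse",
            "allowed_tools": ["search_parks"],  # expose only this tool to the model
            "headers": {"Authorization": "Bearer YOUR_TOKEN"},  # per-request auth
        }
    ],
)</code></pre>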
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="connectors---declarative-service-integration"><strong>Connectors</strong> - Declarative Service Integration<a href="https://llamastack.github.io/blog/open-responses-openai-compatibility#connectors---declarative-service-integration" class="hash-link" aria-label="Direct link to connectors---declarative-service-integration" title="Direct link to connectors---declarative-service-integration" translate="no">​</a></h3>
<p>Connectors provide a configuration-driven approach to integrating external services with your Llama Stack deployment. Define your data sources and services in your stack configuration, and they're automatically available as tools for your agents to use.</p>
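<p>As a quick sanity check (a sketch assuming the Python client's toolgroup listing API), you can verify that connectors declared in your stack configuration surface as tools at runtime:</p>
<pre><code class="language-python">from llama_stack_client import LlamaStackClient

client = LlamaStackClient(base_url="http://localhost:8321")

# Connectors declared in the stack configuration appear alongside
# built-in toolgroups; no per-application wiring is required.
for toolgroup in client.toolgroups.list():
    print(toolgroup.identifier)</code></pre>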
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="the-value-proposition-saas-experience-your-infrastructure">The Value Proposition: SaaS Experience, Your Infrastructure<a href="https://llamastack.github.io/blog/open-responses-openai-compatibility#the-value-proposition-saas-experience-your-infrastructure" class="hash-link" aria-label="Direct link to The Value Proposition: SaaS Experience, Your Infrastructure" title="Direct link to The Value Proposition: SaaS Experience, Your Infrastructure" translate="no">​</a></h2>
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="data-sovereignty--security"><strong>Data Sovereignty &amp; Security</strong><a href="https://llamastack.github.io/blog/open-responses-openai-compatibility#data-sovereignty--security" class="hash-link" aria-label="Direct link to data-sovereignty--security" title="Direct link to data-sovereignty--security" translate="no">​</a></h3>
<p>For regulated industries like finance, healthcare, and government, sending sensitive documents to external APIs isn't an option. Llama Stack solves this by running entirely on your infrastructure:</p>
<ul>
<li><strong>Documents never leave your environment</strong>: RAG pipelines, vector storage, and model inference all happen locally</li>
<li><strong>Compliance-ready</strong>: Meet HIPAA, SOC 2, GDPR, and other regulatory requirements</li>
<li><strong>Audit trails</strong>: Full visibility into data processing and model decisions</li>
</ul>
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="cost-control--predictability"><strong>Cost Control &amp; Predictability</strong><a href="https://llamastack.github.io/blog/open-responses-openai-compatibility#cost-control--predictability" class="hash-link" aria-label="Direct link to cost-control--predictability" title="Direct link to cost-control--predictability" translate="no">​</a></h3>
<p>Unlike consumption-based pricing models, Llama Stack offers:</p>
<ul>
<li><strong>Fixed infrastructure costs</strong>: Pay for compute, not tokens</li>
<li><strong>No usage surprises</strong>: Predictable costs regardless of application load</li>
<li><strong>Efficient resource utilization</strong>: Choose the right model size for your use case</li>
</ul>
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="model-freedom"><strong>Model Freedom</strong><a href="https://llamastack.github.io/blog/open-responses-openai-compatibility#model-freedom" class="hash-link" aria-label="Direct link to model-freedom" title="Direct link to model-freedom" translate="no">​</a></h3>
<p>Break free from vendor-specific models:</p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-background-color:hsl(220, 13%, 18%);--prism-color:hsl(220, 14%, 71%)"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="background-color:hsl(220, 13%, 18%);color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token comment" style="color:hsl(220, 10%, 40%)"># Same API, different models — swap without code changes</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain"></span><span class="token keyword" style="color:hsl(286, 60%, 67%)">for</span><span class="token plain"> model </span><span class="token keyword" style="color:hsl(286, 60%, 67%)">in</span><span class="token plain"> </span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">[</span><span class="token string" style="color:hsl(95, 38%, 62%)">"ollama/gpt-oss:20b"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"> </span><span class="token string" style="color:hsl(95, 38%, 62%)">"ollama/llama3.2:3b"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"> </span><span class="token string" style="color:hsl(95, 38%, 62%)">"your-org/custom-model"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">]</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">    response </span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token plain"> client</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">chat</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">completions</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">create</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">(</span><span class="token plain">model</span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token plain">model</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"> messages</span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token plain">messages</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">)</span><br></span></code></pre></div></div>
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="getting-started-in-minutes">Getting Started in Minutes<a href="https://llamastack.github.io/blog/open-responses-openai-compatibility#getting-started-in-minutes" class="hash-link" aria-label="Direct link to Getting Started in Minutes" title="Direct link to Getting Started in Minutes" translate="no">​</a></h2>
<p>Whether you're prototyping locally or deploying at scale, Llama Stack makes it easy:</p>
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="local-development"><strong>Local Development</strong><a href="https://llamastack.github.io/blog/open-responses-openai-compatibility#local-development" class="hash-link" aria-label="Direct link to local-development" title="Direct link to local-development" translate="no">​</a></h3>
<div class="language-bash codeBlockContainer_Ckt0 theme-code-block" style="--prism-background-color:hsl(220, 13%, 18%);--prism-color:hsl(220, 14%, 71%)"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-bash codeBlock_bY9V thin-scrollbar" style="background-color:hsl(220, 13%, 18%);color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token comment" style="color:hsl(220, 10%, 40%)"># Set up your environment</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">uv venv </span><span class="token parameter variable" style="color:hsl(207, 82%, 66%)">--python</span><span class="token plain"> </span><span class="token number" style="color:hsl(29, 54%, 61%)">3.12</span><span class="token plain"> </span><span class="token parameter variable" style="color:hsl(207, 82%, 66%)">--seed</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain"></span><span class="token builtin class-name" style="color:hsl(29, 54%, 61%)">source</span><span class="token plain"> .venv/bin/activate</span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">uv pip </span><span class="token function" style="color:hsl(207, 82%, 66%)">install</span><span class="token plain"> </span><span class="token parameter variable" style="color:hsl(207, 82%, 66%)">-U</span><span class="token plain"> llama-stack</span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">uv run llama stack list-deps starter </span><span class="token operator" style="color:hsl(207, 82%, 66%)">|</span><span class="token plain"> </span><span class="token function" style="color:hsl(207, 82%, 66%)">xargs</span><span class="token plain"> </span><span class="token parameter variable" style="color:hsl(207, 82%, 66%)">-L1</span><span class="token plain"> uv pip </span><span class="token function" style="color:hsl(207, 82%, 66%)">install</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain"></span><span class="token comment" style="color:hsl(220, 10%, 40%)"># Start Ollama and pull a model</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">ollama serve</span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">ollama run gpt-oss:20b</span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain"></span><span class="token comment" style="color:hsl(220, 10%, 40%)"># Launch Llama Stack with the starter 
distribution</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain"></span><span class="token assign-left variable" style="color:hsl(207, 82%, 66%)">OLLAMA_URL</span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token plain">http://localhost:11434/v1 uv run llama stack run starter</span><br></span></code></pre></div></div>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-background-color:hsl(220, 13%, 18%);--prism-color:hsl(220, 14%, 71%)"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="background-color:hsl(220, 13%, 18%);color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token comment" style="color:hsl(220, 10%, 40%)"># Use with the OpenAI client</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain"></span><span class="token keyword" style="color:hsl(286, 60%, 67%)">from</span><span class="token plain"> openai </span><span class="token keyword" style="color:hsl(286, 60%, 67%)">import</span><span class="token plain"> OpenAI</span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">client </span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token plain"> OpenAI</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">(</span><span class="token plain">base_url</span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token string" style="color:hsl(95, 38%, 62%)">"http://localhost:8321/v1"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"> api_key</span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token string" style="color:hsl(95, 38%, 62%)">"none"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">response </span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token plain"> client</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">responses</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">create</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">(</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">    model</span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token string" style="color:hsl(95, 38%, 62%)">"ollama/gpt-oss:20b"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"> </span><span class="token builtin" style="color:hsl(95, 38%, 62%)">input</span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token string" style="color:hsl(95, 38%, 62%)">"Write a haiku about open source."</span><span class="token plain"></span><br></span><span class="token-line" 
style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain"></span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain"></span><span class="token keyword" style="color:hsl(286, 60%, 67%)">print</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">(</span><span class="token plain">response</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">output_text</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">)</span><br></span></code></pre></div></div>
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="production-deployment"><strong>Production Deployment</strong><a href="https://llamastack.github.io/blog/open-responses-openai-compatibility#production-deployment" class="hash-link" aria-label="Direct link to production-deployment" title="Direct link to production-deployment" translate="no">​</a></h3>
<div class="language-bash codeBlockContainer_Ckt0 theme-code-block" style="--prism-background-color:hsl(220, 13%, 18%);--prism-color:hsl(220, 14%, 71%)"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-bash codeBlock_bY9V thin-scrollbar" style="background-color:hsl(220, 13%, 18%);color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token comment" style="color:hsl(220, 10%, 40%)"># Deploy with your preferred infrastructure</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain"></span><span class="token comment" style="color:hsl(220, 10%, 40%)"># Docker, Kubernetes, or bare metal — your choice</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain"></span><span class="token function" style="color:hsl(207, 82%, 66%)">docker</span><span class="token plain"> run </span><span class="token parameter variable" style="color:hsl(207, 82%, 66%)">-p</span><span class="token plain"> </span><span class="token number" style="color:hsl(29, 54%, 61%)">8321</span><span class="token plain">:8321 llamastack/distribution-starter:latest</span><br></span></code></pre></div></div>
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="framework-ecosystem-compatibility">Framework Ecosystem Compatibility<a href="https://llamastack.github.io/blog/open-responses-openai-compatibility#framework-ecosystem-compatibility" class="hash-link" aria-label="Direct link to Framework Ecosystem Compatibility" title="Direct link to Framework Ecosystem Compatibility" translate="no">​</a></h2>
<p>One of Llama Stack's biggest advantages is <strong>drop-in compatibility</strong> with existing tooling:</p>
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="direct-openai-client"><strong>Direct OpenAI Client</strong><a href="https://llamastack.github.io/blog/open-responses-openai-compatibility#direct-openai-client" class="hash-link" aria-label="Direct link to direct-openai-client" title="Direct link to direct-openai-client" translate="no">​</a></h3>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-background-color:hsl(220, 13%, 18%);--prism-color:hsl(220, 14%, 71%)"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="background-color:hsl(220, 13%, 18%);color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token keyword" style="color:hsl(286, 60%, 67%)">from</span><span class="token plain"> openai </span><span class="token keyword" style="color:hsl(286, 60%, 67%)">import</span><span class="token plain"> OpenAI</span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain"></span><span class="token comment" style="color:hsl(220, 10%, 40%)"># Same code, different backend</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">client </span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token plain"> OpenAI</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">(</span><span class="token plain">base_url</span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token string" style="color:hsl(95, 38%, 62%)">"http://your-llama-stack/v1"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"> api_key</span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token string" style="color:hsl(95, 38%, 62%)">"none"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">)</span><br></span></code></pre></div></div>
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="langchain-integration"><strong>LangChain Integration</strong><a href="https://llamastack.github.io/blog/open-responses-openai-compatibility#langchain-integration" class="hash-link" aria-label="Direct link to langchain-integration" title="Direct link to langchain-integration" translate="no">​</a></h3>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-background-color:hsl(220, 13%, 18%);--prism-color:hsl(220, 14%, 71%)"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="background-color:hsl(220, 13%, 18%);color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token keyword" style="color:hsl(286, 60%, 67%)">from</span><span class="token plain"> langchain_openai </span><span class="token keyword" style="color:hsl(286, 60%, 67%)">import</span><span class="token plain"> ChatOpenAI</span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain"></span><span class="token comment" style="color:hsl(220, 10%, 40%)"># Point to your Llama Stack server</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">llm </span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token plain"> ChatOpenAI</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">(</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">    base_url</span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token string" style="color:hsl(95, 38%, 62%)">"http://your-llama-stack/v1/openai/v1"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">    api_key</span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token string" style="color:hsl(95, 38%, 62%)">"none"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">    model</span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token string" style="color:hsl(95, 38%, 62%)">"ollama/gpt-oss:20b"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain"></span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">)</span><br></span></code></pre></div></div>
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="native-llama-stack-client"><strong>Native Llama Stack Client</strong><a href="https://llamastack.github.io/blog/open-responses-openai-compatibility#native-llama-stack-client" class="hash-link" aria-label="Direct link to native-llama-stack-client" title="Direct link to native-llama-stack-client" translate="no">​</a></h3>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-background-color:hsl(220, 13%, 18%);--prism-color:hsl(220, 14%, 71%)"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="background-color:hsl(220, 13%, 18%);color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token keyword" style="color:hsl(286, 60%, 67%)">from</span><span class="token plain"> llama_stack_client </span><span class="token keyword" style="color:hsl(286, 60%, 67%)">import</span><span class="token plain"> LlamaStackClient</span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain"></span><span class="token comment" style="color:hsl(220, 10%, 40%)"># Access the full Llama Stack API surface</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">client </span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token plain"> LlamaStackClient</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">(</span><span class="token plain">base_url</span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token string" style="color:hsl(95, 38%, 62%)">"http://your-llama-stack"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">)</span><br></span></code></pre></div></div>
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="built-for-open-standards">Built for Open Standards<a href="https://llamastack.github.io/blog/open-responses-openai-compatibility#built-for-open-standards" class="hash-link" aria-label="Direct link to Built for Open Standards" title="Direct link to Built for Open Standards" translate="no">​</a></h2>
<p>Our 100% Open Responses compliance reflects a broader philosophy: <strong>open standards enable innovation</strong>. When you build on Llama Stack, you're not just adopting our implementation—you're investing in an ecosystem where:</p>
<ul>
<li><strong>Applications are portable</strong>: Move between providers without rewriting code</li>
<li><strong>Standards evolve collaboratively</strong>: Community-driven development rather than vendor dictates</li>
<li><strong>Innovation is shared</strong>: Improvements benefit the entire ecosystem</li>
</ul>
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="technical-excellence-through-testing">Technical Excellence Through Testing<a href="https://llamastack.github.io/blog/open-responses-openai-compatibility#technical-excellence-through-testing" class="hash-link" aria-label="Direct link to Technical Excellence Through Testing" title="Direct link to Technical Excellence Through Testing" translate="no">​</a></h2>
<p>Achieving 100% Open Responses compliance required rigorous engineering:</p>
<ul>
<li><strong>Perfect conformance testing</strong>: Every PR runs the full Open Responses test suite, with 6/6 tests passing</li>
<li><strong>Automated compliance validation</strong>: Blocking requirements ensure compliance is maintained, not merely achieved once</li>
<li><strong>Production testing</strong>: Integration tests with real workloads and multi-provider scenarios</li>
<li><strong>Comprehensive API coverage</strong>: Full implementation of the Open Responses specification</li>
</ul>
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="whats-next">What's Next<a href="https://llamastack.github.io/blog/open-responses-openai-compatibility#whats-next" class="hash-link" aria-label="Direct link to What's Next" title="Direct link to What's Next" translate="no">​</a></h2>
<p>Llama Stack's OpenAI compatibility is just the beginning. We're actively working on:</p>
<ul>
<li class=""><strong>Enhanced streaming support</strong>: Improved real-time response handling</li>
<li class=""><strong>Extended MCP ecosystem</strong>: Deeper tool integration and connector development</li>
<li class=""><strong>Performance optimizations</strong>: Faster inference and better resource utilization</li>
<li class=""><strong>Broader OpenAI API coverage</strong>: Expanding compatibility beyond our current feature set</li>
</ul>
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="join-the-open-ai-infrastructure-movement">Join the Open AI Infrastructure Movement<a href="https://llamastack.github.io/blog/open-responses-openai-compatibility#join-the-open-ai-infrastructure-movement" class="hash-link" aria-label="Direct link to Join the Open AI Infrastructure Movement" title="Direct link to Join the Open AI Infrastructure Movement" translate="no">​</a></h2>
<p>Llama Stack represents something new in the AI infrastructure landscape: <strong>enterprise-grade capabilities without vendor lock-in</strong>. Whether you're a startup building your first AI application or an enterprise looking to bring AI workloads in-house, Llama Stack provides the reliability, security, and compatibility you need.</p>
<p>Ready to get started?</p>
<ul>
<li class=""><strong>📚 <a href="https://llamastack.github.io/docs/" target="_blank" rel="noopener noreferrer" class="">Documentation</a></strong>: Comprehensive guides and API references</li>
<li class=""><strong>🚀 <a href="https://llamastack.github.io/docs/getting_started/" target="_blank" rel="noopener noreferrer" class="">Getting Started</a></strong>: Quick setup tutorials</li>
<li class=""><strong>🔧 <a href="https://llamastack.github.io/docs/providers/openai" target="_blank" rel="noopener noreferrer" class="">OpenAI Implementation Guide</a></strong>: Detailed compatibility examples</li>
<li class=""><strong>🔌 <a href="https://llamastack.github.io/docs/building_applications/" target="_blank" rel="noopener noreferrer" class="">MCP Integration</a></strong>: Tool ecosystem and connector guides</li>
<li class=""><strong>💬 <a href="https://github.com/llamastack/llama-stack" target="_blank" rel="noopener noreferrer" class="">Community</a></strong>: Join discussions and contribute</li>
</ul>
<p>The future of AI infrastructure is open, interoperable, and under your control. Welcome to Llama Stack.</p>]]></content>
        <author>
            <name>Francisco Javier Arceo</name>
            <uri>https://github.com/franciscojavierarceo</uri>
        </author>
        <author>
            <name>Charlie Doern</name>
            <uri>https://github.com/cdoern</uri>
        </author>
        <category label="openai-compatibility" term="openai-compatibility"/>
        <category label="open-responses" term="open-responses"/>
        <category label="enterprise" term="enterprise"/>
        <category label="mcp" term="mcp"/>
        <category label="connectors" term="connectors"/>
    </entry>
    <entry>
        <title type="html"><![CDATA[Your Agent, Your Rules: Building Powerful Agents with the Responses API in Llama Stack]]></title>
        <id>https://llamastack.github.io/blog/responses-api</id>
        <link href="https://llamastack.github.io/blog/responses-api"/>
        <updated>2026-03-18T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[The Responses API is rapidly emerging as one of the most influential interfaces for building AI agents. It handles multi-step reasoning, tool orchestration, and conversational state in a single interaction, which is a big improvement over the manual orchestration loops that developers had to build on top of chat completion APIs. Llama Stack's implementation of the Responses API brings these capabilities to the open source world, where you can choose your own models and run on your own infrastructure.]]></summary>
        <content type="html"><![CDATA[<p>The <a href="https://developers.openai.com/blog/responses-api" target="_blank" rel="noopener noreferrer" class="">Responses API</a> is rapidly emerging as one of the most influential interfaces for building AI agents. It handles multi-step reasoning, tool orchestration, and conversational state in a single interaction, which is a big improvement over the manual orchestration loops that developers had to build on top of chat completion APIs. Llama Stack's implementation of the Responses API brings these capabilities to the open source world, where you can choose your own models and run on your own infrastructure.</p>
<p>This post covers why the Responses API matters, what Llama Stack's implementation enables, and how it connects to the broader move toward open agent standards like Open Responses.</p>
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="why-the-responses-api">Why the Responses API?<a href="https://llamastack.github.io/blog/responses-api#why-the-responses-api" class="hash-link" aria-label="Direct link to Why the Responses API?" title="Direct link to Why the Responses API?" translate="no">​</a></h2>
<p>Before the Responses API, building an agent that could use tools was a multi-step exercise in client-side orchestration. Your application had to call the model with a list of available tools, inspect the response for tool call requests, execute those tools, send the results back, and repeat until the model produced a final answer. All of the state management, error handling, and retry logic lived in your code.</p>
<p>This approach put a real burden on application developers. The orchestration logic got duplicated across every application, and subtle mistakes in state management could lead to poor accuracy or unnecessary model calls.</p>
<p>The Responses API moves this orchestration to the server. The client sends a question along with a set of available tools and documents, and the server handles the planning, tool execution, and synthesis internally. Your client code gets much simpler, and the behavior is more consistent because the orchestration logic is shared rather than reimplemented by every application.</p>
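<p>To make this concrete, here is a minimal sketch of a single server-orchestrated call. It assumes a Llama Stack server at <code>http://localhost:8321/v1</code>; the model name and vector store ID are illustrative placeholders:</p>
<pre><code class="language-python">from openai import OpenAI

# Point the standard OpenAI client at a Llama Stack server
# (the base_url, model, and vector store ID below are placeholders)
client = OpenAI(base_url="http://localhost:8321/v1", api_key="none")

# One call: the server plans, runs tools, and synthesizes the final answer
response = client.responses.create(
    model="llama3.1:8b",
    input="Summarize the key risks in our deployment checklist.",
    tools=[{"type": "file_search", "vector_store_ids": ["vs_abc123"]}],
)
print(response.output_text)
</code></pre>
<p>Compare that to the chat-completions pattern, where the loop of inspecting tool calls, executing them, and resending results would all live in your application.</p>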
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="what-llama-stack-brings-to-the-table">What Llama Stack brings to the table<a href="https://llamastack.github.io/blog/responses-api#what-llama-stack-brings-to-the-table" class="hash-link" aria-label="Direct link to What Llama Stack brings to the table" title="Direct link to What Llama Stack brings to the table" translate="no">​</a></h2>
<p>Llama Stack is an open source server for building AI applications. It provides a unified set of APIs for inference, RAG, tool calling, safety, evaluation, and more, backed by a pluggable provider architecture that lets you swap components without changing application code.</p>
<p>Llama Stack implements the Responses API with support for built-in RAG through <code>file_search</code>, automated multi-tool orchestration through the Model Context Protocol (MCP), conversation state management, and compatibility with the OpenAI client ecosystem.</p>
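<p>Conversation state management, for example, means the server can carry a multi-turn exchange for you. A minimal sketch, reusing the client from above (the model name is a placeholder):</p>
<pre><code class="language-python">first = client.responses.create(
    model="llama3.1:8b",
    input="What does our on-call rotation policy say about handoffs?",
)

# The server recalls the prior turn; no manual history management needed
followup = client.responses.create(
    model="llama3.1:8b",
    input="And who is responsible for updating it?",
    previous_response_id=first.id,
)
print(followup.output_text)
</code></pre>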
<p>But the interesting part is what Llama Stack adds beyond the API surface itself.</p>
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="model-freedom">Model freedom<a href="https://llamastack.github.io/blog/responses-api#model-freedom" class="hash-link" aria-label="Direct link to Model freedom" title="Direct link to Model freedom" translate="no">​</a></h3>
<p>With a proprietary hosted service, the Responses API is tied to a specific set of models from a single provider. With Llama Stack, you can use any model accessible through its inference providers: open source models like the Llama family, fine-tuned models you've created yourself, or optimized models from the broader ecosystem. The same Responses API interface works regardless of which model backs it. You can start with a small model during development, scale up for production, or swap models entirely, and your application code stays the same.</p>
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="data-sovereignty">Data sovereignty<a href="https://llamastack.github.io/blog/responses-api#data-sovereignty" class="hash-link" aria-label="Direct link to Data sovereignty" title="Direct link to Data sovereignty" translate="no">​</a></h3>
<p>If you work in a regulated industry like finance, healthcare, or government, sending sensitive documents to a third-party cloud service is often a non-starter. Llama Stack lets you run the entire stack on your own infrastructure: the model, the vector store for RAG, and the tool execution environment. Documents stay within your security perimeter, and the agent's reasoning about those documents does too.</p>
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="open-extensible-architecture">Open, extensible architecture<a href="https://llamastack.github.io/blog/responses-api#open-extensible-architecture" class="hash-link" aria-label="Direct link to Open, extensible architecture" title="Direct link to Open, extensible architecture" translate="no">​</a></h3>
<p>Llama Stack's provider architecture means you are not locked into a single implementation for any component. Need FAISS for your vector store in development and Milvus in production? Change a configuration setting. Want to use Ollama locally and a cloud inference provider in production? Same application code, different distribution. This flexibility extends across the full Llama Stack API surface, not just inference.</p>
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="private-rag-with-file_search">Private RAG with <code>file_search</code><a href="https://llamastack.github.io/blog/responses-api#private-rag-with-file_search" class="hash-link" aria-label="Direct link to private-rag-with-file_search" title="Direct link to private-rag-with-file_search" translate="no">​</a></h2>
<p>Retrieval-augmented generation (RAG) grounds a model's responses in authoritative documents, which reduces hallucination and enables accurate answers from private knowledge bases.</p>
<p>The Responses API formalizes RAG with the <code>file_search</code> tool. You create a vector store, upload documents to it, and then include <code>file_search</code> as an available tool when calling the Responses API. The model generates search queries, retrieves relevant passages, and synthesizes them into a grounded answer, all in a single API call.</p>
<p>With Llama Stack, this entire pipeline runs on your infrastructure. Document ingestion, embedding, storage, retrieval, and synthesis all happen locally. The response includes references to the source passages, so your application can provide citations for verification.</p>
<p>This makes it practical to build RAG applications over sensitive internal documents like compliance policies, medical records, or proprietary research, with confidence that the data never leaves your environment.</p>
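<p>A minimal end-to-end sketch of that pipeline, using the same OpenAI-compatible client as before (the file name and model are placeholders):</p>
<pre><code class="language-python"># Create a vector store and index a document (embedding and storage are server-side)
vs = client.vector_stores.create(name="compliance-docs")
with open("retention_policy.md", "rb") as f:
    uploaded = client.files.create(file=f, purpose="assistants")
client.vector_stores.files.create(vector_store_id=vs.id, file_id=uploaded.id)

# Ask a grounded question; retrieval and synthesis happen in one call
response = client.responses.create(
    model="llama3.1:8b",
    input="What does the retention policy say about backups?",
    tools=[{"type": "file_search", "vector_store_ids": [vs.id]}],
)
print(response.output_text)
</code></pre>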
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="multi-tool-orchestration-with-mcp">Multi-tool orchestration with MCP<a href="https://llamastack.github.io/blog/responses-api#multi-tool-orchestration-with-mcp" class="hash-link" aria-label="Direct link to Multi-tool orchestration with MCP" title="Direct link to Multi-tool orchestration with MCP" translate="no">​</a></h2>
<p>The Responses API gets especially interesting when an agent needs to coordinate multiple tools to answer a complex question. Consider a question like: "What parks are in Rhode Island, and are there any upcoming events at them?" Answering this requires discovering available tools, searching for parks, querying events for each park found, and synthesizing all the results.</p>
<p>With Llama Stack's Responses API and MCP integration, this entire workflow happens within a single API call. The model discovers available tools from a connected MCP server, plans and executes a sequence of tool calls, and produces a consolidated answer. The client application doesn't need to write any orchestration logic.</p>
<p>MCP is an open standard for tool integration, so the ecosystem of available tools is broad and growing. Any MCP server can be connected to Llama Stack and used by the Responses API, whether it provides access to databases, internal services, or external data sources.</p>
<p>Llama Stack also provides fine-grained control over tool access. You can restrict which tools are available for a given request, pass per-request authentication headers to MCP servers so that an agent can only access data for the current user, and configure tool behavior without modifying the agent's prompt. This matters a lot in production deployments where security and access control are real concerns.</p>
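<p>A hedged sketch of such a request follows; the MCP server URL, tool names, and auth header are placeholders for whatever your deployment exposes, and exact field support may vary by version:</p>
<pre><code class="language-python">response = client.responses.create(
    model="llama3.1:8b",
    input="What parks are in Rhode Island, and are there any upcoming events at them?",
    tools=[{
        "type": "mcp",
        "server_label": "parks",
        "server_url": "http://localhost:3005/sse",          # illustrative MCP server
        "allowed_tools": ["search_parks", "list_events"],   # restrict the tool surface
        "headers": {"Authorization": "Bearer user-token"},  # per-request auth passthrough
    }],
)
print(response.output_text)
</code></pre>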
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="framework-compatibility">Framework compatibility<a href="https://llamastack.github.io/blog/responses-api#framework-compatibility" class="hash-link" aria-label="Direct link to Framework compatibility" title="Direct link to Framework compatibility" translate="no">​</a></h2>
<p>Llama Stack exposes OpenAI-compatible endpoints at <code>/v1</code>, so you can use the official OpenAI Python client, the Llama Stack client, or any other client that speaks the OpenAI API. They all work the same way.</p>
<p>If you have existing code built with the OpenAI client, migrating to Llama Stack means pointing your client at your Llama Stack server. That's it. This also applies to frameworks like LangChain that build on top of OpenAI's API. Switching the inference backend to Llama Stack requires changing a constructor parameter, not rewriting your agent logic.</p>
<p>This drop-in compatibility has practical implications beyond convenience. You can develop and test against a local Llama Stack server, deploy against a production Llama Stack distribution, or switch between Llama Stack and other OpenAI-compatible providers, all with the same application code.</p>
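<p>In code, the migration is a constructor change. The sketch below shows the OpenAI client and a LangChain chat model pointed at a Llama Stack server; the URL and model name are illustrative:</p>
<pre><code class="language-python">from openai import OpenAI
from langchain_openai import ChatOpenAI

# Same client classes, different base_url; no agent logic changes
client = OpenAI(base_url="http://localhost:8321/v1", api_key="none")
llm = ChatOpenAI(model="llama3.1:8b", base_url="http://localhost:8321/v1", api_key="none")
</code></pre>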
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="toward-an-open-standard-open-responses">Toward an open standard: Open Responses<a href="https://llamastack.github.io/blog/responses-api#toward-an-open-standard-open-responses" class="hash-link" aria-label="Direct link to Toward an open standard: Open Responses" title="Direct link to Toward an open standard: Open Responses" translate="no">​</a></h2>
<p>When Llama Stack first implemented the Responses API, the specification was proprietary. Llama Stack had to track a moving target, and there was always a gap between when OpenAI added a feature and when Llama Stack could implement it.</p>
<p>The <a href="https://www.openresponses.org/" target="_blank" rel="noopener noreferrer" class="">Open Responses specification</a> changes this. Open Responses is an open source specification backed by a broad community including OpenAI, Hugging Face, and providers like Ollama, vLLM, and LM Studio. It formalizes the core concepts of the Responses API into an open standard: items as the atomic unit of context, semantic streaming events, and the agentic loop of reasoning and tool invocation.</p>
<p>For Llama Stack, Open Responses provides a stable, community-governed specification to build against rather than a proprietary one. It also means that Llama Stack's Responses API implementation is part of a broader ecosystem of interoperable providers. Applications built against the Open Responses specification can run on Llama Stack, on OpenAI, on Hugging Face's infrastructure, or on local providers like Ollama, without code changes.</p>
<p>The Open Responses specification also introduces concepts that matter for production deployments:</p>
<ul>
<li class=""><strong>Reasoning visibility:</strong> The specification formalizes how models expose their reasoning process, which enables audit trails and governance workflows.</li>
<li class=""><strong>Internal vs. external tools:</strong> A clear distinction between tools executed within the provider's infrastructure (like <code>file_search</code>) and tools executed by the client, so developers know exactly where computation happens.</li>
<li class=""><strong>Extensibility without fragmentation:</strong> Providers can add custom capabilities while maintaining a stable, interoperable core.</li>
</ul>
<p>For the Llama Stack community, this means that investing in the Responses API is about more than compatibility with one vendor. It's about building on an open standard that the industry is starting to converge around.</p>
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="getting-started">Getting started<a href="https://llamastack.github.io/blog/responses-api#getting-started" class="hash-link" aria-label="Direct link to Getting started" title="Direct link to Getting started" translate="no">​</a></h2>
<p>If you're new to Llama Stack, the <a href="https://llamastack.github.io/docs/getting_started/" target="_blank" rel="noopener noreferrer" class="">Getting Started guide</a> will walk you through setting up a server with your preferred inference provider. From there, the <a href="https://llamastack.github.io/docs/providers/openai" target="_blank" rel="noopener noreferrer" class="">OpenAI Implementation Guide</a> has examples of using the Responses API for everything from simple text generation to multi-tool agentic workflows.</p>
<p>The Responses API is still evolving, both in Llama Stack and in the Open Responses specification, and contributions are welcome. Whether it's implementing new features, improving test coverage, or reporting issues, the project benefits from developers who are building real applications and sharing what they learn.</p>]]></content>
        <author>
            <name>Bill Murdock</name>
            <uri>https://github.com/jwm4</uri>
        </author>
        <category label="responses-api" term="responses-api"/>
        <category label="agents" term="agents"/>
        <category label="rag" term="rag"/>
        <category label="mcp" term="mcp"/>
        <category label="open-responses" term="open-responses"/>
    </entry>
    <entry>
        <title type="html"><![CDATA[Building a Self-Improving Agent with Llama Stack]]></title>
        <id>https://llamastack.github.io/blog/building-agentic-flows-with-conversations-and-responses</id>
        <link href="https://llamastack.github.io/blog/building-agentic-flows-with-conversations-and-responses"/>
        <updated>2026-03-01T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[What if your AI agent could improve itself? Most agent tutorials show a single loop — user asks a question, the agent calls some tools, returns an answer. But what happens when you need to systematically improve your agent's behavior over time?]]></summary>
        <content type="html"><![CDATA[<p>What if your AI agent could improve itself? Most agent tutorials show a single loop — user asks a question, the agent calls some tools, returns an answer. But what happens when you need to systematically improve your agent's behavior over time?</p>
<p>In this post, we build a <strong>ResearchAgent</strong> that answers questions from an internal engineering knowledge base — and gets better at it automatically. The agent uses the Responses API agentic loop with <code>file_search</code> and client-side tools to research questions, and it owns its own system prompt. Every N calls, it benchmarks itself by using a different model to judge the results, and rewrites its own prompt via the Prompts API.</p>
<p>This is literally self-referential: <strong>a Llama Stack agent evaluating and improving itself</strong> using the Responses API, Prompts API, and Vector Stores as its toolkit.</p>
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="what-were-building">What We're Building<a href="https://llamastack.github.io/blog/building-agentic-flows-with-conversations-and-responses#what-were-building" class="hash-link" aria-label="Direct link to What We're Building" title="Direct link to What We're Building" translate="no">​</a></h2>
<p>A single <code>ResearchAgent</code> class that does two things:</p>
<ol>
<li class=""><strong>Research</strong> (agentic): Uses the Responses API <code>while True</code> loop with server-side <code>file_search</code> and client-side function tools (<code>read_local_file</code>, <code>index_document</code>, <code>list_local_files</code>). The agent decides what to search, discovers unindexed local files, reads them, indexes the relevant ones, and searches again with the enriched knowledge base.</li>
<li class=""><strong>Self-improvement</strong> (deterministic): Every N calls to <code>research()</code>, the agent runs <code>evaluate_self()</code> to benchmark against test cases and <code>improve_self()</code> to rewrite its own system prompt. This is a fixed sequence — no LLM-driven tool selection, just the agent measuring and improving its own performance.</li>
</ol>
<div class="language-text codeBlockContainer_Ckt0 theme-code-block" style="--prism-background-color:hsl(220, 13%, 18%);--prism-color:hsl(220, 14%, 71%)"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-text codeBlock_bY9V thin-scrollbar" style="background-color:hsl(220, 13%, 18%);color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">┌──────────────────────────────────────────────────────────┐</span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">│  ResearchAgent                                           │</span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">│                                                          │</span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">│  research(question)                                      │</span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">│    Responses API agentic loop (while True):              │</span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">│      Server-side: file_search → Vector Store             │</span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">│      Client-side: read_local_file, index_document,       │</span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">│                   list_local_files                       │</span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">│    Increments call counter; triggers self-improvement    │</span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">│    every N calls                                         │</span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">│                                                          │</span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">│  evaluate_self()                                         │</span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">│    Run all test cases → judge answers (Responses API)    │</span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">│    → log scores (SQLite ledger)                          │</span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">│                                                          │</span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">│  improve_self()                       
                   │</span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">│    Read feedback → propose new prompt (Responses API)    │</span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">│    → save new version (Prompts API)                      │</span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">└──────────────────────────────────────────────────────────┘</span><br></span></code></pre></div></div>
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="prerequisites">Prerequisites<a href="https://llamastack.github.io/blog/building-agentic-flows-with-conversations-and-responses#prerequisites" class="hash-link" aria-label="Direct link to Prerequisites" title="Direct link to Prerequisites" translate="no">​</a></h2>
<ul>
<li class=""><a href="https://ollama.com/" target="_blank" rel="noopener noreferrer" class="">Ollama</a> running locally with two models pulled: <code>llama3.1:8b</code> for the research agent and <code>gpt-oss:20b</code> as the judge</li>
<li class="">A Llama Stack server using the starter distribution, pointed at Ollama via the <code>OLLAMA_URL</code> environment variable</li>
<li class="">Python SDK: <code>uv pip install llama-stack-client</code></li>
</ul>
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="the-research-loop">The Research Loop<a href="https://llamastack.github.io/blog/building-agentic-flows-with-conversations-and-responses#the-research-loop" class="hash-link" aria-label="Direct link to The Research Loop" title="Direct link to The Research Loop" translate="no">​</a></h2>
<p>The research agent is the heart of the system — and the showcase for the Responses API agentic pattern. Unlike a simple single-call RAG agent, it has real decisions to make: the vector store might not have enough context, so the agent can discover local files, read them, index the relevant ones, and search again.</p>
<p>It has one server-side tool and three client-side function tools:</p>
<ul>
<li class=""><strong><code>file_search</code></strong> (server-side): Searches the vector store for relevant documents. The Responses API executes this automatically — no client code needed.</li>
<li class=""><strong><code>read_local_file(path)</code></strong>: Reads an unindexed local file (e.g., a newly written postmortem not yet in the knowledge base).</li>
<li class=""><strong><code>index_document(file_path)</code></strong>: Uploads a file to the vector store via the Files API and <code>vector_stores.files.create()</code>. This is the key insight: the agent actively curates the knowledge base.</li>
<li class=""><strong><code>list_local_files(directory)</code></strong>: Discovers available <code>.md</code> and <code>.txt</code> files in a directory.</li>
</ul>
<p>The internal <code>_run_query()</code> method is the standard Responses API agentic loop — keep calling <code>responses.create()</code> until the model stops emitting tool calls:</p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-background-color:hsl(220, 13%, 18%);--prism-color:hsl(220, 14%, 71%)"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="background-color:hsl(220, 13%, 18%);color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token keyword" style="color:hsl(286, 60%, 67%)">class</span><span class="token plain"> </span><span class="token class-name" style="color:hsl(29, 54%, 61%)">ResearchAgent</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">    </span><span class="token keyword" style="color:hsl(286, 60%, 67%)">def</span><span class="token plain"> </span><span class="token function" style="color:hsl(207, 82%, 66%)">__init__</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">(</span><span class="token plain">self</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"> client</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"> model</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"> vector_store_id</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"> prompt_id</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"> </span><span class="token operator" style="color:hsl(207, 82%, 66%)">**</span><span class="token plain">kwargs</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">)</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">        self</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">client </span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token plain"> client</span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">        self</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">model </span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token plain"> model</span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">        self</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">vector_store_id </span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token plain"> vector_store_id</span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">        self</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">prompt_id </span><span class="token operator" style="color:hsl(207, 
82%, 66%)">=</span><span class="token plain"> prompt_id  </span><span class="token comment" style="color:hsl(220, 10%, 40%)"># The agent owns its prompt</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">        self</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">_call_count </span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token plain"> </span><span class="token number" style="color:hsl(29, 54%, 61%)">0</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">        self</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">_tools </span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token plain"> </span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">{</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">            </span><span class="token string" style="color:hsl(95, 38%, 62%)">"read_local_file"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">:</span><span class="token plain"> self</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">_read_local_file</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">            </span><span class="token string" style="color:hsl(95, 38%, 62%)">"index_document"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">:</span><span class="token plain"> self</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">_index_document</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">            </span><span class="token string" style="color:hsl(95, 38%, 62%)">"list_local_files"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">:</span><span class="token plain"> self</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">_list_local_files</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">        </span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">}</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">        </span><span class="token comment" style="color:hsl(220, 10%, 40%)"># Also accepts: judge_model, ledger, test_cases, optimize_every</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain" 
style="display:inline-block"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">    </span><span class="token keyword" style="color:hsl(286, 60%, 67%)">def</span><span class="token plain"> </span><span class="token function" style="color:hsl(207, 82%, 66%)">_run_query</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">(</span><span class="token plain">self</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"> question</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"> system_prompt</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">)</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">        </span><span class="token triple-quoted-string string" style="color:hsl(95, 38%, 62%)">"""Agentic loop: search, read local files, index, repeat."""</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">        inputs </span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token plain"> question</span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">        tools </span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token plain"> self</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">_tool_schemas</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">(</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">        </span><span class="token keyword" style="color:hsl(286, 60%, 67%)">while</span><span class="token plain"> </span><span class="token boolean" style="color:hsl(29, 54%, 61%)">True</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">            response </span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token plain"> self</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">client</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">responses</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">create</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">(</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">                model</span><span class="token operator" 
style="color:hsl(207, 82%, 66%)">=</span><span class="token plain">self</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">model</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">                </span><span class="token builtin" style="color:hsl(95, 38%, 62%)">input</span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token plain">inputs</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">                instructions</span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token plain">system_prompt</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">                tools</span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token plain">tools</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">                stream</span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token boolean" style="color:hsl(29, 54%, 61%)">False</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">            </span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">            </span><span class="token comment" style="color:hsl(220, 10%, 40%)"># file_search is handled server-side; collect client-side calls</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">            function_calls </span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token plain"> </span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">[</span><span class="token plain">o </span><span class="token keyword" style="color:hsl(286, 60%, 67%)">for</span><span class="token plain"> o </span><span class="token keyword" style="color:hsl(286, 60%, 67%)">in</span><span class="token plain"> response</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">output </span><span class="token keyword" style="color:hsl(286, 60%, 67%)">if</span><span class="token plain"> o</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span 
class="token builtin" style="color:hsl(95, 38%, 62%)">type</span><span class="token plain"> </span><span class="token operator" style="color:hsl(207, 82%, 66%)">==</span><span class="token plain"> </span><span class="token string" style="color:hsl(95, 38%, 62%)">"function_call"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">]</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">            </span><span class="token keyword" style="color:hsl(286, 60%, 67%)">if</span><span class="token plain"> </span><span class="token keyword" style="color:hsl(286, 60%, 67%)">not</span><span class="token plain"> function_calls</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">                </span><span class="token keyword" style="color:hsl(286, 60%, 67%)">return</span><span class="token plain"> response</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">output_text  </span><span class="token comment" style="color:hsl(220, 10%, 40%)"># Done — no more tool calls</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">            </span><span class="token comment" style="color:hsl(220, 10%, 40%)"># Execute each function call and feed results back</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">            inputs </span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token plain"> </span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">[</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">]</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">            </span><span class="token keyword" style="color:hsl(286, 60%, 67%)">for</span><span class="token plain"> fc </span><span class="token keyword" style="color:hsl(286, 60%, 67%)">in</span><span class="token plain"> function_calls</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">                result </span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token plain"> self</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">_tools</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">[</span><span class="token plain">fc</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">name</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">]</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">(</span><span class="token 
operator" style="color:hsl(207, 82%, 66%)">**</span><span class="token plain">json</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">loads</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">(</span><span class="token plain">fc</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">arguments</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">)</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">                inputs</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">append</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">(</span><span class="token plain">fc</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">                inputs</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">append</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">(</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">                    </span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">{</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">                        </span><span class="token string" style="color:hsl(95, 38%, 62%)">"type"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">:</span><span class="token plain"> </span><span class="token string" style="color:hsl(95, 38%, 62%)">"function_call_output"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">                        </span><span class="token string" style="color:hsl(95, 38%, 62%)">"call_id"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">:</span><span class="token plain"> fc</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">call_id</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">                        </span><span class="token string" style="color:hsl(95, 38%, 62%)">"output"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">:</span><span class="token plain"> result</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">                    </span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">}</span><span class="token 
plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">                </span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">)</span><br></span></code></pre></div></div>
<p>The public <code>research()</code> method reads the agent's current prompt, runs the agentic loop, and increments a counter. Every N calls, it triggers self-improvement:</p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-background-color:hsl(220, 13%, 18%);--prism-color:hsl(220, 14%, 71%)"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="background-color:hsl(220, 13%, 18%);color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token keyword" style="color:hsl(286, 60%, 67%)">class</span><span class="token plain"> </span><span class="token class-name" style="color:hsl(29, 54%, 61%)">ResearchAgent</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">    </span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">    </span><span class="token keyword" style="color:hsl(286, 60%, 67%)">def</span><span class="token plain"> </span><span class="token function" style="color:hsl(207, 82%, 66%)">research</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">(</span><span class="token plain">self</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"> question</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">)</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">        </span><span class="token triple-quoted-string string" style="color:hsl(95, 38%, 62%)">"""Answer a question.  
Automatically self-improves every N calls."""</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">        current </span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token plain"> self</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">client</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">prompts</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">retrieve</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">(</span><span class="token plain">self</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">prompt_id</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">        answer </span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token plain"> self</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">_run_query</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">(</span><span class="token plain">question</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"> current</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">prompt</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">        self</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">_call_count </span><span class="token operator" style="color:hsl(207, 82%, 66%)">+=</span><span class="token plain"> </span><span class="token number" style="color:hsl(29, 54%, 61%)">1</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">        </span><span class="token keyword" style="color:hsl(286, 60%, 67%)">if</span><span class="token plain"> self</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">test_cases </span><span class="token keyword" style="color:hsl(286, 60%, 67%)">and</span><span class="token plain"> self</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">_call_count </span><span class="token operator" style="color:hsl(207, 82%, 66%)">%</span><span class="token plain"> self</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">optimize_every </span><span class="token operator" style="color:hsl(207, 82%, 66%)">==</span><span class="token plain"> </span><span class="token number" style="color:hsl(29, 54%, 61%)">0</span><span class="token punctuation" style="color:hsl(220, 14%, 
71%)">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">            self</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">evaluate_self</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">(</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">            self</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">improve_self</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">(</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">        </span><span class="token keyword" style="color:hsl(286, 60%, 67%)">return</span><span class="token plain"> answer</span><br></span></code></pre></div></div>
<p>In a typical call, the agent searches the vector store via <code>file_search</code> (handled server-side). If the retrieved context is insufficient — say, a question about a recent outage whose postmortem hasn't been indexed yet — the agent calls <code>list_local_files</code> to discover available documents, <code>read_local_file</code> to inspect the relevant one, and <code>index_document</code> to add it to the vector store. Then it searches again with the enriched store and writes its final answer.</p>
<p>The <code>index_document</code> tool is worth highlighting — it's the agent actively curating its own knowledge base:</p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-background-color:hsl(220, 13%, 18%);--prism-color:hsl(220, 14%, 71%)"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="background-color:hsl(220, 13%, 18%);color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token keyword" style="color:hsl(286, 60%, 67%)">class</span><span class="token plain"> </span><span class="token class-name" style="color:hsl(29, 54%, 61%)">ResearchAgent</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">    </span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">    </span><span class="token keyword" style="color:hsl(286, 60%, 67%)">def</span><span class="token plain"> </span><span class="token function" style="color:hsl(207, 82%, 66%)">_index_document</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">(</span><span class="token plain">self</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"> file_path</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">)</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">        </span><span class="token triple-quoted-string string" style="color:hsl(95, 38%, 62%)">"""Upload a local file to the vector store so it becomes searchable."""</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">        </span><span class="token builtin" style="color:hsl(95, 38%, 62%)">file</span><span class="token plain"> </span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token plain"> self</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">client</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">files</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">create</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">(</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">            </span><span class="token builtin" style="color:hsl(95, 38%, 62%)">file</span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token 
builtin" style="color:hsl(95, 38%, 62%)">open</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">(</span><span class="token plain">file_path</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"> </span><span class="token string" style="color:hsl(95, 38%, 62%)">"rb"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">)</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"> purpose</span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token string" style="color:hsl(95, 38%, 62%)">"assistants"</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">        </span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">        attach </span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token plain"> self</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">client</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">vector_stores</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">files</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">create</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">(</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">            vector_store_id</span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token plain">self</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">vector_store_id</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"> file_id</span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token builtin" style="color:hsl(95, 38%, 62%)">file</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token builtin" style="color:hsl(95, 38%, 62%)">id</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">        </span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">        </span><span class="token keyword" style="color:hsl(286, 60%, 67%)">while</span><span class="token plain"> attach</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">status </span><span class="token operator" style="color:hsl(207, 82%, 66%)">==</span><span class="token plain"> </span><span class="token string" style="color:hsl(95, 38%, 62%)">"in_progress"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">:</span><span class="token plain"></span><br></span><span 
class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">            time</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">sleep</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">(</span><span class="token number" style="color:hsl(29, 54%, 61%)">0.5</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">            attach </span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token plain"> self</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">client</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">vector_stores</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">files</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">retrieve</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">(</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">                vector_store_id</span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token plain">self</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">vector_store_id</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"> file_id</span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token builtin" style="color:hsl(95, 38%, 62%)">file</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token builtin" style="color:hsl(95, 38%, 62%)">id</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">            </span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">        </span><span class="token keyword" style="color:hsl(286, 60%, 67%)">return</span><span class="token plain"> </span><span class="token string-interpolation string" style="color:hsl(95, 38%, 62%)">f"Indexed </span><span class="token string-interpolation interpolation punctuation" style="color:hsl(220, 14%, 71%)">{</span><span class="token string-interpolation interpolation">file_path</span><span class="token string-interpolation interpolation punctuation" style="color:hsl(220, 14%, 71%)">}</span><span class="token string-interpolation string" style="color:hsl(95, 38%, 62%)"> (file_id=</span><span class="token string-interpolation interpolation punctuation" style="color:hsl(220, 14%, 71%)">{</span><span class="token string-interpolation interpolation builtin" style="color:hsl(95, 38%, 62%)">file</span><span class="token string-interpolation interpolation punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token string-interpolation interpolation builtin" style="color:hsl(95, 
38%, 62%)">id</span><span class="token string-interpolation interpolation punctuation" style="color:hsl(220, 14%, 71%)">}</span><span class="token string-interpolation string" style="color:hsl(95, 38%, 62%)">, status=</span><span class="token string-interpolation interpolation punctuation" style="color:hsl(220, 14%, 71%)">{</span><span class="token string-interpolation interpolation">attach</span><span class="token string-interpolation interpolation punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token string-interpolation interpolation">status</span><span class="token string-interpolation interpolation punctuation" style="color:hsl(220, 14%, 71%)">}</span><span class="token string-interpolation string" style="color:hsl(95, 38%, 62%)">)"</span><br></span></code></pre></div></div>
<p>This uses the Files API to upload the document and <code>vector_stores.files.create()</code> to attach it to the store. The method then polls until indexing completes; once it does, the file is searchable by <code>file_search</code> in subsequent turns of the same query, or in any future query.</p>
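<p>As a reminder of how a method like this reaches the model: it has to be declared as a function tool on the Responses call. The sketch below is a hedged guess at that declaration, following the OpenAI-compatible function-tool shape; the schema the post's agent actually registers may differ:</p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar"><code class="codeBlockLines_e6Vv"># Illustrative declaration only; the post's actual tool schemas are
# defined elsewhere and may differ.
INDEX_DOCUMENT_TOOL = {
    "type": "function",
    "name": "index_document",
    "description": "Upload a local file to the vector store so it becomes searchable.",
    "parameters": {
        "type": "object",
        "properties": {
            "file_path": {
                "type": "string",
                "description": "Path of the local document to index.",
            }
        },
        "required": ["file_path"],
    },
}
</code></pre></div></div>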
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="self-improvement">Self-Improvement<a href="https://llamastack.github.io/blog/building-agentic-flows-with-conversations-and-responses#self-improvement" class="hash-link" aria-label="Direct link to Self-Improvement" title="Direct link to Self-Improvement" translate="no">​</a></h2>
<p>The self-improvement cycle is where the agent benchmarks itself, then rewrites its own prompt based on the feedback.</p>
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="evaluate_self">evaluate_self<a href="https://llamastack.github.io/blog/building-agentic-flows-with-conversations-and-responses#evaluate_self" class="hash-link" aria-label="Direct link to evaluate_self" title="Direct link to evaluate_self" translate="no">​</a></h3>
<p><code>evaluate_self</code> runs the agent on every test case using its current system prompt, judges each answer with the judge model, and logs the scores and the judge's reasoning to the ledger:</p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-background-color:hsl(220, 13%, 18%);--prism-color:hsl(220, 14%, 71%)"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="background-color:hsl(220, 13%, 18%);color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token keyword" style="color:hsl(286, 60%, 67%)">class</span><span class="token plain"> </span><span class="token class-name" style="color:hsl(29, 54%, 61%)">ResearchAgent</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">    </span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">    </span><span class="token keyword" style="color:hsl(286, 60%, 67%)">def</span><span class="token plain"> </span><span class="token function" style="color:hsl(207, 82%, 66%)">evaluate_self</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">(</span><span class="token plain">self</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">)</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">        </span><span class="token triple-quoted-string string" style="color:hsl(95, 38%, 62%)">"""Benchmark against test cases and log scores."""</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">        current </span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token plain"> self</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">client</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">prompts</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">retrieve</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">(</span><span class="token plain">self</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">prompt_id</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">        results </span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token plain"> </span><span class="token punctuation" style="color:hsl(220, 14%, 
71%)">[</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">]</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">        </span><span class="token keyword" style="color:hsl(286, 60%, 67%)">for</span><span class="token plain"> tc </span><span class="token keyword" style="color:hsl(286, 60%, 67%)">in</span><span class="token plain"> self</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">test_cases</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">            answer </span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token plain"> self</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">_run_query</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">(</span><span class="token plain">tc</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">[</span><span class="token string" style="color:hsl(95, 38%, 62%)">"question"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">]</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"> current</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">prompt</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">            judgment </span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token plain"> self</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">client</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">responses</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">create</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">(</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">                model</span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token plain">self</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">judge_model</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">                </span><span class="token builtin" style="color:hsl(95, 38%, 62%)">input</span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">(</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">                    </span><span class="token string-interpolation 
string" style="color:hsl(95, 38%, 62%)">f"Score the following answer on a scale of 0.0 to 1.0.\n\n"</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">                    </span><span class="token string-interpolation string" style="color:hsl(95, 38%, 62%)">f"Question: </span><span class="token string-interpolation interpolation punctuation" style="color:hsl(220, 14%, 71%)">{</span><span class="token string-interpolation interpolation">tc</span><span class="token string-interpolation interpolation punctuation" style="color:hsl(220, 14%, 71%)">[</span><span class="token string-interpolation interpolation string" style="color:hsl(95, 38%, 62%)">'question'</span><span class="token string-interpolation interpolation punctuation" style="color:hsl(220, 14%, 71%)">]</span><span class="token string-interpolation interpolation punctuation" style="color:hsl(220, 14%, 71%)">}</span><span class="token string-interpolation string" style="color:hsl(95, 38%, 62%)">\n"</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">                    </span><span class="token string-interpolation string" style="color:hsl(95, 38%, 62%)">f"Expected: </span><span class="token string-interpolation interpolation punctuation" style="color:hsl(220, 14%, 71%)">{</span><span class="token string-interpolation interpolation">tc</span><span class="token string-interpolation interpolation punctuation" style="color:hsl(220, 14%, 71%)">[</span><span class="token string-interpolation interpolation string" style="color:hsl(95, 38%, 62%)">'expected'</span><span class="token string-interpolation interpolation punctuation" style="color:hsl(220, 14%, 71%)">]</span><span class="token string-interpolation interpolation punctuation" style="color:hsl(220, 14%, 71%)">}</span><span class="token string-interpolation string" style="color:hsl(95, 38%, 62%)">\nActual: </span><span class="token string-interpolation interpolation punctuation" style="color:hsl(220, 14%, 71%)">{</span><span class="token string-interpolation interpolation">answer</span><span class="token string-interpolation interpolation punctuation" style="color:hsl(220, 14%, 71%)">}</span><span class="token string-interpolation string" style="color:hsl(95, 38%, 62%)">\n\n"</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">                    </span><span class="token string-interpolation string" style="color:hsl(95, 38%, 62%)">f'Respond with JSON: {{"score": &lt;float&gt;, "reasoning": "..."}}'</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">                </span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">)</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">                stream</span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token boolean" style="color:hsl(29, 54%, 61%)">False</span><span class="token punctuation" style="color:hsl(220, 14%, 
71%)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">            </span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">            score_data </span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token plain"> json</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">loads</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">(</span><span class="token plain">judgment</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">output_text</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">            results</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">append</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">(</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">{</span><span class="token operator" style="color:hsl(207, 82%, 66%)">**</span><span class="token plain">tc</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"> </span><span class="token string" style="color:hsl(95, 38%, 62%)">"actual"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">:</span><span class="token plain"> answer</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"> </span><span class="token operator" style="color:hsl(207, 82%, 66%)">**</span><span class="token plain">score_data</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">}</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">        avg_score </span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token plain"> </span><span class="token builtin" style="color:hsl(95, 38%, 62%)">sum</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">(</span><span class="token plain">r</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">[</span><span class="token string" style="color:hsl(95, 38%, 62%)">"score"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">]</span><span class="token plain"> </span><span class="token keyword" style="color:hsl(286, 60%, 67%)">for</span><span class="token plain"> r </span><span class="token keyword" style="color:hsl(286, 60%, 67%)">in</span><span class="token plain"> results</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">)</span><span class="token plain"> </span><span class="token operator" style="color:hsl(207, 82%, 66%)">/</span><span class="token plain"> </span><span class="token builtin" 
style="color:hsl(95, 38%, 62%)">len</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">(</span><span class="token plain">results</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">        self</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">ledger</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">log</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">(</span><span class="token plain">self</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">prompt_id</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"> current</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">version</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"> avg_score</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"> feedback</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">        </span><span class="token keyword" style="color:hsl(286, 60%, 67%)">return</span><span class="token plain"> </span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">{</span><span class="token string" style="color:hsl(95, 38%, 62%)">"results"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">:</span><span class="token plain"> results</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"> </span><span class="token string" style="color:hsl(95, 38%, 62%)">"average_score"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">:</span><span class="token plain"> avg_score</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"> </span><span class="token string" style="color:hsl(95, 38%, 62%)">"feedback"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">:</span><span class="token plain"> feedback</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">}</span><br></span></code></pre></div></div>
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="improve_self">improve_self<a href="https://llamastack.github.io/blog/building-agentic-flows-with-conversations-and-responses#improve_self" class="hash-link" aria-label="Direct link to improve_self" title="Direct link to improve_self" translate="no">​</a></h3>
<p><code>improve_self</code> reads the latest evaluation feedback from the ledger and uses the judge model to generate an improved system prompt, then saves it via the Prompts API:</p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-background-color:hsl(220, 13%, 18%);--prism-color:hsl(220, 14%, 71%)"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="background-color:hsl(220, 13%, 18%);color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token keyword" style="color:hsl(286, 60%, 67%)">class</span><span class="token plain"> </span><span class="token class-name" style="color:hsl(29, 54%, 61%)">ResearchAgent</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">    </span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">    </span><span class="token keyword" style="color:hsl(286, 60%, 67%)">def</span><span class="token plain"> </span><span class="token function" style="color:hsl(207, 82%, 66%)">improve_self</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">(</span><span class="token plain">self</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">)</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">        </span><span class="token triple-quoted-string string" style="color:hsl(95, 38%, 62%)">"""Propose and save an improved system prompt."""</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">        history </span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token plain"> self</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">ledger</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">history</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">(</span><span class="token plain">self</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">prompt_id</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">        latest </span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token plain"> history</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">[</span><span class="token operator" style="color:hsl(207, 82%, 66%)">-</span><span class="token number" 
style="color:hsl(29, 54%, 61%)">1</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">]</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">        current </span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token plain"> self</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">client</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">prompts</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">retrieve</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">(</span><span class="token plain">self</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">prompt_id</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">        response </span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token plain"> self</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">client</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">responses</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">create</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">(</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">            model</span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token plain">self</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">judge_model</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">            </span><span class="token builtin" style="color:hsl(95, 38%, 62%)">input</span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">(</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">                </span><span class="token string-interpolation string" style="color:hsl(95, 38%, 62%)">f"Improve this research agent's system prompt based on feedback.\n\n"</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">                </span><span class="token string-interpolation string" style="color:hsl(95, 38%, 62%)">f"Current prompt:\n</span><span class="token string-interpolation interpolation punctuation" style="color:hsl(220, 14%, 
71%)">{</span><span class="token string-interpolation interpolation">current</span><span class="token string-interpolation interpolation punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token string-interpolation interpolation">prompt</span><span class="token string-interpolation interpolation punctuation" style="color:hsl(220, 14%, 71%)">}</span><span class="token string-interpolation string" style="color:hsl(95, 38%, 62%)">\n\n"</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">                </span><span class="token string-interpolation string" style="color:hsl(95, 38%, 62%)">f"Feedback:\n</span><span class="token string-interpolation interpolation punctuation" style="color:hsl(220, 14%, 71%)">{</span><span class="token string-interpolation interpolation">latest</span><span class="token string-interpolation interpolation punctuation" style="color:hsl(220, 14%, 71%)">[</span><span class="token string-interpolation interpolation string" style="color:hsl(95, 38%, 62%)">'reasoning'</span><span class="token string-interpolation interpolation punctuation" style="color:hsl(220, 14%, 71%)">]</span><span class="token string-interpolation interpolation punctuation" style="color:hsl(220, 14%, 71%)">}</span><span class="token string-interpolation string" style="color:hsl(95, 38%, 62%)">\n\n"</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">                </span><span class="token string-interpolation string" style="color:hsl(95, 38%, 62%)">f"Return ONLY the improved prompt text."</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">            </span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">)</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">            stream</span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token boolean" style="color:hsl(29, 54%, 61%)">False</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">        </span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">        new_prompt </span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token plain"> response</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">output_text</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">strip</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">(</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px 
rgba(0, 0, 0, 0.3)"><span class="token plain">        self</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">client</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">prompts</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">update</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">(</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">            self</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">prompt_id</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"> prompt</span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token plain">new_prompt</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"> version</span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token plain">current</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">version</span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">        </span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">)</span><br></span></code></pre></div></div>
<p>The judge model does double duty — scoring answers <em>and</em> proposing improvements based on its own feedback. The Prompts API auto-increments versions on each <code>update()</code>, and the <code>version</code> parameter provides optimistic locking so concurrent experiments don't silently overwrite each other.</p>
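<p>Concretely, that means an update carrying a stale version should fail rather than win. A hedged illustration, reusing the <code>client</code> and <code>prompt_id</code> names from this post (the concrete error raised on a version conflict depends on the client library):</p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar"><code class="codeBlockLines_e6Vv"># Illustration of the optimistic-locking behavior described above.
current = client.prompts.retrieve(prompt_id)  # say this is version 3

# If another experiment updates the prompt first (bumping it to version 4),
# sending our stale version should be rejected instead of silently
# overwriting the newer prompt. improved_text is a placeholder here.
try:
    client.prompts.update(prompt_id, prompt=improved_text, version=current.version)
except Exception:  # the exact conflict error depends on the client library
    current = client.prompts.retrieve(prompt_id)  # re-read the latest, then retry
</code></pre></div></div>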
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="optimize">optimize<a href="https://llamastack.github.io/blog/building-agentic-flows-with-conversations-and-responses#optimize" class="hash-link" aria-label="Direct link to optimize" title="Direct link to optimize" translate="no">​</a></h3>
<p>For initial tuning (before the agent starts serving real queries), <code>optimize</code> runs the evaluate/improve cycle in a <code>for</code> loop:</p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-background-color:hsl(220, 13%, 18%);--prism-color:hsl(220, 14%, 71%)"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="background-color:hsl(220, 13%, 18%);color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token keyword" style="color:hsl(286, 60%, 67%)">class</span><span class="token plain"> </span><span class="token class-name" style="color:hsl(29, 54%, 61%)">ResearchAgent</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">    </span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">    </span><span class="token keyword" style="color:hsl(286, 60%, 67%)">def</span><span class="token plain"> </span><span class="token function" style="color:hsl(207, 82%, 66%)">optimize</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">(</span><span class="token plain">self</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"> max_iterations</span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token number" style="color:hsl(29, 54%, 61%)">5</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">)</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">        </span><span class="token triple-quoted-string string" style="color:hsl(95, 38%, 62%)">"""Run the evaluate/improve cycle for N iterations."""</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">        </span><span class="token keyword" style="color:hsl(286, 60%, 67%)">for</span><span class="token plain"> iteration </span><span class="token keyword" style="color:hsl(286, 60%, 67%)">in</span><span class="token plain"> </span><span class="token builtin" style="color:hsl(95, 38%, 62%)">range</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">(</span><span class="token plain">max_iterations</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">)</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">            self</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token 
plain">evaluate_self</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">(</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">            self</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">improve_self</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">(</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">)</span><br></span></code></pre></div></div>
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="running-it">Running It<a href="https://llamastack.github.io/blog/building-agentic-flows-with-conversations-and-responses#running-it" class="hash-link" aria-label="Direct link to Running It" title="Direct link to Running It" translate="no">​</a></h2>
<p>First, pull the models and start Ollama, then run the Llama Stack starter distribution pointing at it:</p>
<div class="language-bash codeBlockContainer_Ckt0 theme-code-block" style="--prism-background-color:hsl(220, 13%, 18%);--prism-color:hsl(220, 14%, 71%)"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-bash codeBlock_bY9V thin-scrollbar" style="background-color:hsl(220, 13%, 18%);color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">ollama pull llama3.1:8b</span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">ollama pull gpt-oss:20b</span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain"></span><span class="token assign-left variable" style="color:hsl(207, 82%, 66%)">OLLAMA_URL</span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token plain">http://localhost:11434/v1 uv run </span><span class="token parameter variable" style="color:hsl(207, 82%, 66%)">--with</span><span class="token plain"> llama-stack llama stack run starter</span><br></span></code></pre></div></div>
<p>The <code>OLLAMA_URL</code> environment variable tells the starter distribution to use Ollama as its inference provider. The server starts on <code>http://localhost:8321</code> by default.</p>
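<p>Before creating the agent, a quick smoke test confirms the server is reachable and the Ollama models are registered (exact identifiers may vary by distribution):</p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar"><code class="codeBlockLines_e6Vv">from llama_stack_client import LlamaStackClient

client = LlamaStackClient(base_url="http://localhost:8321")

# List the models the stack has registered; the two models pulled above
# should appear among them.
for model in client.models.list():
    print(model.identifier)
</code></pre></div></div>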
<p>Then create the agent with some engineering documents. Some docs are indexed in the vector store up front; others live in a local directory for the agent to discover and index on demand:</p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-background-color:hsl(220, 13%, 18%);--prism-color:hsl(220, 14%, 71%)"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="background-color:hsl(220, 13%, 18%);color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token keyword" style="color:hsl(286, 60%, 67%)">from</span><span class="token plain"> llama_stack_client </span><span class="token keyword" style="color:hsl(286, 60%, 67%)">import</span><span class="token plain"> LlamaStackClient</span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">client </span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token plain"> LlamaStackClient</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">(</span><span class="token plain">base_url</span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token string" style="color:hsl(95, 38%, 62%)">"http://localhost:8321"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain"></span><span class="token comment" style="color:hsl(220, 10%, 40%)"># Create the initial system prompt</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">initial </span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token plain"> client</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">prompts</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">create</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">(</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">    prompt</span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token string" style="color:hsl(95, 38%, 62%)">"You are a helpful assistant. 
Answer questions based on the provided context."</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain"></span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain"></span><span class="token comment" style="color:hsl(220, 10%, 40%)"># Create the self-improving research agent</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">agent </span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token plain"> ResearchAgent</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">from_files</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">(</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">    client</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">    model</span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token string" style="color:hsl(95, 38%, 62%)">"ollama/llama3.1:8b"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">    name</span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token string" style="color:hsl(95, 38%, 62%)">"engineering-kb"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">    file_paths</span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">[</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">        </span><span class="token string" style="color:hsl(95, 38%, 62%)">"docs/blog/building-agentic-flows/design/user_service_v2.md"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">        </span><span class="token string" style="color:hsl(95, 38%, 62%)">"docs/blog/building-agentic-flows/runbooks/deployment_rollback.md"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"></span><br></span><span class="token-line" 
style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">    </span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">]</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">    prompt_id</span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token plain">initial</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">prompt_id</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">    local_docs_dir</span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token string" style="color:hsl(95, 38%, 62%)">"docs/blog/building-agentic-flows/postmortems"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">    judge_model</span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token string" style="color:hsl(95, 38%, 62%)">"ollama/gpt-oss:20b"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">    ledger</span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token plain">ScoreLedger</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">(</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">)</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">    test_cases</span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">[</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">        </span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">{</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">            </span><span class="token string" style="color:hsl(95, 38%, 62%)">"question"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">:</span><span class="token plain"> </span><span class="token string" style="color:hsl(95, 38%, 62%)">"What is the deployment rollback procedure?"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">            </span><span class="token string" style="color:hsl(95, 38%, 62%)">"expected"</span><span class="token punctuation" 
style="color:hsl(220, 14%, 71%)">:</span><span class="token plain"> </span><span class="token string" style="color:hsl(95, 38%, 62%)">"Revert the Kubernetes deployment to the previous revision using kubectl rollout undo"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">        </span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">}</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">        </span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">{</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">            </span><span class="token string" style="color:hsl(95, 38%, 62%)">"question"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">:</span><span class="token plain"> </span><span class="token string" style="color:hsl(95, 38%, 62%)">"What authentication method does the user service use?"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">            </span><span class="token string" style="color:hsl(95, 38%, 62%)">"expected"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">:</span><span class="token plain"> </span><span class="token string" style="color:hsl(95, 38%, 62%)">"JWT tokens issued by the auth gateway with RS256 signing"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">        </span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">}</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">        </span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">{</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">            </span><span class="token string" style="color:hsl(95, 38%, 62%)">"question"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">:</span><span class="token plain"> </span><span class="token string" style="color:hsl(95, 38%, 62%)">"What was the root cause of the 2025-02 checkout outage?"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">            </span><span class="token string" style="color:hsl(95, 38%, 62%)">"expected"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">:</span><span class="token plain"> </span><span class="token string" 
style="color:hsl(95, 38%, 62%)">"Connection pool exhaustion in the payments service due to missing timeout configuration"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">        </span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">}</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">    </span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">]</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">    optimize_every</span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token number" style="color:hsl(29, 54%, 61%)">10</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain"></span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain"></span><span class="token comment" style="color:hsl(220, 10%, 40%)"># Run an initial optimization pass</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">agent</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">optimize</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">(</span><span class="token plain">max_iterations</span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token number" style="color:hsl(29, 54%, 61%)">5</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain"></span><span class="token comment" style="color:hsl(220, 10%, 40%)"># Show the best prompt</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">result </span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token plain"> agent</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">best_prompt</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">(</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">)</span><span 
class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain"></span><span class="token keyword" style="color:hsl(286, 60%, 67%)">print</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">(</span><span class="token string-interpolation string" style="color:hsl(95, 38%, 62%)">f"Best prompt (v</span><span class="token string-interpolation interpolation punctuation" style="color:hsl(220, 14%, 71%)">{</span><span class="token string-interpolation interpolation">result</span><span class="token string-interpolation interpolation punctuation" style="color:hsl(220, 14%, 71%)">[</span><span class="token string-interpolation interpolation string" style="color:hsl(95, 38%, 62%)">'version'</span><span class="token string-interpolation interpolation punctuation" style="color:hsl(220, 14%, 71%)">]</span><span class="token string-interpolation interpolation punctuation" style="color:hsl(220, 14%, 71%)">}</span><span class="token string-interpolation string" style="color:hsl(95, 38%, 62%)">, score=</span><span class="token string-interpolation interpolation punctuation" style="color:hsl(220, 14%, 71%)">{</span><span class="token string-interpolation interpolation">result</span><span class="token string-interpolation interpolation punctuation" style="color:hsl(220, 14%, 71%)">[</span><span class="token string-interpolation interpolation string" style="color:hsl(95, 38%, 62%)">'score'</span><span class="token string-interpolation interpolation punctuation" style="color:hsl(220, 14%, 71%)">]</span><span class="token string-interpolation interpolation punctuation" style="color:hsl(220, 14%, 71%)">:</span><span class="token string-interpolation interpolation format-spec">.2f</span><span class="token string-interpolation interpolation punctuation" style="color:hsl(220, 14%, 71%)">}</span><span class="token string-interpolation string" style="color:hsl(95, 38%, 62%)">):"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain"></span><span class="token keyword" style="color:hsl(286, 60%, 67%)">print</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">(</span><span class="token string-interpolation string" style="color:hsl(95, 38%, 62%)">f"  </span><span class="token string-interpolation interpolation punctuation" style="color:hsl(220, 14%, 71%)">{</span><span class="token string-interpolation interpolation">result</span><span class="token string-interpolation interpolation punctuation" style="color:hsl(220, 14%, 71%)">[</span><span class="token string-interpolation interpolation string" style="color:hsl(95, 38%, 62%)">'prompt'</span><span class="token string-interpolation interpolation punctuation" style="color:hsl(220, 14%, 71%)">]</span><span class="token string-interpolation interpolation punctuation" style="color:hsl(220, 14%, 71%)">}</span><span class="token string-interpolation string" style="color:hsl(95, 38%, 62%)">"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 
0, 0.3)"><span class="token plain"></span><span class="token comment" style="color:hsl(220, 10%, 40%)"># Normal usage — the agent self-improves every 10 research() calls</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">answer </span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token plain"> agent</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">.</span><span class="token plain">research</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">(</span><span class="token string" style="color:hsl(95, 38%, 62%)">"What is the deployment rollback procedure?"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain"></span><span class="token keyword" style="color:hsl(286, 60%, 67%)">print</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">(</span><span class="token string-interpolation string" style="color:hsl(95, 38%, 62%)">f"Agent says: </span><span class="token string-interpolation interpolation punctuation" style="color:hsl(220, 14%, 71%)">{</span><span class="token string-interpolation interpolation">answer</span><span class="token string-interpolation interpolation punctuation" style="color:hsl(220, 14%, 71%)">}</span><span class="token string-interpolation string" style="color:hsl(95, 38%, 62%)">"</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">)</span><br></span></code></pre></div></div>
<p>The full implementation with tool schema generation and all supporting code is available at <a href="https://llamastack.github.io/assets/files/self_improving_agent-fba87d1a8b252b7b662cae8f19ec48fe.py" target="_blank" class="">self_improving_agent.py</a>.</p>
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="how-it-works-under-the-hood">How It Works Under the Hood<a href="https://llamastack.github.io/blog/building-agentic-flows-with-conversations-and-responses#how-it-works-under-the-hood" class="hash-link" aria-label="Direct link to How It Works Under the Hood" title="Direct link to How It Works Under the Hood" translate="no">​</a></h2>
<p>The agent uses <em>both</em> kinds of Responses API tools for research:</p>
<ul>
<li class=""><strong>Server-side tools</strong> like <code>file_search</code> are executed automatically — the Responses API searches the vector store, retrieves relevant chunks, and feeds them to the model without any client code. This is what makes knowledge base search a single API call.</li>
<li class=""><strong>Client-side function tools</strong> (<code>read_local_file</code>, <code>index_document</code>, <code>list_local_files</code>) return tool call objects for you to execute. The <code>while True</code> loop dispatches these, and the results feed back into the next <code>responses.create()</code> call. This is what lets the agent actively curate its knowledge base.</li>
</ul>
<p>The agent combines both in a single loop: <code>file_search</code> results come back automatically within the response, while function calls need client-side execution. The model sees both sources of information and decides what to do next.</p>
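<p>Condensed into a sketch, that loop looks roughly like the method below. The names (<code>_run_agent_loop</code>, <code>execute_function_call</code>, <code>self.tools</code>) are illustrative stand-ins for the full implementation linked above:</p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar"><code class="codeBlockLines_e6Vv">def _run_agent_loop(self, question: str) -&gt; str:
    """Sketch of the combined tool loop; names are illustrative."""
    response = self.client.responses.create(
        model=self.model,
        instructions=self.prompt_text,
        input=question,
        tools=self.tools,  # file_search (server-side) + function tools (client-side)
    )
    while True:
        # file_search results are already folded into the response by the
        # server; only client-side function calls need dispatching here.
        calls = [item for item in response.output if item.type == "function_call"]
        if not calls:
            break
        results = [
            {
                "type": "function_call_output",
                "call_id": call.call_id,
                "output": self.execute_function_call(call),  # illustrative dispatcher
            }
            for call in calls
        ]
        # Feed the tool results back in, chaining off the previous response.
        response = self.client.responses.create(
            model=self.model,
            previous_response_id=response.id,
            input=results,
            tools=self.tools,
        )
    return response.output_text
</code></pre></div></div>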
<p>The self-improvement methods don't need any of this machinery. They call <code>responses.create()</code> directly for judging and prompt generation — no tool calling, no agentic loop. The Prompts API stores versioned prompt text with optimistic locking, and the SQLite ledger tracks how well each version performed. The <code>research()</code> counter ties it all together: the agent serves queries normally, and every N calls it pauses to evaluate and improve itself.</p>
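<p>The trigger itself can be as small as a counter check inside <code>research()</code>. A minimal sketch, again with illustrative attribute names:</p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar"><code class="codeBlockLines_e6Vv">def research(self, question: str) -&gt; str:
    """Answer a question, pausing to self-optimize every N calls (sketch)."""
    answer = self._run_agent_loop(question)  # the tool loop sketched above
    self._call_count += 1
    if self._call_count % self.optimize_every == 0:
        # Benchmark the current prompt against the test cases, have the
        # judge model propose a rewrite, and keep whichever version scores best.
        self.optimize(max_iterations=1)
    return answer
</code></pre></div></div>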
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="whats-next">What's Next<a href="https://llamastack.github.io/blog/building-agentic-flows-with-conversations-and-responses#whats-next" class="hash-link" aria-label="Direct link to What's Next" title="Direct link to What's Next" translate="no">​</a></h2>
<p>The pattern here — a self-improving agent that benchmarks and rewrites its own prompt — generalizes well beyond research assistants:</p>
<ul>
<li class=""><strong>MCP tools</strong> for connecting to external services (databases, APIs, code execution sandboxes) — the research agent could pull in live data alongside static documents</li>
<li class=""><strong>Web search</strong> alongside <code>file_search</code> for agents that combine local knowledge with live web results</li>
<li class=""><strong>Multiple research agents</strong> with different vector stores, each self-improving independently and specializing in a different knowledge domain</li>
</ul>
<p>To learn more:</p>
<ul>
<li class=""><a class="" href="https://llamastack.github.io/docs/building_applications/responses_vs_agents">Responses API documentation</a></li>
<li class=""><a class="" href="https://llamastack.github.io/docs/api-openai/conformance#conversations">Conversations API documentation</a></li>
<li class=""><a class="" href="https://llamastack.github.io/docs/api-openai">OpenAI API compatibility</a></li>
<li class=""><a class="" href="https://llamastack.github.io/docs/building_applications/rag">Vector Stores documentation</a></li>
<li class=""><a href="https://discord.gg/llama-stack" target="_blank" rel="noopener noreferrer" class="">Join our Discord</a></li>
</ul>]]></content>
        <author>
            <name>Raghotham Murthy</name>
            <uri>https://github.com/raghotham</uri>
        </author>
        <category label="agents" term="agents"/>
        <category label="responses-api" term="responses-api"/>
        <category label="conversations" term="conversations"/>
        <category label="prompts" term="prompts"/>
        <category label="tutorial" term="tutorial"/>
    </entry>
    <entry>
        <title type="html"><![CDATA[How to Get Started with Llama Stack]]></title>
        <id>https://llamastack.github.io/blog/how-to-get-started-with-llama-stack</id>
        <link href="https://llamastack.github.io/blog/how-to-get-started-with-llama-stack"/>
        <updated>2026-01-30T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[There is no shortage of GenAI hosted services like OpenAI, Gemini, and Bedrock.]]></summary>
        <content type="html"><![CDATA[<p>There is no shortage of GenAI hosted services like OpenAI, Gemini, and Bedrock.</p>
<p>Often, these services require tailoring your GenAI application directly to them, forcing developers to deal with concerns that have nothing to do with their applications. Llama Stack is an open source project that aims to standardize and offer a set of APIs for AI applications that stay the same regardless of which backend services sit behind them.</p>
<p>Llama Stack’s APIs support a variety of use cases, from running inference with Ollama on your laptop, to a self-managed GPU system running inference with vLLM, to a pure SaaS solution like Vertex AI. Each standardized API has providers that implement the same REST interface. An admin of the stack can specify which provider to use for each API and expose the REST API to users, who get the same frontend experience regardless of the provider. This lets you run a single API surface layer using whatever Inference, Vector IO, or other solutions you want while keeping your GenAI applications simple.</p>
<p>A Llama Stack is defined by its <code>config.yaml</code> file, which holds key information such as which APIs to expose, which providers to initialize for those APIs, and their configuration. Llama Stack also features a CLI for launching and managing servers, either locally on your machine or in a container!</p>
<p>Here is a sample portion of a <code>config.yaml</code>:</p>
<div class="language-yaml codeBlockContainer_Ckt0 theme-code-block" style="--prism-background-color:hsl(220, 13%, 18%);--prism-color:hsl(220, 14%, 71%)"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-yaml codeBlock_bY9V thin-scrollbar" style="background-color:hsl(220, 13%, 18%);color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token key atrule" style="color:hsl(29, 54%, 61%)">version</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">:</span><span class="token plain"> </span><span class="token number" style="color:hsl(29, 54%, 61%)">2</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain"></span><span class="token key atrule" style="color:hsl(29, 54%, 61%)">distro_name</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">:</span><span class="token plain"> starter</span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain"></span><span class="token key atrule" style="color:hsl(29, 54%, 61%)">apis</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain"></span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">-</span><span class="token plain"> agents</span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain"></span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">-</span><span class="token plain"> batches</span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain"></span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">-</span><span class="token plain"> datasetio</span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain"></span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">-</span><span class="token plain"> eval</span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain"></span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">-</span><span class="token plain"> files</span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain"></span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">-</span><span class="token plain"> inference</span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain"></span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">-</span><span class="token plain"> post_training</span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain"></span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">-</span><span class="token plain"> safety</span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px 
rgba(0, 0, 0, 0.3)"><span class="token plain"></span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">-</span><span class="token plain"> scoring</span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain"></span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">-</span><span class="token plain"> tool_runtime</span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain"></span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">-</span><span class="token plain"> vector_io</span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain"></span><span class="token key atrule" style="color:hsl(29, 54%, 61%)">providers</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain"></span><span class="token key atrule" style="color:hsl(29, 54%, 61%)">inference</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain"></span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">-</span><span class="token plain"> </span><span class="token key atrule" style="color:hsl(29, 54%, 61%)">provider_id</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">:</span><span class="token plain"> $</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">{</span><span class="token plain">env.OLLAMA_URL</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">:</span><span class="token plain">+ollama</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">}</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain"></span><span class="token key atrule" style="color:hsl(29, 54%, 61%)">provider_type</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">:</span><span class="token plain"> remote</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">:</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">:</span><span class="token plain">ollama</span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain"></span><span class="token key atrule" style="color:hsl(29, 54%, 61%)">config</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain"></span><span class="token key atrule" style="color:hsl(29, 54%, 61%)">base_url</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">:</span><span class="token plain"> $</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">{</span><span class="token plain">env.OLLAMA_URL</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">:</span><span class="token plain">=http</span><span class="token punctuation" style="color:hsl(220, 
14%, 71%)">:</span><span class="token plain">//localhost</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">:</span><span class="token plain">11434/v1</span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">}</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain"></span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">...</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain" style="display:inline-block"></span><br></span></code></pre></div></div>
<p>The current set of Llama Stack APIs can be found here: <a href="https://llamastack.github.io/docs/api-overview" target="_blank" rel="noopener noreferrer" class="">https://llamastack.github.io/docs/api-overview</a></p>
<p>All of the APIs listed in the <code>config.yaml</code> that defines the stack will be available via a REST API, backed by their initialized providers. Each API can have one or more providers, and the <code>provider_id</code> can be specified at request time.</p>
<p>To get started quickly, all you need is Llama Stack, Ollama, and your favorite inference model! For this example, we are using <code>gpt-oss:20b</code>.</p>
<p>If you already have Ollama installed as a service, you can simply pull the model:</p>
<div class="language-bash codeBlockContainer_Ckt0 theme-code-block" style="--prism-background-color:hsl(220, 13%, 18%);--prism-color:hsl(220, 14%, 71%)"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-bash codeBlock_bY9V thin-scrollbar" style="background-color:hsl(220, 13%, 18%);color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">ollama pull gpt-oss:20b</span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">uv run </span><span class="token parameter variable" style="color:hsl(207, 82%, 66%)">--with</span><span class="token plain"> llama-stack llama stack list-deps </span><span class="token parameter variable" style="color:hsl(207, 82%, 66%)">--providers</span><span class="token plain"> </span><span class="token assign-left variable" style="color:hsl(207, 82%, 66%)">inference</span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token plain">remote::ollama </span><span class="token parameter variable" style="color:hsl(207, 82%, 66%)">--format</span><span class="token plain"> uv </span><span class="token operator" style="color:hsl(207, 82%, 66%)">|</span><span class="token plain"> </span><span class="token function" style="color:hsl(207, 82%, 66%)">sh</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">uv run </span><span class="token parameter variable" style="color:hsl(207, 82%, 66%)">--with</span><span class="token plain"> llama-stack llama stack run </span><span class="token parameter variable" style="color:hsl(207, 82%, 66%)">--providers</span><span class="token plain"> </span><span class="token assign-left variable" style="color:hsl(207, 82%, 66%)">inference</span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token plain">remote::ollama</span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain" style="display:inline-block"></span><br></span></code></pre></div></div>
<p>If you don't have Ollama running as a service, you can start it manually:</p>
<div class="language-bash codeBlockContainer_Ckt0 theme-code-block" style="--prism-background-color:hsl(220, 13%, 18%);--prism-color:hsl(220, 14%, 71%)"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-bash codeBlock_bY9V thin-scrollbar" style="background-color:hsl(220, 13%, 18%);color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">ollama serve </span><span class="token operator" style="color:hsl(207, 82%, 66%)">&gt;</span><span class="token plain"> /dev/null </span><span class="token operator file-descriptor important" style="color:hsl(220, 14%, 71%);font-weight:bold">2</span><span class="token operator" style="color:hsl(207, 82%, 66%)">&gt;</span><span class="token file-descriptor important" style="color:hsl(220, 14%, 71%);font-weight:bold">&amp;1</span><span class="token plain"> </span><span class="token operator" style="color:hsl(207, 82%, 66%)">&amp;</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">ollama run gpt-oss:20b </span><span class="token parameter variable" style="color:hsl(207, 82%, 66%)">--keepalive</span><span class="token plain"> 60m </span><span class="token comment" style="color:hsl(220, 10%, 40%)"># you can exit this once the model is running due to --keepalive</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">uv run </span><span class="token parameter variable" style="color:hsl(207, 82%, 66%)">--with</span><span class="token plain"> llama-stack llama stack </span><span class="token parameter variable" style="color:hsl(207, 82%, 66%)">--providers</span><span class="token plain"> </span><span class="token assign-left variable" style="color:hsl(207, 82%, 66%)">inference</span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token plain">remote::ollama </span><span class="token parameter variable" style="color:hsl(207, 82%, 66%)">--format</span><span class="token plain"> uv </span><span class="token operator" style="color:hsl(207, 82%, 66%)">|</span><span class="token plain"> </span><span class="token function" style="color:hsl(207, 82%, 66%)">sh</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">uv run </span><span class="token parameter variable" style="color:hsl(207, 82%, 66%)">--with</span><span class="token plain"> llama-stack llama stack run </span><span class="token parameter variable" style="color:hsl(207, 82%, 66%)">--providers</span><span class="token plain"> </span><span class="token assign-left variable" style="color:hsl(207, 82%, 66%)">inference</span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token plain">remote::ollama</span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain" style="display:inline-block"></span><br></span></code></pre></div></div>
<p>Now you have Ollama running with <code>gpt-oss:20b</code> and a Llama Stack server pointing to Ollama as the inference provider. This minimal setup is sufficient to connect to local Ollama and respond to <code>/v1/chat/completions</code> requests.</p>
<p>For a more feature-rich setup, you can use the starter distribution which gives you a full stack with additional APIs and providers:</p>
<div class="language-bash codeBlockContainer_Ckt0 theme-code-block" style="--prism-background-color:hsl(220, 13%, 18%);--prism-color:hsl(220, 14%, 71%)"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-bash codeBlock_bY9V thin-scrollbar" style="background-color:hsl(220, 13%, 18%);color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">ollama serve </span><span class="token operator" style="color:hsl(207, 82%, 66%)">&gt;</span><span class="token plain"> /dev/null </span><span class="token operator file-descriptor important" style="color:hsl(220, 14%, 71%);font-weight:bold">2</span><span class="token operator" style="color:hsl(207, 82%, 66%)">&gt;</span><span class="token file-descriptor important" style="color:hsl(220, 14%, 71%);font-weight:bold">&amp;1</span><span class="token plain"> </span><span class="token operator" style="color:hsl(207, 82%, 66%)">&amp;</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">ollama run gpt-oss:20b </span><span class="token parameter variable" style="color:hsl(207, 82%, 66%)">--keepalive</span><span class="token plain"> 60m </span><span class="token comment" style="color:hsl(220, 10%, 40%)"># you can exit this once the model is running due to --keepalive</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">uv run </span><span class="token parameter variable" style="color:hsl(207, 82%, 66%)">--with</span><span class="token plain"> llama-stack llama stack list-deps starter </span><span class="token parameter variable" style="color:hsl(207, 82%, 66%)">--format</span><span class="token plain"> uv </span><span class="token operator" style="color:hsl(207, 82%, 66%)">|</span><span class="token plain"> </span><span class="token function" style="color:hsl(207, 82%, 66%)">sh</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain"></span><span class="token builtin class-name" style="color:hsl(29, 54%, 61%)">export</span><span class="token plain"> </span><span class="token assign-left variable" style="color:hsl(207, 82%, 66%)">OLLAMA_URL</span><span class="token operator" style="color:hsl(207, 82%, 66%)">=</span><span class="token plain">http://localhost:11434/v1</span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain">uv run </span><span class="token parameter variable" style="color:hsl(207, 82%, 66%)">--with</span><span class="token plain"> llama-stack llama stack run starter</span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain" style="display:inline-block"></span><br></span></code></pre></div></div>
<p>A sample chat completion request would look like this:</p>
<div class="language-bash codeBlockContainer_Ckt0 theme-code-block" style="--prism-background-color:hsl(220, 13%, 18%);--prism-color:hsl(220, 14%, 71%)"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-bash codeBlock_bY9V thin-scrollbar" style="background-color:hsl(220, 13%, 18%);color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token function" style="color:hsl(207, 82%, 66%)">curl</span><span class="token plain"> </span><span class="token parameter variable" style="color:hsl(207, 82%, 66%)">-X</span><span class="token plain"> POST http://localhost:8321/v1/chat/completions </span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">\</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain"></span><span class="token parameter variable" style="color:hsl(207, 82%, 66%)">-H</span><span class="token plain"> </span><span class="token string" style="color:hsl(95, 38%, 62%)">"Content-Type: application/json"</span><span class="token plain"> </span><span class="token punctuation" style="color:hsl(220, 14%, 71%)">\</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain"></span><span class="token parameter variable" style="color:hsl(207, 82%, 66%)">-d</span><span class="token plain"> </span><span class="token string" style="color:hsl(95, 38%, 62%)">'{</span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token string" style="color:hsl(95, 38%, 62%)">"model": "ollama/gpt-oss:20b",</span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token string" style="color:hsl(95, 38%, 62%)">"messages": [{"role": "user", "content": "Hello!"}]</span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token string" style="color:hsl(95, 38%, 62%)">}'</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(220, 14%, 71%);text-shadow:0 1px rgba(0, 0, 0, 0.3)"><span class="token plain" style="display:inline-block"></span><br></span></code></pre></div></div>
<p>Notice that the model name must be prefixed with the <code>provider_id</code> in order for the request to route properly! In this example, we are only using the <code>/chat/completions</code> route of the Inference API. The starter distribution has a large number of APIs and ready-to-use providers baked in. Example API requests, similar to the one above, for the other APIs can be found in the <a href="https://llamastack.github.io/docs/api/llama-stack-specification" target="_blank" rel="noopener noreferrer" class="">Llama Stack API specification</a>. Take it for a spin and see what you can do with Llama Stack!</p>
        <author>
            <name>Charlie Doern</name>
            <uri>https://github.com/cdoern</uri>
        </author>
        <author>
            <name>Nathan Weinberg</name>
            <uri>https://github.com/nathan-weinberg</uri>
        </author>
        <author>
            <name>Llama Stack Team</name>
            <uri>https://github.com/llamastack</uri>
        </author>
        <category label="introduction" term="introduction"/>
        <category label="how-to" term="how-to"/>
    </entry>
    <entry>
        <title type="html"><![CDATA[Introducing Llama Stack - The Open-Source Platform for Building AI Applications]]></title>
        <id>https://llamastack.github.io/blog/introducing-llama-stack</id>
        <link href="https://llamastack.github.io/blog/introducing-llama-stack"/>
        <updated>2026-01-22T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[Welcome to our blog!]]></summary>
        <content type="html"><![CDATA[<p>Welcome to our blog!</p>
<p>We're excited to introduce you to <strong>Llama Stack</strong> - the open-source platform that simplifies building production-ready generative AI applications.</p>
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="what-is-llama-stack">What is Llama Stack?<a href="https://llamastack.github.io/blog/introducing-llama-stack#what-is-llama-stack" class="hash-link" aria-label="Direct link to What is Llama Stack?" title="Direct link to What is Llama Stack?" translate="no">​</a></h2>
<p>Llama Stack defines and standardizes the core building blocks needed to bring generative AI applications to market, centered on the <a href="https://www.openresponses.org/" target="_blank" rel="noopener noreferrer" class="">Open Responses specification</a>. By aligning with OpenAI’s open-sourced Responses API, Llama Stack provides a consistent, interoperable foundation for building agentic and generative systems. It offers a growing suite of open-source APIs—including prompts, conversations, files, models, embeddings, fine-tuning, and MCP—enabling seamless transitions from local development to production across providers and environments.</p>
<p>Think of Llama Stack as a universal interface that abstracts away the complexity of working with different AI tools and providers (e.g., vector databases, model inference providers, and deployment environments). Whether you're building locally, deploying on-premises, or scaling in the cloud, Llama Stack provides a consistent developer experience.</p>
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="key-features">Key Features<a href="https://llamastack.github.io/blog/introducing-llama-stack#key-features" class="hash-link" aria-label="Direct link to Key Features" title="Direct link to Key Features" translate="no">​</a></h2>
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="unified-api-layer">Unified API Layer<a href="https://llamastack.github.io/blog/introducing-llama-stack#unified-api-layer" class="hash-link" aria-label="Direct link to Unified API Layer" title="Direct link to Unified API Layer" translate="no">​</a></h3>
<p>Llama Stack provides standardized APIs across five core capabilities:</p>
<ul>
<li class=""><strong>Inference</strong>: Run models locally or in the cloud with a consistent interface</li>
<li class=""><strong>Vector Stores</strong>: Build knowledge and agentic retrieval systems</li>
<li class=""><strong>Agents</strong>: Create intelligent agent flows with responses/conversations</li>
<li class=""><strong>Tools and MCP</strong>: Integrate with external tools and services directly or via MCP</li>
<li class=""><strong>Moderations</strong>: Built-in safety guardrails and content filtering via moderations api</li>
</ul>
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="plugin-architecture">Plugin Architecture<a href="https://llamastack.github.io/blog/introducing-llama-stack#plugin-architecture" class="hash-link" aria-label="Direct link to Plugin Architecture" title="Direct link to Plugin Architecture" translate="no">​</a></h3>
<p>The plugin architecture supports a rich ecosystem of API implementations across different environments:</p>
<ul>
<li class=""><strong>Local Development</strong>: Start with CPU-only setups for rapid iteration</li>
<li class=""><strong>On-Premises</strong>: Deploy in your own infrastructure</li>
<li class=""><strong>Cloud</strong>: Scale with hosted providers</li>
</ul>
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="prepackaged-distributions">Prepackaged Distributions<a href="https://llamastack.github.io/blog/introducing-llama-stack#prepackaged-distributions" class="hash-link" aria-label="Direct link to Prepackaged Distributions" title="Direct link to Prepackaged Distributions" translate="no">​</a></h3>
<p>Distributions are pre-configured bundles of provider implementations that make it easy to get started. You can begin with a local setup using Ollama and seamlessly transition to production with vLLM - all without changing your application code.</p>
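<p>Concretely, the application code stays identical across environments; only the stack's <code>config.yaml</code> changes. A hypothetical sketch, assuming the <code>openai</code> package pointed at a stack on the default port (model ids are illustrative):</p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar"><code class="codeBlockLines_e6Vv">from openai import OpenAI

# The client targets the stack, never a specific backend, so swapping
# Ollama for vLLM is a server-side configuration change.
client = OpenAI(base_url="http://localhost:8321/v1", api_key="none")

response = client.chat.completions.create(
    model="ollama/llama3.1:8b",   # dev distribution (Ollama provider)
    # model="vllm/llama3.1:8b",   # prod distribution (vLLM provider, illustrative id)
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
</code></pre></div></div>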
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="multiple-developer-interfaces">Multiple Developer Interfaces<a href="https://llamastack.github.io/blog/introducing-llama-stack#multiple-developer-interfaces" class="hash-link" aria-label="Direct link to Multiple Developer Interfaces" title="Direct link to Multiple Developer Interfaces" translate="no">​</a></h3>
<p>Llama Stack supports various developer interfaces:</p>
<ul>
<li class=""><strong>CLI</strong>: Command-line tools for server management</li>
<li class=""><strong>Python SDK</strong>: <a href="https://github.com/meta-llama/llama-stack-client-python" target="_blank" rel="noopener noreferrer" class=""><code>llama-stack-client-python</code></a></li>
<li class=""><strong>TypeScript SDK</strong>: <a href="https://github.com/meta-llama/llama-stack-client-typescript" target="_blank" rel="noopener noreferrer" class=""><code>llama-stack-client-typescript</code></a></li>
</ul>
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="why-llama-stack">Why Llama Stack?<a href="https://llamastack.github.io/blog/introducing-llama-stack#why-llama-stack" class="hash-link" aria-label="Direct link to Why Llama Stack?" title="Direct link to Why Llama Stack?" translate="no">​</a></h2>
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="flexibility-without-compromise">Flexibility Without Compromise<a href="https://llamastack.github.io/blog/introducing-llama-stack#flexibility-without-compromise" class="hash-link" aria-label="Direct link to Flexibility Without Compromise" title="Direct link to Flexibility Without Compromise" translate="no">​</a></h3>
<p>Developers can choose their preferred infrastructure without changing APIs. This means you can:</p>
<ul>
<li class="">Start locally for development</li>
<li class="">Test with different providers</li>
<li class="">Deploy to production with your chosen infrastructure</li>
<li class="">Switch providers as your needs evolve</li>
</ul>
<p>All while maintaining the same codebase and APIs.</p>
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="consistent-experience">Consistent Experience<a href="https://llamastack.github.io/blog/introducing-llama-stack#consistent-experience" class="hash-link" aria-label="Direct link to Consistent Experience" title="Direct link to Consistent Experience" translate="no">​</a></h3>
<p>With unified APIs, Llama Stack makes it easier to:</p>
<ul>
<li class="">Build applications with consistent behavior</li>
<li class="">Test across different environments</li>
<li class="">Deploy with confidence</li>
<li class="">Maintain and update your codebase</li>
</ul>
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="robust-ecosystem">Robust Ecosystem<a href="https://llamastack.github.io/blog/introducing-llama-stack#robust-ecosystem" class="hash-link" aria-label="Direct link to Robust Ecosystem" title="Direct link to Robust Ecosystem" translate="no">​</a></h3>
<p>Llama Stack integrates with distribution partners including:</p>
<ul>
<li class=""><strong>Cloud Providers</strong>: AWS Bedrock, Together, Fireworks, and more</li>
<li class=""><strong>Hardware Vendors</strong>: NVIDIA, Cerebras, SambaNova</li>
<li class=""><strong>Vector Databases</strong>: ChromaDB, Milvus, Qdrant, Weaviate, PostgreSQL, ElasticSearch</li>
<li class=""><strong>AI Companies</strong>: OpenAI, Anthropic, Google Gemini</li>
</ul>
<p>For a complete list, check out our <a class="" href="https://llamastack.github.io/docs/providers">Providers Documentation</a>.</p>
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="how-it-works">How It Works<a href="https://llamastack.github.io/blog/introducing-llama-stack#how-it-works" class="hash-link" aria-label="Direct link to How It Works" title="Direct link to How It Works" translate="no">​</a></h2>
<p>Llama Stack consists of two main components:</p>
<ol>
<li class=""><strong>Server</strong>: A server with pluggable API providers that can run in various environments</li>
<li class=""><strong>Client SDKs</strong>: Libraries for your applications to interact with the server</li>
</ol>
<p>The server handles all the complexity of managing different providers, while the client SDKs provide a simple, consistent interface for your application code.</p>
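<p>As a minimal sketch of the client side, assuming the <code>llama-stack-client</code> Python SDK and a server on the default port:</p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar"><code class="codeBlockLines_e6Vv">from llama_stack_client import LlamaStackClient

# Point the SDK at a running Llama Stack server.
client = LlamaStackClient(base_url="http://localhost:8321")

# List the models exposed by the configured providers.
for model in client.models.list():
    print(model.identifier)
</code></pre></div></div>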
<p>Refer to the <a href="https://llamastack.github.io/docs/getting_started/quickstart" target="_blank" rel="noopener noreferrer" class="">Quick Start Guide</a> to get started building your first AI application with Llama Stack.</p>
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="whats-next">What's Next?<a href="https://llamastack.github.io/blog/introducing-llama-stack#whats-next" class="hash-link" aria-label="Direct link to What's Next?" title="Direct link to What's Next?" translate="no">​</a></h2>
<p>See the <a href="https://docs.google.com/document/d/1it-OsGFgAIwAUctQRQ-j1CBxFHhvSm530YR67eYGW1I/edit?tab=t.4uf22mux1a94" target="_blank" rel="noopener noreferrer" class="">Llama Stack Office Hours Content Calendar</a> for upcoming topics and the blog roadmap.</p>
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="join-the-community">Join the Community<a href="https://llamastack.github.io/blog/introducing-llama-stack#join-the-community" class="hash-link" aria-label="Direct link to Join the Community" title="Direct link to Join the Community" translate="no">​</a></h2>
<p>We'd love to have you join our growing community:</p>
<ul>
<li class=""><a href="https://github.com/llamastack/llama-stack" target="_blank" rel="noopener noreferrer" class="">Star us on GitHub</a></li>
<li class=""><a href="https://discord.gg/llama-stack" target="_blank" rel="noopener noreferrer" class="">Join our Discord</a></li>
<li class=""><a class="" href="https://llamastack.github.io/docs">Read the Documentation</a></li>
<li class=""><a href="https://github.com/llamastack/llama-stack/issues" target="_blank" rel="noopener noreferrer" class="">Report Issues</a></li>
</ul>
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="conclusion">Conclusion<a href="https://llamastack.github.io/blog/introducing-llama-stack#conclusion" class="hash-link" aria-label="Direct link to Conclusion" title="Direct link to Conclusion" translate="no">​</a></h2>
<p>Llama Stack is designed to make building AI applications simpler, more flexible, and more maintainable. By providing unified APIs and a rich ecosystem of providers, we're enabling developers to focus on what matters most - building great applications.</p>
<p>Whether you're just getting started with AI or building production systems at scale, Llama Stack has something to offer. We're excited to see what you'll build!</p>]]></content>
        <author>
            <name>Llama Stack Team</name>
            <uri>https://github.com/llamastack</uri>
        </author>
        <category label="announcement" term="announcement"/>
        <category label="introduction" term="introduction"/>
        <category label="getting-started" term="getting-started"/>
    </entry>
</feed>