Detailed Tutorial

Beyond the Quickstart

This tutorial assumes you've completed the Quickstart and have a running Llama Stack server. We'll build progressively more complex applications using the standard OpenAI SDK.

Chat Completions

Start with the most basic usage, a simple chat conversation:

chat.py
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8321/v1", api_key="fake")

# List available models
models = client.models.list()
for m in models.data:
    print(f" {m.id} ({m.object})")

# Simple chat
response = client.chat.completions.create(
    model="ollama/llama3.2:3b",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Write a haiku about open source"},
    ],
)
print(response.choices[0].message.content)

Streaming

Stream responses token by token for a more interactive experience:

stream.py
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8321/v1", api_key="fake")

stream = client.chat.completions.create(
    model="ollama/llama3.2:3b",
    messages=[{"role": "user", "content": "Explain RAG in 3 sentences"}],
    stream=True,
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()
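If you also want the full reply after streaming it, accumulate the deltas as you print them. A small helper that works with the stream object above (pure Python, no extra API calls):

```python
def collect_stream(stream) -> str:
    """Print deltas as they arrive and return the concatenated text."""
    parts = []
    for chunk in stream:
        delta = chunk.choices[0].delta.content
        if delta:
            print(delta, end="", flush=True)
            parts.append(delta)
    print()
    return "".join(parts)
```

Replace the for loop above with `full = collect_stream(stream)` to keep the complete text around for logging or follow-up calls.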

Responses API with Tool Calling

The Responses API provides server-side orchestration. The model decides which tools to call, Llama Stack executes them, and feeds results back automatically:

tools.py
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8321/v1", api_key="fake")

response = client.responses.create(
    model="ollama/llama3.2:3b",
    input="What is the weather like in San Francisco?",
    tools=[{
        "type": "function",
        "name": "get_weather",
        "description": "Get the current weather for a location",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {"type": "string", "description": "City name"},
            },
            "required": ["location"],
        },
    }],
)

# The model will request a tool call - check the output
for item in response.output:
    print(item)
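The example above stops at the tool-call request. To complete the loop, your application executes the function itself and sends the result back as a function_call_output item in a follow-up request; the model then produces the final answer. A sketch of that round trip, assuming the same client and tool list as above (the get_weather body here is a stand-in, not a real weather lookup):

```python
import json

# Stand-in implementation; swap in a real weather service.
def get_weather(location: str) -> dict:
    return {"location": location, "temperature_c": 18, "conditions": "foggy"}

def answer_with_tool_results(client, response, tools, model="ollama/llama3.2:3b"):
    """Execute any function calls in `response` and return the model's final text."""
    for item in response.output:
        if item.type == "function_call":
            args = json.loads(item.arguments)
            result = get_weather(**args)  # dispatch on item.name in real code
            followup = client.responses.create(
                model=model,
                previous_response_id=response.id,
                input=[{
                    "type": "function_call_output",
                    "call_id": item.call_id,
                    "output": json.dumps(result),
                }],
                tools=tools,
            )
            return followup.output_text
    return response.output_text  # no tool call was requested
```

Passing previous_response_id means you only send the tool result, not the whole conversation, on the second request.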

RAG with Vector Stores

Upload documents, create a vector store, and ask questions. Llama Stack handles chunking, embedding, and retrieval:

rag.py
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8321/v1", api_key="fake")

# Upload a document
file = client.files.create(
    file=open("my-document.pdf", "rb"),
    purpose="assistants",
)
print(f"Uploaded: {file.id}")

# Create a vector store and index the file
vector_store = client.vector_stores.create(
    name="my-docs",
    file_ids=[file.id],
)
print(f"Vector store: {vector_store.id}")

# Ask questions with file search
response = client.responses.create(
    model="ollama/llama3.2:3b",
    input="What are the key points in this document?",
    tools=[{
        "type": "file_search",
        "vector_store_ids": [vector_store.id],
    }],
)
print(response.output_text)
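Indexing happens asynchronously after the store is created, so a query issued immediately may miss the file. A small polling helper, assuming the store object exposes the standard file_counts fields (completed, failed, total):

```python
import time

def wait_until_indexed(client, vector_store_id: str, timeout_s: float = 60.0):
    """Poll until every file in the vector store has been processed."""
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        vs = client.vector_stores.retrieve(vector_store_id)
        counts = vs.file_counts
        if counts.total > 0 and counts.completed + counts.failed >= counts.total:
            return vs
        time.sleep(1.0)
    raise TimeoutError(f"vector store {vector_store_id} not ready after {timeout_s}s")
```

Call `wait_until_indexed(client, vector_store.id)` between creating the store and issuing the file_search query.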

Multi-turn Conversations

Use previous_response_id to build multi-turn conversations without managing message history:

conversation.py
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8321/v1", api_key="fake")

# First turn
r1 = client.responses.create(
    model="ollama/llama3.2:3b",
    input="My name is Alice and I'm building a RAG app",
)
print("Assistant:", r1.output_text)

# Second turn - references the first
r2 = client.responses.create(
    model="ollama/llama3.2:3b",
    input="What did I say my name was?",
    previous_response_id=r1.id,
)
print("Assistant:", r2.output_text)
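The same pattern generalizes to any number of turns: keep passing the latest response id back in, and the server carries the history for you. A small wrapper sketching that:

```python
def next_turn(client, user_text, last_id=None, model="ollama/llama3.2:3b"):
    """Send one conversation turn; returns (reply_text, response_id)."""
    r = client.responses.create(
        model=model,
        input=user_text,
        previous_response_id=last_id,
    )
    return r.output_text, r.id

# Thread the id through successive calls:
#   text, rid = next_turn(client, "My name is Alice")
#   text, rid = next_turn(client, "What did I say my name was?", last_id=rid)
```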

MCP Tools

Connect to any MCP server and use its tools:

mcp.py
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8321/v1", api_key="fake")

response = client.responses.create(
    model="ollama/llama3.2:3b",
    input="List files in the current directory",
    tools=[{
        "type": "mcp",
        "server_label": "filesystem",
        "server_url": "http://localhost:3000/sse",
    }],
)
print(response.output_text)
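Beyond the final text, the response's output items record what happened against the MCP server. A helper that summarizes them, assuming the OpenAI-style item types mcp_list_tools and mcp_call (check your server's output if these differ):

```python
def summarize_mcp_activity(response):
    """Return one line per MCP-related output item."""
    lines = []
    for item in response.output:
        if item.type == "mcp_list_tools":
            lines.append(f"listed {len(item.tools)} tools from {item.server_label}")
        elif item.type == "mcp_call":
            lines.append(f"called {item.name}")
    return lines
```

This is handy when debugging why the model picked (or skipped) a particular MCP tool.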

Switching Providers

The same code works with any backend. Just change the server config:

export OLLAMA_URL=http://localhost:11434/v1
uv run llama stack run starter

Running as a Library

You can also use Llama Stack without running a server, directly in your Python process:

library.py
from llama_stack.core.library_client import LlamaStackAsLibraryClient

client = LlamaStackAsLibraryClient("starter")
client.initialize()

# Use the same OpenAI-compatible interface
response = client.responses.create(
    model="ollama/llama3.2:3b",
    input="Hello from library mode!",
)
print(response.output_text)

Next Steps