Detailed Tutorial
Beyond the Quickstart
This tutorial assumes you've completed the Quickstart and have a running Llama Stack server. We'll build progressively more complex applications using the standard OpenAI SDK.
Chat Completions
The most basic usage is a simple chat conversation:
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8321/v1", api_key="fake")
# List available models
models = client.models.list()
for m in models.data:
    print(f" {m.id} ({m.object})")
# Simple chat
response = client.chat.completions.create(
    model="ollama/llama3.2:3b",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Write a haiku about open source"},
    ],
)
print(response.choices[0].message.content)
Streaming
Stream responses token by token for a more interactive experience:
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8321/v1", api_key="fake")
stream = client.chat.completions.create(
    model="ollama/llama3.2:3b",
    messages=[{"role": "user", "content": "Explain RAG in 3 sentences"}],
    stream=True,
)
for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()
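When the streamed text feeds into a later step (saving the reply, appending it to message history), it helps to accumulate the deltas into one string. A minimal sketch, assuming chunks have the shape shown above; collect_stream is a hypothetical helper, not part of the SDK:

```python
def collect_stream(stream):
    """Join the content deltas of a chat-completions stream into one string."""
    parts = []
    for chunk in stream:
        delta = chunk.choices[0].delta.content
        if delta:  # some chunks (e.g. the final one) carry no content
            parts.append(delta)
    return "".join(parts)
```

You can still print each delta as it arrives inside the loop if you want both live output and the final string.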
Responses API with Tool Calling
The Responses API provides server-side orchestration. For built-in tools (such as file search or MCP), the model decides which tools to call, and Llama Stack executes them and feeds results back automatically. For custom function tools like the one below, the model instead returns a tool-call request in the output for your code to execute:
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8321/v1", api_key="fake")
response = client.responses.create(
    model="ollama/llama3.2:3b",
    input="What is the weather like in San Francisco?",
    tools=[{
        "type": "function",
        "name": "get_weather",
        "description": "Get the current weather for a location",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {"type": "string", "description": "City name"},
            },
            "required": ["location"],
        },
    }],
)
# The model will request a tool call - check the output
for item in response.output:
    print(item)
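A function-call output item carries the tool name and a JSON-encoded arguments string, which you can route to a local Python function. A minimal sketch; dispatch_function_call and the handlers dict are illustrative helpers, not SDK APIs, and the item shape is assumed from the OpenAI-compatible Responses format:

```python
import json

def dispatch_function_call(item, handlers):
    """Run the local handler matching a function_call output item."""
    args = json.loads(item.arguments)  # arguments arrive as a JSON string
    return handlers[item.name](**args)

# Example: wire the get_weather tool declared above to a stub implementation.
handlers = {"get_weather": lambda location: f"Sunny and 18C in {location}"}
```

In a full loop you would send the handler's result back to the model (for example in a follow-up responses.create call) so it can produce a final answer.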
RAG with Vector Stores
Upload documents, create a vector store, and ask questions. Llama Stack handles chunking, embedding, and retrieval:
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8321/v1", api_key="fake")
# Upload a document
file = client.files.create(
    file=open("my-document.pdf", "rb"),
    purpose="assistants",
)
print(f"Uploaded: {file.id}")
# Create a vector store and index the file
vector_store = client.vector_stores.create(
    name="my-docs",
    file_ids=[file.id],
)
print(f"Vector store: {vector_store.id}")
# Ask questions with file search
response = client.responses.create(
    model="ollama/llama3.2:3b",
    input="What are the key points in this document?",
    tools=[{
        "type": "file_search",
        "vector_store_ids": [vector_store.id],
    }],
)
print(response.output_text)
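File indexing runs asynchronously after the vector store is created, so a query issued immediately can miss a freshly uploaded document. A minimal polling sketch; wait_for_indexing is a hypothetical helper, and you would adapt list_statuses to however your server reports per-file status (for example by listing the vector store's files):

```python
import time

def wait_for_indexing(list_statuses, timeout=60, interval=2):
    """Block until every file status is 'completed', or fail on error/timeout.

    list_statuses: a callable returning the current list of file status strings.
    """
    deadline = time.time() + timeout
    while time.time() < deadline:
        statuses = list(list_statuses())
        if any(s == "failed" for s in statuses):
            raise RuntimeError("file indexing failed")
        if statuses and all(s == "completed" for s in statuses):
            return
        time.sleep(interval)
    raise TimeoutError("file indexing did not finish in time")
```

Call this between creating the vector store and issuing the first file_search query.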
Multi-turn Conversations
Use previous_response_id to build multi-turn conversations without managing message history:
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8321/v1", api_key="fake")
# First turn
r1 = client.responses.create(
    model="ollama/llama3.2:3b",
    input="My name is Alice and I'm building a RAG app",
)
print("Assistant:", r1.output_text)
# Second turn - references the first
r2 = client.responses.create(
    model="ollama/llama3.2:3b",
    input="What did I say my name was?",
    previous_response_id=r1.id,
)
print("Assistant:", r2.output_text)
MCP Tools
Connect to any MCP server and use its tools:
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8321/v1", api_key="fake")
response = client.responses.create(
    model="ollama/llama3.2:3b",
    input="List files in the current directory",
    tools=[{
        "type": "mcp",
        "server_label": "filesystem",
        "server_url": "http://localhost:3000/sse",
    }],
)
print(response.output_text)
Switching Providers
The same code works with any backend. Just change the server config:
Ollama:
export OLLAMA_URL=http://localhost:11434/v1
uv run llama stack run starter
OpenAI:
export OPENAI_API_KEY=sk-xxx
uv run llama stack run starter
Your client code stays the same. Just update the model name:
response = client.responses.create(
    model="openai/gpt-4o",  # now using OpenAI
    input="What is Llama Stack?",
)
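To make a provider switch purely a configuration change, one common pattern is to read the model id from an environment variable instead of hard-coding it. A sketch; LLAMA_STACK_MODEL is an arbitrary name chosen here, not a variable the server reads:

```python
import os

# Falls back to the local Ollama model when the variable is unset.
model = os.environ.get("LLAMA_STACK_MODEL", "ollama/llama3.2:3b")
```

Then pass model= into every create call, and deployments select the backend with a single export.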
Running as a Library
You can also use Llama Stack without running a server, directly in your Python process:
from llama_stack.core.library_client import LlamaStackAsLibraryClient
client = LlamaStackAsLibraryClient("starter")
client.initialize()
# Use the same OpenAI-compatible interface
response = client.responses.create(
    model="ollama/llama3.2:3b",
    input="Hello from library mode!",
)
print(response.output_text)
Next Steps
- Building Applications - RAG, agents, safety
- Providers - all supported backends
- API Reference - full endpoint documentation
- Deploying - Kubernetes, production setup