# OpenAI API Compatibility
Llama Stack is OpenAI-first. The primary API surface implements the OpenAI spec, so you can use any OpenAI-compatible client by simply changing the base URL.
Llama Stack also provides compatibility layers for other SDKs:
- Anthropic Messages API at `/v1/messages` for teams using the Anthropic SDK. See the Anthropic Messages API docs, including a conformance report.
- Google Interactions API at `/v1alpha/interactions` for teams using the Google GenAI SDK. See the Google Interactions compatibility guide for details.
```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8321/v1", api_key="fake")

response = client.chat.completions.create(
    model="llama-3.3-70b",
    messages=[{"role": "user", "content": "Hello"}],
)
```
This works with the OpenAI Python client, TypeScript client, or any framework that speaks the OpenAI API (LangChain, LlamaIndex, CrewAI, etc.).
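Under the hood, every such client sends the same HTTP request. As a minimal sketch using only the standard library (base URL and model name are the ones from the example above), the shape of a chat completion call looks like this:

```python
import json
import urllib.request

BASE_URL = "http://localhost:8321/v1"  # local Llama Stack server, as above

# The same JSON body the OpenAI client sends for chat.completions.create
payload = {
    "model": "llama-3.3-70b",
    "messages": [{"role": "user", "content": "Hello"}],
}
req = urllib.request.Request(
    f"{BASE_URL}/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json", "Authorization": "Bearer fake"},
    method="POST",
)
# urllib.request.urlopen(req) would send it against a running server;
# any OpenAI-compatible client builds an equivalent request internally.
```

This is why swapping the base URL is the only change needed: the wire format is unchanged.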
## Implemented endpoints
| API | Endpoint | Description |
|---|---|---|
| Chat Completions | /v1/chat/completions | Text and vision inference, streaming, tool calling |
| Completions | /v1/completions | Text completions |
| Embeddings | /v1/embeddings | Text embeddings |
| Models | /v1/models | Model listing and management |
| Files | /v1/files | File upload and management |
| Vector Stores | /v1/vector_stores | Document storage and search |
| Batches | /v1/batches | Offline batch processing |
| Responses | /v1/responses | Server-side agentic orchestration with tool calling, MCP, and file search |
| Conversations | /v1/conversations | Conversation state management |
For property-level conformance details and missing endpoints, see the conformance report.
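All of the endpoints above accept OpenAI-shaped request bodies. As one illustration, a hedged sketch of an embeddings request body targeting `/v1/embeddings` (the model name follows the examples on this page; substitute whatever your stack serves):

```python
import json

# Illustrative only: an OpenAI-shaped embeddings request body. With the
# OpenAI client this is client.embeddings.create(**embeddings_payload);
# the response carries one vector per input string.
embeddings_payload = {
    "model": "llama-3.3-70b",
    "input": ["Q4 revenue grew 12%.", "Q4 revenue declined."],
}
body = json.dumps(embeddings_payload)
```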
Llama Stack also provides additional APIs beyond the OpenAI specification, including Prompts for prompt template management and File Processors for document ingestion pipelines.
## Responses API
The Responses API is Llama Stack's most distinctive feature. It moves the agentic loop to the server, so a single API call can:
- Call tools — the server executes tool calls and feeds results back to the model automatically
- Connect to MCP servers — use any Model Context Protocol server as a tool source
- Search files — built-in RAG via vector stores, no external retrieval pipeline needed
- Manage conversations — server-side state with `previous_response_id` chaining
The implementation conforms to the Open Responses specification. See how the Responses API works internally for an interactive flow diagram showing how requests are processed.
```python
response = client.responses.create(
    model="llama-3.3-70b",
    input="What files mention the Q4 results?",
    tools=[{
        "type": "file_search",
        "vector_store_ids": ["vs_abc123"],
    }],
)
```
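Conversation chaining works the same way at the request level: `previous_response_id` tells the server to load the earlier turn's state, so the client sends only the new input. A sketch of the follow-up request body (the id below is a placeholder; in practice it comes from `response.id` on the first call):

```python
import json

# Placeholder id: in practice, take response.id from the first call.
first_response_id = "resp_abc123"

# Follow-up request body: previous_response_id links this call to the
# earlier response, so no prior messages need to be resent.
follow_up = {
    "model": "llama-3.3-70b",
    "input": "Summarize what those files say about Q4.",
    "previous_response_id": first_response_id,
}
body = json.dumps(follow_up)
```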
## Provider compatibility
Not all features are available on all inference providers. See the provider compatibility matrix for tested feature coverage across Azure, Bedrock, OpenAI, vLLM, WatsonX, and others.
## Known limitations
For a detailed breakdown of schema differences and conformance issues by endpoint, see the conformance report and the Responses API limitations.