OpenAI API Compatibility

Llama Stack implements the OpenAI API, so you can use any OpenAI-compatible client — just change the base URL.

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8321/v1", api_key="fake")
response = client.chat.completions.create(
    model="llama-3.3-70b",
    messages=[{"role": "user", "content": "Hello"}],
)
```

This works with the OpenAI Python client, TypeScript client, or any framework that speaks the OpenAI API (LangChain, LlamaIndex, CrewAI, etc.).
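Streaming works through the same client. The sketch below assumes a local Llama Stack server on port 8321; the model id is a placeholder for whatever model your distribution actually serves.

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8321/v1", api_key="fake")

# stream=True yields chunks as tokens are generated; each chunk
# carries an incremental delta rather than the full message.
stream = client.chat.completions.create(
    model="llama-3.3-70b",  # placeholder: use a model your server serves
    messages=[{"role": "user", "content": "Write a haiku about llamas."}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```

Because this is the stock OpenAI client, the same loop works unchanged against api.openai.com or any other OpenAI-compatible backend.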

Implemented endpoints

| API | Endpoint | Description |
|---|---|---|
| Chat Completions | /v1/chat/completions | Text and vision inference, streaming, tool calling |
| Completions | /v1/completions | Text completions |
| Embeddings | /v1/embeddings | Text embeddings |
| Models | /v1/models | Model listing and management |
| Files | /v1/files | File upload and management |
| Vector Stores | /v1/vector_stores | Document storage and search |
| Batches | /v1/batches | Offline batch processing |
| Responses | /v1/responses | Server-side agentic orchestration with tool calling, MCP, and file search |
| Conversations | /v1/conversations | Conversation state management |
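The standard endpoints all work through the stock client. As one example, embeddings (the embedding model id below is a placeholder, not part of any default distribution):

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8321/v1", api_key="fake")

# /v1/embeddings returns one vector per input string.
resp = client.embeddings.create(
    model="nomic-embed-text",  # placeholder: use an embedding model your server serves
    input=["Q4 revenue grew 12%", "The llama is a domesticated camelid"],
)
vectors = [item.embedding for item in resp.data]
```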

For property-level conformance details and missing endpoints, see the conformance report.

Llama Stack also provides additional APIs beyond the OpenAI specification, including Prompts for prompt template management and File Processors for document ingestion pipelines.

Responses API

The Responses API is Llama Stack's most distinctive feature. It moves the agentic loop to the server, so a single API call can:

  • Call tools — the server executes tool calls and feeds results back to the model automatically
  • Connect to MCP servers — use any Model Context Protocol server as a tool source
  • Search files — built-in RAG via vector stores, no external retrieval pipeline needed
  • Manage conversations — server-side state with previous_response_id chaining

The implementation conforms to the Open Responses specification.

```python
response = client.responses.create(
    model="llama-3.3-70b",
    input="What files mention the Q4 results?",
    tools=[{
        "type": "file_search",
        "vector_store_ids": ["vs_abc123"],
    }],
)
```
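The previous_response_id chaining mentioned above can be sketched like this. This is a minimal, self-contained example assuming a local server; the model id and vector store id are placeholders.

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8321/v1", api_key="fake")

# First turn: the server runs the agentic loop and stores the result.
first = client.responses.create(
    model="llama-3.3-70b",  # placeholder model id
    input="What files mention the Q4 results?",
)

# Second turn: pass previous_response_id instead of resending history.
# The server replays the prior context on your behalf.
follow_up = client.responses.create(
    model="llama-3.3-70b",
    input="Summarize those files in one paragraph.",
    previous_response_id=first.id,
)
print(follow_up.output_text)
```

Because state lives server-side, the client stays stateless: each turn sends only the new input and the id of the response it follows.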

Provider compatibility

Not all features are available on all inference providers. See the provider compatibility matrix for tested feature coverage across Azure, Bedrock, OpenAI, vLLM, WatsonX, and others.

Known limitations

For a detailed breakdown of schema differences and conformance issues by endpoint, see the conformance report and the Responses API limitations.