Welcome to Llama Stack
Open-source AI application server. Not just inference routing: the full stack.
Llama Stack composes inference, vector stores, file storage, safety, tool calling, and agentic orchestration into a single OpenAI-compatible server. Use any client, any language, any model. Swap providers without changing application code.
```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8321/v1", api_key="fake")

response = client.chat.completions.create(
    model="llama-3.3-70b",
    messages=[{"role": "user", "content": "Hello"}],
)
```
Choose your path
Quickstart
Install and run Llama Stack in under 5 minutes
API Reference
OpenAI-compatible API endpoints and usage
Providers
Inference, vector store, safety, and tool providers
Core Concepts
Architecture, APIs, distributions, and resources
Building Applications
RAG, agents, tools, evals, and more
Distributions
Pre-packaged deployment configurations
What you get
| API | Endpoint | Description |
|---|---|---|
| Chat Completions | /v1/chat/completions | Text and vision inference, streaming, tool calling |
| Responses | /v1/responses | Server-side agentic orchestration with tool calling, MCP integration, and built-in file search (RAG) |
| Embeddings | /v1/embeddings | Text embeddings for search and retrieval |
| Vector Stores | /v1/vector_stores | Managed document storage and search |
| Moderations | /v1/moderations | Content safety via Llama Guard and other shields |
| Files | /v1/files | File upload and management |
| Batches | /v1/batches | Offline batch processing |
| Models | /v1/models | Model listing and management |
| Messages | /v1/messages | Anthropic Messages API adapter |
Beyond the OpenAI specification, Llama Stack provides Prompts for prompt template management, File Processors for document ingestion, and Connectors for external tool registration.
The Responses API implementation conforms to the Open Responses specification. See the API conformance report for detailed coverage.
A server, not a library
Llama Stack is an HTTP server. Your application talks to a standard API over HTTP. This is a different architectural choice from SDK-level frameworks that abstract at the Python import level.
The consequence: your application is language-agnostic. Write it in Python, Go, TypeScript, or curl. Swap the server without touching application code. Replace the entire inference backend without redeploying your application.
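In practice that means the server URL can live in configuration, so swapping backends becomes a deployment change rather than a code change. A sketch; the environment variable name here is illustrative, not a Llama Stack convention:

```python
import os

from openai import OpenAI

# Point the same application at a laptop stack or a production stack
# by changing one environment variable; the code stays identical.
client = OpenAI(
    base_url=os.environ.get("LLAMA_STACK_URL", "http://localhost:8321/v1"),
    api_key="fake",
)

def ask(prompt: str, model: str = "llama-3.3-70b") -> str:
    """Single-turn chat request; works against any configured backend."""
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content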
50+ pluggable providers
Llama Stack has a pluggable provider architecture across every API, not just inference.
- 23 inference providers: Ollama, vLLM, OpenAI, Anthropic, AWS Bedrock, Azure OpenAI, Gemini, Vertex AI, NVIDIA NIM, Fireworks, Together AI, Groq, SambaNova, Cerebras, WatsonX, and more
- 15 vector store providers: FAISS, SQLite-vec, ChromaDB, Qdrant, Milvus, PGVector, Weaviate, Elasticsearch, and more
- 7 safety providers: Llama Guard, Prompt Guard, Code Scanner, Bedrock Guardrails, NVIDIA NeMo, and more
- 6 tool runtimes: File Search, Brave/Bing/Tavily web search, Wolfram Alpha, MCP
Develop locally with Ollama and FAISS. Deploy to production with vLLM and PGVector. Wrap Bedrock or Vertex without lock-in. Same API surface, different backend.
See the provider documentation for the full list and the provider compatibility matrix for tested feature coverage.
Get started
```bash
# One-line install
curl -LsSf https://github.com/llamastack/llama-stack/raw/main/scripts/install.sh | bash

# Or install via pip
pip install llama-stack

# Start the server
llama stack run
```
Llama Stack works with any OpenAI-compatible client. Point your existing code at http://localhost:8321/v1 and you're ready to go.
Quick Start Guide | OpenAI API Compatibility | GitHub
Found an issue with the docs?
If you find an issue with our documentation, please open a bug in our GitHub Issues referencing the page and the problem you're facing. Thank you for helping us improve our documentation!