# Welcome to Llama Stack
Open-source agentic API server for building AI applications. OpenAI-compatible. Any model, any infrastructure.
Llama Stack is a drop-in replacement for the OpenAI API that you can run anywhere — your laptop, your datacenter, or the cloud. Use any OpenAI-compatible client or agentic framework. Swap between Llama, GPT, Gemini, Mistral, or any model without changing your application code.
```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8321/v1", api_key="fake")

response = client.chat.completions.create(
    model="llama-3.3-70b",
    messages=[{"role": "user", "content": "Hello"}],
)
```
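Because the server is OpenAI-compatible, no SDK is required at all. Here is the same request as a dependency-free sketch using only the standard library (the port and model id assume the defaults shown above; the try/except only guards against the server not running yet):

```python
import json
import urllib.request

# Same chat completion as above, via plain HTTP (no SDK required).
payload = {
    "model": "llama-3.3-70b",
    "messages": [{"role": "user", "content": "Hello"}],
}
req = urllib.request.Request(
    "http://localhost:8321/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json", "Authorization": "Bearer fake"},
)
try:
    with urllib.request.urlopen(req) as resp:
        print(json.load(resp)["choices"][0]["message"]["content"])
except OSError:
    # No Llama Stack server reachable at localhost:8321
    print("could not reach the server")
```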
## What you get
| API | Endpoint | Description |
|---|---|---|
| Chat Completions | /v1/chat/completions | Text and vision inference, streaming, tool calling |
| Responses | /v1/responses | Server-side agentic orchestration with tool calling, MCP integration, and built-in file search (RAG) |
| Embeddings | /v1/embeddings | Text embeddings for search and retrieval |
| Vector Stores | /v1/vector_stores | Managed document storage and search |
| Files | /v1/files | File upload and management |
| Batches | /v1/batches | Offline batch processing |
| Models | /v1/models | Model listing and management |
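Each endpoint follows the familiar OpenAI request shape. As one example, here is a minimal embeddings sketch (the model id `all-minilm` is illustrative; use any embedding-capable model registered with your server):

```python
import json
import urllib.request

# Embed a batch of texts in one call; the model id is illustrative.
payload = {
    "model": "all-minilm",
    "input": ["what is llama stack?", "an open-source agentic API server"],
}
req = urllib.request.Request(
    "http://localhost:8321/v1/embeddings",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json", "Authorization": "Bearer fake"},
)
try:
    with urllib.request.urlopen(req) as resp:
        vectors = [item["embedding"] for item in json.load(resp)["data"]]
        print(f"{len(vectors)} embeddings of dimension {len(vectors[0])}")
except OSError:
    print("could not reach the server")
```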
Llama Stack also provides additional APIs beyond the OpenAI specification, including Prompts for prompt template management and File Processors for document ingestion pipelines.
The Responses API implementation conforms to the Open Responses specification. See the API conformance report for detailed coverage.
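A hedged sketch of a single Responses call that uses the built-in file search tool, assuming a server with a populated vector store; the model id and `vs_abc123` are placeholders, and the server handles the tool round-trips:

```python
import json
import urllib.request

# One Responses call; the server orchestrates file_search against the
# referenced vector store. Model and vector store ids are placeholders.
payload = {
    "model": "llama-3.3-70b",
    "input": "What does the design doc say about caching?",
    "tools": [{"type": "file_search", "vector_store_ids": ["vs_abc123"]}],
}
req = urllib.request.Request(
    "http://localhost:8321/v1/responses",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json", "Authorization": "Bearer fake"},
)
try:
    with urllib.request.urlopen(req) as resp:
        # The response carries an "output" list of model and tool items.
        print(json.load(resp)["output"])
except OSError:
    print("could not reach the server")
```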
## Use any model, use any infrastructure
Llama Stack has a pluggable provider architecture. Develop locally with Ollama, deploy to production with vLLM, or connect to a managed service — the API stays the same.
Inference providers: Ollama, vLLM, TGI, Fireworks, Together, AWS Bedrock, Azure OpenAI, NVIDIA NIM, OpenAI, Anthropic, Gemini, Groq, SambaNova, Cerebras, WatsonX, and more.
Vector store providers: FAISS, SQLite-vec, Milvus, ChromaDB, PGVector, Qdrant, Weaviate, Elasticsearch, Infinispan.
See the provider documentation for the full list and the provider compatibility matrix for tested feature coverage across providers.
## Get started
```bash
# One-line install
curl -LsSf https://github.com/llamastack/llama-stack/raw/main/scripts/install.sh | bash

# Or install via pip
pip install llama-stack

# Start the server
llama stack run
```
Then connect with any OpenAI-compatible client.
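As a quick sanity check that the server is up, you can list the registered models (a stdlib-only probe; adjust the URL if you changed the default port):

```python
import json
import urllib.request

url = "http://localhost:8321/v1/models"
try:
    with urllib.request.urlopen(url) as resp:
        for model in json.load(resp)["data"]:
            print(model["id"])
except OSError:
    print(f"server not reachable at {url}")
```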