# Welcome to Llama Stack
Open-source agentic API server for building AI applications. OpenAI-compatible. Any model, any infrastructure.
Llama Stack is a drop-in replacement for the OpenAI API that you can run anywhere — your laptop, your datacenter, or the cloud. Use any OpenAI-compatible client or agentic framework. Swap between Llama, GPT, Gemini, Mistral, or any model without changing your application code.
```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8321/v1", api_key="fake")

response = client.chat.completions.create(
    model="llama-3.3-70b",
    messages=[{"role": "user", "content": "Hello"}],
)
```
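Because the server is OpenAI-compatible, no SDK is required at all. Here is the same request as a dependency-free sketch using only the standard library (the port and model id assume the defaults shown above; the try/except only guards against the server not running yet):

```python
import json
import urllib.request

# Same chat completion as above, via plain HTTP (no SDK required).
payload = {
    "model": "llama-3.3-70b",
    "messages": [{"role": "user", "content": "Hello"}],
}
req = urllib.request.Request(
    "http://localhost:8321/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json", "Authorization": "Bearer fake"},
)
try:
    with urllib.request.urlopen(req) as resp:
        print(json.load(resp)["choices"][0]["message"]["content"])
except OSError:
    # No Llama Stack server reachable at localhost:8321
    print("could not reach the server")
```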
## What you get
| API | Endpoint | Description |
|---|---|---|
| Chat Completions | /v1/chat/completions | Text and vision inference, streaming, tool calling |
| Responses | /v1/responses | Server-side agentic orchestration with tool calling, MCP integration, and built-in file search (RAG) |
| Embeddings | /v1/embeddings | Text embeddings for search and retrieval |
| Vector Stores | /v1/vector_stores | Managed document storage and search |
| Files | /v1/files | File upload and management |
| Batches | /v1/batches | Offline batch processing |
| Models | /v1/models | Model listing and management |
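Each endpoint follows the familiar OpenAI request shape. As one example, here is a minimal embeddings sketch (the model id `all-minilm` is illustrative; use any embedding-capable model registered with your server):

```python
import json
import urllib.request

# Embed a batch of texts in one call; the model id is illustrative.
payload = {
    "model": "all-minilm",
    "input": ["what is llama stack?", "an open-source agentic API server"],
}
req = urllib.request.Request(
    "http://localhost:8321/v1/embeddings",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json", "Authorization": "Bearer fake"},
)
try:
    with urllib.request.urlopen(req) as resp:
        vectors = [item["embedding"] for item in json.load(resp)["data"]]
        print(f"{len(vectors)} embeddings of dimension {len(vectors[0])}")
except OSError:
    print("could not reach the server")
```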
Llama Stack also provides additional APIs beyond the OpenAI specification, including Prompts for prompt template management and File Processors for document ingestion pipelines.
The Responses API implementation conforms to the Open Responses specification. See the API conformance report for detailed coverage.
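A hedged sketch of a single Responses call that uses the built-in file search tool, assuming a server with a populated vector store; the model id and `vs_abc123` are placeholders, and the server handles the tool round-trips:

```python
import json
import urllib.request

# One Responses call; the server orchestrates file_search against the
# referenced vector store. Model and vector store ids are placeholders.
payload = {
    "model": "llama-3.3-70b",
    "input": "What does the design doc say about caching?",
    "tools": [{"type": "file_search", "vector_store_ids": ["vs_abc123"]}],
}
req = urllib.request.Request(
    "http://localhost:8321/v1/responses",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json", "Authorization": "Bearer fake"},
)
try:
    with urllib.request.urlopen(req) as resp:
        # The response carries an "output" list of model and tool items.
        print(json.load(resp)["output"])
except OSError:
    print("could not reach the server")
```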
## Use any model, use any infrastructure
Llama Stack has a pluggable provider architecture. Develop locally with Ollama, deploy to production with vLLM, or connect to a managed service — the API stays the same.
Inference providers: Ollama, vLLM, TGI, Fireworks, Together, AWS Bedrock, Azure OpenAI, NVIDIA NIM, OpenAI, Anthropic, Gemini, Groq, SambaNova, Cerebras, WatsonX, and more.
Vector store providers: FAISS, SQLite-vec, Milvus, ChromaDB, PGVector, Qdrant, Weaviate, Elasticsearch, Infinispan.
See the provider documentation for the full list and the provider compatibility matrix for tested feature coverage across providers.
## Get started
```bash
# One-line install
curl -LsSf https://github.com/llamastack/llama-stack/raw/main/scripts/install.sh | bash

# Or install via pip
pip install llama-stack

# Start the server
llama stack run
```
Then connect with any OpenAI-compatible client.
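As a quick sanity check that the server is up, you can list the registered models (a stdlib-only probe; adjust the URL if you changed the default port):

```python
import json
import urllib.request

url = "http://localhost:8321/v1/models"
try:
    with urllib.request.urlopen(url) as resp:
        for model in json.load(resp)["data"]:
            print(model["id"])
except OSError:
    print(f"server not reachable at {url}")
```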