Welcome to Llama Stack

Open-source agentic API server for building AI applications. OpenAI-compatible. Any model, any infrastructure.

Llama Stack is a drop-in replacement for the OpenAI API that you can run anywhere — your laptop, your datacenter, or the cloud. Use any OpenAI-compatible client or agentic framework. Swap between Llama, GPT, Gemini, Mistral, or any model without changing your application code.

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8321/v1", api_key="fake")
response = client.chat.completions.create(
    model="llama-3.3-70b",
    messages=[{"role": "user", "content": "Hello"}],
)

What you get

| API | Endpoint | Description |
|-----|----------|-------------|
| Chat Completions | /v1/chat/completions | Text and vision inference, streaming, tool calling |
| Responses | /v1/responses | Server-side agentic orchestration with tool calling, MCP integration, and built-in file search (RAG) |
| Embeddings | /v1/embeddings | Text embeddings for search and retrieval |
| Vector Stores | /v1/vector_stores | Managed document storage and search |
| Files | /v1/files | File upload and management |
| Batches | /v1/batches | Offline batch processing |
| Models | /v1/models | Model listing and management |

Llama Stack also provides additional APIs beyond the OpenAI specification, including Prompts for prompt template management and File Processors for document ingestion pipelines.

The Responses API implementation conforms to the Open Responses specification. See the API conformance report for detailed coverage.

Use any model, use any infrastructure

Llama Stack has a pluggable provider architecture. Develop locally with Ollama, deploy to production with vLLM, or connect to a managed service — the API stays the same.
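In practice, "the API stays the same" means moving between environments is just a base-URL change. A minimal sketch, where the URLs are illustrative assumptions rather than fixed Llama Stack defaults:

```python
# Each entry points at a Llama Stack server fronting a different provider;
# the client code that talks to it never changes.
ENDPOINTS = {
    "local": "http://localhost:8321/v1",        # e.g. Llama Stack over Ollama
    "prod": "https://llama-stack.internal/v1",  # e.g. Llama Stack over vLLM
}

def base_url(env: str) -> str:
    """Resolve the OpenAI-compatible endpoint for an environment."""
    return ENDPOINTS[env]
```

Pass the result to any OpenAI-compatible client, e.g. `OpenAI(base_url=base_url("local"), api_key="fake")`, and swap environments without touching application code.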

Inference providers: Ollama, vLLM, TGI, Fireworks, Together, AWS Bedrock, Azure OpenAI, NVIDIA NIM, OpenAI, Anthropic, Gemini, Groq, SambaNova, Cerebras, WatsonX, and more.

Vector store providers: FAISS, SQLite-vec, Milvus, ChromaDB, PGVector, Qdrant, Weaviate, Elasticsearch, Infinispan.

See the provider documentation for the full list and the provider compatibility matrix for tested feature coverage across providers.

Get started

# One-line install
curl -LsSf https://github.com/llamastack/llama-stack/raw/main/scripts/install.sh | bash

# Or install via pip
pip install llama-stack

# Start the server
llama stack run

Then connect with any OpenAI-compatible client.
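One quick way to confirm the server is up is to list its models, which needs nothing beyond the standard library. A sketch assuming the default port:

```python
import json
import urllib.request

BASE = "http://localhost:8321/v1"

def list_models(base: str = BASE):
    """Return the model IDs the server exposes via /v1/models,
    or None if the server is unreachable."""
    try:
        with urllib.request.urlopen(f"{base}/models", timeout=5) as resp:
            payload = json.load(resp)
        return [m["id"] for m in payload.get("data", [])]
    except OSError:
        return None

if __name__ == "__main__":
    models = list_models()
    print(models if models is not None else "server not running")
```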

Quick Start Guide | OpenAI API Compatibility | GitHub