Welcome to Llama Stack

Open-source AI application server. Not just inference routing, the full stack.

Llama Stack composes inference, vector stores, file storage, safety, tool calling, and agentic orchestration into a single OpenAI-compatible server. Use any client, any language, any model. Swap providers without changing application code.

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8321/v1", api_key="fake")
response = client.chat.completions.create(
    model="llama-3.3-70b",
    messages=[{"role": "user", "content": "Hello"}],
)

What you get

| API | Endpoint | Description |
| --- | --- | --- |
| Chat Completions | /v1/chat/completions | Text and vision inference, streaming, tool calling |
| Responses | /v1/responses | Server-side agentic orchestration with tool calling, MCP integration, and built-in file search (RAG) |
| Embeddings | /v1/embeddings | Text embeddings for search and retrieval |
| Vector Stores | /v1/vector_stores | Managed document storage and search |
| Moderations | /v1/moderations | Content safety via Llama Guard and other shields |
| Files | /v1/files | File upload and management |
| Batches | /v1/batches | Offline batch processing |
| Models | /v1/models | Model listing and management |
| Messages | /v1/messages | Anthropic Messages API adapter |

Beyond the OpenAI specification, Llama Stack provides Prompts for prompt template management, File Processors for document ingestion, and Connectors for external tool registration.

The Responses API implementation conforms to the Open Responses specification. See the API conformance report for detailed coverage.
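Because every API above is plain HTTP plus JSON, a Responses call is just a POST to /v1/responses. Here is a minimal sketch of the request payload, assuming the llama-3.3-70b model alias from the example above; the field names follow the OpenAI Responses API shape, and the vector store ID is a placeholder:

```python
import json

# Request body for POST /v1/responses (OpenAI Responses API shape).
# "llama-3.3-70b" is the model alias used in the example above.
payload = {
    "model": "llama-3.3-70b",
    "input": "Summarize the attached report.",
    # Built-in file search against a vector store (RAG);
    # "vs_123" is a placeholder vector store ID.
    "tools": [{"type": "file_search", "vector_store_ids": ["vs_123"]}],
}

body = json.dumps(payload)
print(body)
```

With the OpenAI Python SDK, the same request corresponds to `client.responses.create(model=..., input=..., tools=...)`.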

A server, not a library

Llama Stack is an HTTP server. Your application talks to a standard API over HTTP. This is a different architectural choice from SDK-level frameworks that abstract at the Python import level.

The consequence: your application is language-agnostic. Write it in Python, Go, TypeScript, or curl. Swap the server without touching application code. Replace the entire inference backend without redeploying your application.
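For example, the same chat completion as the Python snippet above can be issued with nothing but curl (the model alias matches the earlier example; adjust it to whatever your server serves):

```shell
curl http://localhost:8321/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer fake" \
  -d '{
    "model": "llama-3.3-70b",
    "messages": [{"role": "user", "content": "Hello"}]
  }'
```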

50+ pluggable providers

Llama Stack has a pluggable provider architecture across every API, not just inference.

  • 23 inference providers: Ollama, vLLM, OpenAI, Anthropic, AWS Bedrock, Azure OpenAI, Gemini, Vertex AI, NVIDIA NIM, Fireworks, Together AI, Groq, SambaNova, Cerebras, WatsonX, and more
  • 15 vector store providers: FAISS, SQLite-vec, ChromaDB, Qdrant, Milvus, PGVector, Weaviate, Elasticsearch, and more
  • 7 safety providers: Llama Guard, Prompt Guard, Code Scanner, Bedrock Guardrails, NVIDIA NeMo, and more
  • 6 tool runtimes: File Search, Brave/Bing/Tavily web search, Wolfram Alpha, MCP

Develop locally with Ollama and FAISS. Deploy to production with vLLM and PGVector. Wrap Bedrock or Vertex without lock-in. Same API surface, different backend.
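The swap is a configuration change, not a code change. As a hypothetical sketch of a run configuration's inference section (field names are illustrative, not the exact schema; see the provider documentation for the real format):

```yaml
# Local development: Ollama for inference
providers:
  inference:
  - provider_id: ollama
    provider_type: remote::ollama
    config:
      url: http://localhost:11434

# Production: replace the block with vLLM; application code is untouched
#  - provider_id: vllm
#    provider_type: remote::vllm
#    config:
#      url: http://vllm.internal:8000/v1
```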

See the provider documentation for the full list and the provider compatibility matrix for tested feature coverage.

Get started

# One-line install
curl -LsSf https://github.com/llamastack/llama-stack/raw/main/scripts/install.sh | bash

# Or install via pip
pip install llama-stack

# Start the server
llama stack run
Tip: Llama Stack works with any OpenAI-compatible client. Point your existing code at http://localhost:8321/v1 and you're ready to go.
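Once the server is up, you can sanity-check it by listing the available models over plain HTTP; the response follows the OpenAI model-list format:

```shell
curl http://localhost:8321/v1/models
```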

Quick Start Guide | OpenAI API Compatibility | GitHub

Found an issue with the docs?

If you find an issue with our documentation, please open a bug in our GitHub Issues, referencing the page and the problem you are facing. Thank you for helping us improve our documentation!