API Reference
Llama Stack provides a comprehensive set of APIs for building generative AI applications. All APIs follow OpenAI-compatible conventions, so the same client code works interchangeably across different providers.
Core APIs
Inference API
Run inference with Large Language Models (LLMs) and embedding models.
Supported Providers:
- Builtin (Single Node)
- Ollama (Single Node)
- Fireworks (Hosted)
- Together (Hosted)
- NVIDIA NIM (Hosted and Single Node)
- vLLM (Hosted and Single Node)
- AWS Bedrock (Hosted)
- Cerebras (Hosted)
- Groq (Hosted)
- SambaNova (Hosted)
- PyTorch ExecuTorch (On-device iOS, Android)
- OpenAI (Hosted)
- Anthropic (Hosted)
- Gemini (Hosted)
- WatsonX (Hosted)
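Because the Inference API follows the OpenAI-compatible chat completions format, a request body takes the familiar shape below. This is a sketch: the model identifier is a placeholder, not a default shipped with any provider; use whatever model your configured inference provider serves.

```python
# Minimal OpenAI-style chat completion request body (sketch).
request_body = {
    "model": "meta-llama/Llama-3.1-8B-Instruct",  # placeholder model id
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Summarize what a vector store does."},
    ],
    "temperature": 0.7,
    "max_tokens": 256,
}
```

The same body works against any of the providers above; only the server you point your client at changes.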
Agents API
Run multi-step agentic workflows with LLMs, including tool usage, memory (RAG), and complex reasoning.
Supported Providers:
- Builtin (Single Node)
- Fireworks (Hosted)
- Together (Hosted)
- PyTorch ExecuTorch (On-device iOS)
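Tool usage in agentic workflows builds on the OpenAI function-calling format. As an illustration (the tool name and parameters here are made up for the example, not part of any built-in toolset), a tool an agent may call is declared like this:

```python
# OpenAI-style tool (function) definition, as passed in tool-calling requests.
# The "web_search" name and its parameters are illustrative only.
web_search_tool = {
    "type": "function",
    "function": {
        "name": "web_search",
        "description": "Search the web for a query and return top results.",
        "parameters": {
            "type": "object",
            "properties": {
                "query": {"type": "string", "description": "Search query"},
            },
            "required": ["query"],
        },
    },
}
```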
Vector IO API
Perform operations on vector stores, including adding, searching, and deleting documents.
Supported Providers:
- FAISS (Single Node)
- SQLite-Vec (Single Node)
- Chroma (Hosted and Single Node)
- Milvus (Hosted and Single Node)
- Postgres (PGVector) (Hosted and Single Node)
- Weaviate (Hosted)
- Qdrant (Hosted and Single Node)
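Following the OpenAI-compatible vector store conventions, a search is a POST against a store-specific route with the query in the body. A sketch, with placeholder identifiers (the store ID and the `max_num_results` cap are assumptions for illustration):

```python
# Search an existing vector store (identifiers are placeholders).
vector_store_id = "vs_example123"
endpoint = f"/v1/vector_stores/{vector_store_id}/search"
search_body = {
    "query": "how do I rotate API keys?",
    "max_num_results": 5,  # cap on returned chunks (illustrative)
}
```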
Files API (OpenAI-compatible)
Manage file uploads, storage, and retrieval with OpenAI-compatible endpoints.
Supported Providers:
- Local Filesystem (Single Node)
- S3 (Hosted)
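In the OpenAI-compatible files convention, an upload is a multipart/form-data POST carrying a `purpose` field and the file itself. The helper below is a stdlib-only sketch of building such a body, assuming an OpenAI-style `/v1/files` route; the filename and contents are placeholders:

```python
import io
import uuid

def encode_multipart(fields: dict, filename: str, data: bytes) -> tuple[bytes, str]:
    """Build a multipart/form-data body for an OpenAI-style file upload."""
    boundary = uuid.uuid4().hex
    buf = io.BytesIO()
    # Plain form fields (e.g. "purpose").
    for name, value in fields.items():
        buf.write(
            f"--{boundary}\r\nContent-Disposition: form-data; "
            f"name=\"{name}\"\r\n\r\n{value}\r\n".encode()
        )
    # The file part itself.
    buf.write(
        f"--{boundary}\r\nContent-Disposition: form-data; "
        f"name=\"file\"; filename=\"{filename}\"\r\n".encode()
    )
    buf.write(b"Content-Type: application/octet-stream\r\n\r\n")
    buf.write(data)
    buf.write(f"\r\n--{boundary}--\r\n".encode())
    return buf.getvalue(), f"multipart/form-data; boundary={boundary}"

body, content_type = encode_multipart(
    {"purpose": "assistants"}, "notes.txt", b"hello world"
)
```

The returned `content_type` string goes in the request's `Content-Type` header so the server can locate the boundary.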
Vector Store Files API (OpenAI-compatible)
Integrate file operations with vector stores for automatic document processing and search.
Supported Providers:
- FAISS (Single Node)
- SQLite-Vec (Single Node)
- Milvus (Single Node)
- Chroma (Hosted and Single Node)
- Qdrant (Hosted and Single Node)
- Weaviate (Hosted)
- Postgres (PGVector) (Hosted and Single Node)
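Tying the two APIs together: in the OpenAI-compatible convention, a previously uploaded file is attached to a vector store by POSTing its file ID to the store's files route, after which it is processed and becomes searchable. A sketch with placeholder IDs:

```python
# Attach an uploaded file to a vector store (all identifiers are placeholders).
vector_store_id = "vs_example123"
file_id = "file-abc123"  # returned by a prior file upload
endpoint = f"/v1/vector_stores/{vector_store_id}/files"
attach_body = {"file_id": file_id}
```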
Safety API
Apply safety policies to outputs at the system level, not just the model level.
Supported Providers:
- Llama Guard (Depends on Inference Provider)
- Prompt Guard (Single Node)
- Code Scanner (Single Node)
- AWS Bedrock (Hosted)
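A safety check passes candidate messages through a configured shield such as Llama Guard and returns whether a policy was violated. The payload below is a sketch only; the field names and the shield identifier are illustrative, not the authoritative schema, which may differ between Llama Stack versions:

```python
# Illustrative safety-check payload: screen a user message with a guard model.
# "shield_id" and its value are assumptions for the example.
safety_request = {
    "shield_id": "llama-guard",
    "messages": [
        {"role": "user", "content": "How do I reset my account password?"},
    ],
}
```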
Post Training API
Fine-tune models for specific use cases and domains.
Supported Providers:
- Builtin (Single Node)
- HuggingFace (Single Node)
- TorchTune (Single Node)
- NVIDIA NeMo (Hosted)
Eval API
Generate outputs and perform scoring to evaluate system performance.
Supported Providers:
- Builtin (Single Node)
- NVIDIA NeMo (Hosted)
Telemetry API
Collect telemetry data from the system for monitoring and observability.
Supported Providers:
- Builtin (Single Node)
Tool Runtime API
Interact with various tools and protocols to extend LLM capabilities.
Supported Providers:
- Brave Search (Hosted)
- RAG Runtime (Single Node)
API Compatibility
All Llama Stack APIs are designed to be OpenAI-compatible, allowing you to:
- Use existing OpenAI API clients and tools
- Migrate from OpenAI to other providers seamlessly
- Maintain consistent API contracts across different environments
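In practice, compatibility means switching providers is mostly a matter of changing the base URL (and credentials) your client points at. A stdlib-only sketch of building such a request, where the server URL and model identifier are placeholders for your own deployment:

```python
import json
import urllib.request

BASE_URL = "http://localhost:8321/v1"  # placeholder: your Llama Stack server

# Build (but do not send) an OpenAI-style chat completion request.
req = urllib.request.Request(
    url=f"{BASE_URL}/chat/completions",
    data=json.dumps({
        "model": "meta-llama/Llama-3.1-8B-Instruct",  # placeholder model id
        "messages": [{"role": "user", "content": "Hello!"}],
    }).encode(),
    headers={"Content-Type": "application/json"},
    method="POST",
)
# urllib.request.urlopen(req) would send it; omitted here since it needs a
# running server.
```

An existing OpenAI client library can be used the same way by overriding its base URL with your server's address.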
Getting Started
To get started with Llama Stack APIs:
- Choose a Distribution: Select a pre-configured distribution that matches your environment
- Configure Providers: Set up the providers you want to use for each API
- Start the Server: Launch the Llama Stack server with your configuration
- Use the APIs: Make requests to the API endpoints using your preferred client
For detailed setup instructions, see our Getting Started Guide.
Provider Details
For complete provider compatibility and setup instructions, see our Providers Documentation.
API Stability
Llama Stack APIs are organized by stability level:
- Stable APIs - Production-ready APIs with full support
- Experimental APIs - APIs in development with limited support
- Deprecated APIs - Legacy APIs being phased out
OpenAI Integration
For specific OpenAI API compatibility features, see our OpenAI Compatibility Guide.