Architecture

Llama Stack is an AI application server that composes inference, vector stores, file storage, safety, tools, and agentic orchestration behind a unified, OpenAI-compatible API. It is provider-agnostic: the same API works whether the backend is Ollama, OpenAI, vLLM, Bedrock, or dozens of other services.

This is a server-level abstraction, not a library. Your application talks to Llama Stack over HTTP. It doesn't import a Python SDK, doesn't couple to a specific framework, and doesn't manage provider connections. The server handles provider routing, tool execution, safety checks, and multi-turn orchestration. Your application makes HTTP calls.
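For example, a chat completion is a single HTTP POST. The sketch below builds such a request with Python's standard library; the base URL and port (8321) are assumptions that depend on how the server is launched, and the model id is a placeholder:

```python
import json
import urllib.request

# Assumed default local address; adjust to match your server's run config.
BASE_URL = "http://localhost:8321"

payload = {
    "model": "ollama/llama3.2:3b",  # provider-prefixed model id
    "messages": [{"role": "user", "content": "Hello"}],
}

req = urllib.request.Request(
    f"{BASE_URL}/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
    method="POST",
)
# urllib.request.urlopen(req) would send it; since the endpoint is
# OpenAI-compatible, any OpenAI client library works just as well.
```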

*(Diagram: Llama Stack architecture)*

API Surface

Llama Stack exposes three API families through a single server:

  • OpenAI API — Chat Completions, Responses, Embeddings, Vector Stores, Files, Batches, Models, Moderations, Conversations. This is the primary API surface. Any OpenAI SDK works.
  • Anthropic API — Messages endpoint (/v1/messages). An adapter that translates Anthropic-format requests to the inference API. Teams using the Anthropic SDK can point it at Llama Stack without code changes. See the Messages provider docs for current limitations.
  • Native APIs — Connectors, Tools, File Processors, and Safety shields. These extend beyond the OpenAI and Anthropic specs for capabilities like MCP tool registration and document ingestion.

The Responses API (/v1/responses) deserves special attention. It implements server-side agentic orchestration: when a model requests tool calls, the server executes them (file search, web search, MCP tools), feeds results back, and repeats until the model produces a final response. Your application sends one request and gets back a complete, tool-augmented answer. This orchestration runs on the server, not in your application code.
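A single Responses API request that exercises this orchestration might look like the following sketch. Field names follow the OpenAI Responses API; the model id and vector store id are placeholders:

```python
import json

# One hypothetical /v1/responses request body. Attaching a file_search tool
# means the server may run one or more searches against the vector store
# before returning the final response.
request_body = {
    "model": "ollama/llama3.2:3b",
    "input": "What does the architecture doc say about routing?",
    "tools": [
        {"type": "file_search", "vector_store_ids": ["vs_placeholder"]},
    ],
}

# The server executes the tool calls, feeds results back to the model, and
# repeats until a final answer is produced -- the client sends exactly one
# request and never runs a tool loop.
print(json.dumps(request_body, indent=2))
```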

Two Packages

The codebase is split into two packages:

  • llama-stack-api - Lightweight package with API protocol definitions, Pydantic data types, and provider specs. No server code. Third-party providers depend only on this.
  • llama-stack - The server: provider resolution, routing, storage, CLI, and all built-in providers.

Request Flow

Example: a POST /v1/chat/completions request with model: "ollama/llama3.2:3b" flows through the server as follows:

  1. FastAPI dispatches to the inference router
  2. InferenceRouter calls routing_table.get_provider_impl("ollama/llama3.2:3b")
  3. The routing table finds the model belongs to the ollama provider
  4. The router delegates to the Ollama provider's openai_chat_completion() method
  5. The provider forwards to the Ollama server and streams the response back
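Steps 2-4 can be sketched as follows. This is illustrative pseudocode, not the actual implementation: the key idea is that the routing table resolves the "provider/" prefix of a model id to a provider instance.

```python
class OllamaProvider:
    def openai_chat_completion(self, model_id, messages):
        # A real provider would forward the request to the Ollama server
        # and stream the response back; here we just record the routing.
        return {"forwarded_to": "ollama", "model": model_id}

class RoutingTable:
    def __init__(self, providers):
        self.providers = providers  # provider_id -> provider impl

    def get_provider_impl(self, model_id):
        # Model ids are prefixed "provider_id/model_name".
        provider_id, _, _ = model_id.partition("/")
        return self.providers[provider_id]

table = RoutingTable({"ollama": OllamaProvider()})
provider = table.get_provider_impl("ollama/llama3.2:3b")
result = provider.openai_chat_completion("ollama/llama3.2:3b", [])
```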

Provider Architecture

Providers come in two types:

| Type | Example | How it works |
| --- | --- | --- |
| Inline (inline::) | inline::faiss, inline::sentence-transformers | Runs in the Llama Stack process |
| Remote (remote::) | remote::ollama, remote::openai | Adapts an external service |

Each provider declares which API it implements, its config class, and its dependencies. The provider registry (src/llama_stack/providers/registry/) lists all available providers per API.

Auto-Routing

Many APIs use automatic routing so multiple providers can serve different resources through the same API:

| Routing Table | Router | Purpose |
| --- | --- | --- |
| Api.models | Api.inference | Route to the correct inference provider per model |
| Api.shields | Api.safety | Route to the correct safety provider per shield |
| Api.vector_stores | Api.vector_io | Route to the correct vector store provider |
| Api.tool_groups | Api.tool_runtime | Route to the correct tool runtime |

This means you can have Ollama serving one model and OpenAI serving another, both accessible through the same /v1/chat/completions endpoint.
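From the client's perspective, that looks like the sketch below: two requests to the same URL, differing only in the model id. The endpoint, port, and model ids are assumptions for a default local setup.

```python
# Two providers behind one endpoint: only the model id changes per request.
ENDPOINT = "http://localhost:8321/v1/chat/completions"  # assumed default port

def build_request(model):
    return {
        "url": ENDPOINT,
        "body": {"model": model, "messages": [{"role": "user", "content": "hi"}]},
    }

local = build_request("ollama/llama3.2:3b")   # routed to the Ollama provider
hosted = build_request("openai/gpt-4o-mini")  # routed to the OpenAI provider
# Both requests hit the same URL; the server's routing table does the rest.
```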

Configuration

A run config YAML defines everything about a running instance:

```yaml
version: 2
distro_name: starter
providers:
  inference:
    - provider_id: ollama
      provider_type: remote::ollama
      config:
        base_url: ${env.OLLAMA_URL:=http://localhost:11434/v1}
    - provider_id: openai
      provider_type: remote::openai
      config:
        api_key: ${env.OPENAI_API_KEY}
```

Key features:

  • Environment variable substitution: ${env.VAR:=default}
  • Conditional providers: ${env.API_KEY:+provider_id} enables a provider only when a variable is set
  • Multiple providers per API: both Ollama and OpenAI can serve inference, each handling different models
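A hypothetical fragment combining these features: the ollama provider is always enabled with a default URL, while the openai provider is registered only when OPENAI_API_KEY is set. The exact expansion semantics follow the conditional syntax described above.

```yaml
providers:
  inference:
    # Always enabled; base_url falls back to the default when OLLAMA_URL is unset.
    - provider_id: ollama
      provider_type: remote::ollama
      config:
        base_url: ${env.OLLAMA_URL:=http://localhost:11434/v1}
    # Enabled only when OPENAI_API_KEY is set.
    - provider_id: ${env.OPENAI_API_KEY:+openai}
      provider_type: remote::openai
      config:
        api_key: ${env.OPENAI_API_KEY:=}
```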

Distributions

A distribution is a pre-configured run config for a target environment. Think of Kubernetes distributions (AKS, EKS, GKE): the API stays the same, but each distribution wires up different backends.

| Distribution | Use case |
| --- | --- |
| starter | General purpose; supports most providers |
| dell, nvidia | Hardware-specific optimizations |
| Custom | Build your own with llama stack build |

Storage

Llama Stack persists state (registered models, conversation history, vector stores) using pluggable storage backends:

| Backend | Use case |
| --- | --- |
| SQLite | Default; single-node development |
| PostgreSQL | Production deployments |
| Redis | Multi-node caching |

Storage is configured in the run config and shared across all providers.