
Claude Code Integration

Llama Stack can act as a backend for Claude Code, Anthropic's AI coding assistant. Point Claude Code at your Llama Stack server and use any model—from local open models (vLLM, Ollama) to cloud providers (OpenAI, Fireworks, Groq).

Quick Start

1. Start Llama Stack

# With any provider (examples)
export OPENAI_API_KEY="your-key-here"
llama stack run starter

# Or with vLLM
export VLLM_URL="http://localhost:8000/v1"
llama stack run starter

# Or with Ollama
export OLLAMA_URL="http://localhost:11434/v1"
llama stack run starter

2. Configure Claude Code

Point Claude Code at your Llama Stack server using the ANTHROPIC_BASE_URL environment variable:

export ANTHROPIC_BASE_URL="http://localhost:8321"
export ANTHROPIC_API_KEY="fake" # Not validated when using local providers

# Use Claude Code normally
claude "Write a hello world function in Python"

3. Test it out

# Simple query
claude "What is 2+2?"

# Code generation with file creation
claude "Create a Flask hello world app"

# Multi-turn conversation
claude "Write a quicksort in Rust"
claude "Add documentation and tests"

How It Works

Claude Code sends requests to the Anthropic Messages API (/v1/messages). Llama Stack implements this API, translating between formats as needed:

Claude Code  →  Llama Stack /v1/messages (translates if needed)  →  Provider (OpenAI, vLLM, etc.)
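
For reference, you can send a plain Anthropic-format request to the same endpoint Claude Code talks to. A minimal sketch, assuming the server is on the default port 8321 and openai/gpt-4o-mini is registered; the x-api-key value is a placeholder since local setups don't validate it:

curl http://localhost:8321/v1/messages \
  -H "Content-Type: application/json" \
  -H "x-api-key: fake" \
  -d '{
    "model": "openai/gpt-4o-mini",
    "max_tokens": 256,
    "messages": [{"role": "user", "content": "What is 2+2?"}]
  }'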

What gets translated:

  • Messages format: Anthropic → OpenAI format (when provider doesn't support Messages API natively)
  • Tool calls: Anthropic tool_use blocks → OpenAI tool_calls
  • Streaming: OpenAI SSE events → Anthropic format (message_start, content_block_delta, etc.)
  • Thinking blocks: Extended thinking support for models that provide it
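
As a rough sketch of the tool-call translation above (the field shapes follow the public Anthropic and OpenAI schemas; the IDs and the read_file tool are made-up examples):

# Anthropic tool_use content block emitted by the model
{"type": "tool_use", "id": "toolu_01", "name": "read_file", "input": {"path": "main.py"}}

# Equivalent OpenAI tool_calls entry after translation
{"id": "call_01", "type": "function", "function": {"name": "read_file", "arguments": "{\"path\": \"main.py\"}"}}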

Native passthrough (no translation needed):

  • Ollama with /v1/messages support
  • vLLM with Anthropic format support
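
If you want to check whether your local provider actually exposes /v1/messages (and can therefore be passed through without translation), you can probe it directly. A quick sketch against a local Ollama, assuming the default port and a pulled model; a 404 here simply means requests will take the translation path instead:

curl http://localhost:11434/v1/messages \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3.3:70b", "max_tokens": 64, "messages": [{"role": "user", "content": "ping"}]}'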

Model Configuration

Specifying Models

Always use the --model flag to specify which Llama Stack model to use:

# Use any provider model registered in Llama Stack
claude --model "openai/gpt-4o-mini" "Hello world"
claude --model "vllm/Qwen/Qwen3-8B" "Write code"
claude --model "ollama/llama3.3:70b" "Explain algorithms"

Automatic Model Aliasing

The starter distribution automatically registers Claude model name aliases across all providers. This handles Claude Code's internal requests seamlessly:

# Pre-configured in starter config.yaml
registered_resources:
  models:
  - model_id: claude-haiku-4-5-20251001
    provider_id: "all"         # Registers alias across ALL providers
    provider_model_id: "auto"  # Auto-maps to appropriate model
    model_type: llm

When Claude Code sends internal requests using Claude model names, these aliases automatically route to your specified model. You never need to reference the Claude model names directly—just use --model with your actual Llama Stack models.
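
To confirm the aliases are in place, you can list the registered models on the running server (a quick check, assuming the default port; this is the same /v1/models endpoint used for explicit registration later on this page):

curl -s http://localhost:8321/v1/models | python3 -m json.tool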

Supported Features

Core Capabilities

  • All Messages API features: Multi-turn conversations, system messages, streaming
  • Tool use: File operations, shell commands, code execution (via Claude Code's built-in tools)
  • Extended thinking: Thinking blocks for reasoning transparency
  • Token counting: /v1/messages/count_tokens endpoint
  • Prompt caching: When using providers that support it (Anthropic, Bedrock)
  • Any inference provider: OpenAI, vLLM, Ollama, Fireworks, Together, Groq, Bedrock, etc.
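
As an example of the token-counting endpoint listed above, a minimal sketch, assuming the default port and an OpenAI model registered through the starter distribution:

curl http://localhost:8321/v1/messages/count_tokens \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openai/gpt-4o-mini",
    "messages": [{"role": "user", "content": "How many tokens is this?"}]
  }'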

Provider-Specific Features

Different providers have different strengths when used with Claude Code:

Provider                           | Native Messages API | Thinking Support   | Prompt Caching | Notes
OpenAI                             | ❌ (translated)     | ⚠️ (via reasoning) |                | Works well, no translation overhead for responses
vLLM                               | ✅ (native)         |                    |                | Serves Messages API natively with compatible models
Ollama                             | ✅ (native)         |                    |                | Serves Messages API natively with compatible models
Bedrock, Fireworks, Groq, Together | ❌ (translated)     |                    |                | Works via OpenAI translation

Configuration Examples

Using OpenAI Models

export OPENAI_API_KEY="sk-..."
export ANTHROPIC_BASE_URL="http://localhost:8321"
export ANTHROPIC_DEFAULT_HAIKU_MODEL="openai/gpt-4o-mini"
export ANTHROPIC_DEFAULT_SONNET_MODEL="openai/gpt-4o"

llama stack run starter
claude "Implement a binary search tree"

Using vLLM with Qwen Models

# Start vLLM server
vllm serve Qwen/Qwen3-8B --api-key fake

# Start Llama Stack
export VLLM_URL="http://localhost:8000/v1"
export ANTHROPIC_BASE_URL="http://localhost:8321"
export ANTHROPIC_DEFAULT_HAIKU_MODEL="vllm/Qwen/Qwen3-8B"

llama stack run starter
claude "Write a Fibonacci function"

Using Ollama with Llama Models

# Start Ollama
ollama serve

# Pull a model
ollama pull llama3.3:70b

# Start Llama Stack
export OLLAMA_URL="http://localhost:11434/v1"
export ANTHROPIC_BASE_URL="http://localhost:8321"
export ANTHROPIC_DEFAULT_HAIKU_MODEL="ollama/llama3.3:70b"

llama stack run starter
claude "Explain quicksort"

Using Multiple Providers

You can configure different Claude model tiers to route to different providers:

# Fast model → local vLLM
export ANTHROPIC_DEFAULT_HAIKU_MODEL="vllm/Qwen/Qwen3-8B"

# Balanced model → OpenAI GPT-4o
export ANTHROPIC_DEFAULT_SONNET_MODEL="openai/gpt-4o"

# Most capable → OpenAI o1
export ANTHROPIC_DEFAULT_OPUS_MODEL="openai/o1"

# Claude Code will route based on which model name it sends
claude "Quick task" # Uses haiku → vLLM
claude --model claude-sonnet-4-5 "Complex task" # Uses sonnet → OpenAI GPT-4o

Troubleshooting

"Model not found" errors

Symptom: 404 Model 'claude-haiku-4-5-20251001' not found

Solution: Set the appropriate environment variable:

export ANTHROPIC_DEFAULT_HAIKU_MODEL="your-provider/your-model"
# Or explicitly pass --model to claude command
claude --model "openai/gpt-4o-mini" "your prompt"

"API key not valid" errors when using local providers

Symptom: Authentication errors even though you're using a local provider

Solution: Set a fake API key (not validated by llama-stack):

export ANTHROPIC_API_KEY="fake"

Slow responses with cloud providers

Symptom: Long latency when using OpenAI, Fireworks, etc.

Explanation: There's a double proxy overhead (Claude Code → llama-stack → provider). Consider using local providers (vLLM, Ollama) for better performance.

Tool use not working

Symptom: Claude Code can't execute shell commands or file operations

Explanation: Tool execution happens in Claude Code's runtime, not llama-stack. Ensure Claude Code has proper permissions and your model supports tool use.

Performance Considerations

Latency Breakdown

Total latency = Claude Code overhead + llama-stack processing + provider API call + translation overhead
  • Local providers (vLLM, Ollama): Minimal translation overhead, total latency dominated by inference
  • Cloud providers (OpenAI, Groq): Network round-trip is the bottleneck
  • Format translation: Adds ~5-20ms depending on message complexity
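
To get a rough end-to-end number for your own setup, timing a one-shot query captures all of these components together (a simple sketch; the model name is just an example):

time claude --model "openai/gpt-4o-mini" "What is 2+2?"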

Optimization Tips

  1. Use local providers when possible (vLLM, Ollama) to minimize network latency
  2. Enable prompt caching with providers that support it (e.g., Anthropic, Bedrock)
  3. Configure native Messages API support in vLLM/Ollama to skip translation overhead
  4. Use streaming (enabled by default in Claude Code) for faster perceived response times

Differences from Anthropic Claude

While llama-stack provides full Messages API compatibility, there are some behavioral differences when using alternative models:

Feature          | Anthropic Claude     | Open Models (via llama-stack)
Thinking blocks  | Native support       | Varies by model (e.g., OpenAI's o1-style reasoning models)
Prompt caching   | Available            | Only if provider supports it
Extended context | 200K+ tokens         | Depends on model (Qwen3: 32K, Llama3: 128K)
Tool use format  | Optimized for Claude | Translated to OpenAI format
Response quality | Claude-specific      | Depends on underlying model

Advanced Configuration

Custom Model Mappings

If you want more control over how Claude model names map to your providers, you can register models explicitly:

# Start llama stack
llama stack run starter

# Register models via API (after startup)
curl http://localhost:8321/v1/models \
  -H "Content-Type: application/json" \
  -d '{
    "model_id": "claude-haiku-4-5-20251001",
    "provider_id": "vllm",
    "provider_model_id": "Qwen/Qwen3-8B",
    "model_type": "llm"
  }'

Or add to your config.yaml:

registered_resources:
  models:
  - model_id: claude-haiku-4-5-20251001
    provider_id: vllm
    provider_model_id: Qwen/Qwen3-8B
    model_type: llm

Using with Claude Agent SDK

If you're building custom agents with the Claude Agent SDK, llama-stack works as a drop-in backend:

from claude_agent_sdk import Agent

agent = Agent(
    base_url="http://localhost:8321",
    api_key="fake",  # Not validated
    model="vllm/Qwen/Qwen3-8B",
)

response = agent.send("Write a function to parse CSV files")

Contributing

Found an issue or want to improve Claude Code integration? Contributions welcome: