New: Llama Stack is now OGX. Read the story.

Not a gateway.
The full stack.

Inference, vector stores, file storage, moderation, tool calling, and agentic orchestration — as a server or a Python library. Pluggable providers, any language, deploy anywhere.

Run as a server or import as a Python library (requires uv)

uvx --from 'ogx[starter]' ogx stack run starter
/v1/responses
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8321/v1", api_key="fake")
response = client.responses.create(
    model="llama-3.3-70b",
    input="Summarize this repository",
    tools=[{"type": "web_search"}],
)

Your tools. Any model.

Configure OGX with any provider — Ollama, vLLM, Bedrock, Azure, or your own. Then point Claude Code, Codex, or OpenCode at it. Same workflow, any model.

Claude Code → OGX
Codex → OGX
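
Each coding tool configures a custom endpoint in its own settings; the sketch below shows the underlying idea with the Anthropic Python SDK pointed at OGX's Anthropic-compatible /v1/messages route. It assumes the local quickstart server from above, and the model name is illustrative (it depends on which provider you configure).

from anthropic import Anthropic

# Any Anthropic-compatible client can target the OGX server's /v1/messages route.
client = Anthropic(base_url="http://localhost:8321", api_key="fake")

message = client.messages.create(
    model="llama-3.3-70b",  # whichever model your configured provider serves
    max_tokens=1024,
    messages=[{"role": "user", "content": "Explain this stack trace"}],
)
print(message.content[0].text)  # first content block of a plain text reply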

Everything your AI app needs. One process.

More than inference routing. OGX composes inference, storage, moderation, and orchestration into a single process — whether you run it as a server or import it as a library. Your agent can search a vector store, call a tool, apply moderation checks, and stream the response. No glue code. No sidecar services.
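
A rough sketch of that flow with the OpenAI SDK against the local quickstart server: a moderation check, then a single streaming Responses call that can search a vector store. The vector store ID is a placeholder, and the file_search tool shape follows the OpenAI Responses API (assumed to be supported here).

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8321/v1", api_key="fake")

# Moderation check on the incoming prompt.
prompt = "Summarize our internal runbooks for on-call engineers"
moderation = client.moderations.create(input=prompt)
if moderation.results[0].flagged:
    raise ValueError("prompt rejected by moderation")

# One Responses call that searches a vector store and streams the answer back.
stream = client.responses.create(
    model="llama-3.3-70b",
    input=prompt,
    tools=[{"type": "file_search", "vector_store_ids": ["vs_123"]}],  # placeholder ID
    stream=True,
)
for event in stream:
    if event.type == "response.output_text.delta":
        print(event.delta, end="", flush=True)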

Inference

/v1/chat/completions (Chat Completions)
/v1/responses (Responses)
/v1/embeddings (Embeddings)
/v1/models (Models)
/v1/messages (Messages, Anthropic-compatible)
/v1alpha/interactions (Interactions, Google-compatible)

Data

/v1/vector_stores (Vector Stores)
/v1/files (Files)
/v1/batches (Batches)

Moderation & Tools

/v1/moderations (Moderations)
/v1/tools (Tools)
/v1/connectors (Connectors)
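
A brief sketch of the data and moderation endpoints through the OpenAI SDK, again assuming the local quickstart server; the file name and vector store name are placeholders.

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8321/v1", api_key="fake")

# Upload a document and index it into a vector store.
uploaded = client.files.create(file=open("runbook.md", "rb"), purpose="assistants")
store = client.vector_stores.create(name="runbooks", file_ids=[uploaded.id])

# Run a moderation check against the configured safety backend.
verdict = client.moderations.create(input="Is this prompt acceptable?")
print(store.id, verdict.results[0].flagged)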
Full API reference

Server or library. Your call.

Deploy OGX as an HTTP server for production — any language, any client, standard API. Or import it directly as a Python library for scripts, notebooks, and rapid prototyping with zero network overhead.

Same capabilities either way. Start with the library, graduate to the server when you need multi-language access or independent scaling.

Server: POST /v1/responses (any language)
Library: client.responses.create(...) (zero overhead)
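
In library mode the call shape is the same, just in-process. A minimal sketch, assuming OGX exposes a Python client class for the embedded stack; the import path, constructor argument, and response attribute below are illustrative, not the confirmed API.

# Illustrative only: the import path and class name are assumptions, not the confirmed OGX API.
from ogx import LibraryClient

client = LibraryClient(config="starter")  # hypothetical; mirrors the "starter" distribution above
response = client.responses.create(
    model="llama-3.3-70b",
    input="Summarize this repository",
)
print(response.output_text)  # assuming an OpenAI-style Response object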

23 inference providers. 13 vector stores. 7 safety backends.

Develop locally with Ollama. Deploy to production with vLLM. Wrap Bedrock or Vertex without lock-in. Same API surface, different backend.
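
Switching backends does not change application code; only the model identifier registered by your configured provider changes. A sketch with illustrative identifiers:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8321/v1", api_key="fake")

# Same call against different configured backends; the model identifiers below are
# illustrative and depend on which providers your OGX distribution registers.
for model in ("ollama/llama3.2:3b", "vllm/llama-3.3-70b"):
    reply = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": "ping"}],
    )
    print(model, "->", reply.choices[0].message.content)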

All providers

How it works

Your application talks to one process — either an HTTP server or an in-process library client. That process routes to pluggable providers for inference, vector storage, files, moderation, and tools. The composition happens at the OGX level, not in your application code.

OGX Architecture

Open source

Apache 2.0 licensed. Contributions welcome.