New: Llama Stack is now OGX. Read the story.

Not a gateway.
The full stack.

Inference, vector stores, file storage, moderation, tool calling, and agentic orchestration — as a server or a Python library. Pluggable providers, any language, deploy anywhere.

Run as a server or import as a Python library (requires uv)

uvx --from 'ogx[starter]' ogx stack run starter
/v1/responses
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8321/v1", api_key="fake")
response = client.responses.create(
    model="llama-3.3-70b",
    input="Summarize this repository",
    tools=[{"type": "web_search"}],
)

Your tools. Any model.

Configure OGX with any provider — Ollama, vLLM, Bedrock, Azure, or your own. Then point Claude Code, Codex, or OpenCode at it. Same workflow, any model.

Claude Code → OGX
Codex → OGX
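
Each coding tool configures a custom endpoint in its own settings; the sketch below shows the underlying idea with the Anthropic Python SDK pointed at OGX's Anthropic-compatible /v1/messages route. It assumes the local quickstart server from above, and the model name is illustrative (it depends on which provider you configure).

from anthropic import Anthropic

# Any Anthropic-compatible client can target the OGX server's /v1/messages route.
client = Anthropic(base_url="http://localhost:8321", api_key="fake")

message = client.messages.create(
    model="llama-3.3-70b",  # whichever model your configured provider serves
    max_tokens=1024,
    messages=[{"role": "user", "content": "Explain this stack trace"}],
)
print(message.content[0].text)  # first content block of a plain text reply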

Everything your AI app needs. One process.

More than inference routing. OGX composes inference, storage, moderation, and orchestration into a single process — whether you run it as a server or import it as a library. Your agent can search a vector store, call a tool, apply moderation checks, and stream the response. No glue code. No sidecar services.
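
A rough sketch of that flow with the OpenAI SDK against the local quickstart server: a moderation check, then a single streaming Responses call that can search a vector store. The vector store ID is a placeholder, and the file_search tool shape follows the OpenAI Responses API (assumed to be supported here).

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8321/v1", api_key="fake")

# Moderation check on the incoming prompt.
prompt = "Summarize our internal runbooks for on-call engineers"
moderation = client.moderations.create(input=prompt)
if moderation.results[0].flagged:
    raise ValueError("prompt rejected by moderation")

# One Responses call that searches a vector store and streams the answer back.
stream = client.responses.create(
    model="llama-3.3-70b",
    input=prompt,
    tools=[{"type": "file_search", "vector_store_ids": ["vs_123"]}],  # placeholder ID
    stream=True,
)
for event in stream:
    if event.type == "response.output_text.delta":
        print(event.delta, end="", flush=True)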

Inference

/v1/chat/completions (Chat Completions)
/v1/responses (Responses)
/v1/embeddings (Embeddings)
/v1/models (Models)
/v1/messages (Messages, Anthropic-compatible)
/v1alpha/interactions (Interactions, Google-compatible)

Data

/v1/vector_stores (Vector Stores)
/v1/files (Files)
/v1/batches (Batches)

Moderation & Tools

/v1/moderations (Moderations)
/v1/tools (Tools)
/v1/connectors (Connectors)
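
A brief sketch of the data and moderation endpoints through the OpenAI SDK, again assuming the local quickstart server; the file name and vector store name are placeholders.

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8321/v1", api_key="fake")

# Upload a document and index it into a vector store.
uploaded = client.files.create(file=open("runbook.md", "rb"), purpose="assistants")
store = client.vector_stores.create(name="runbooks", file_ids=[uploaded.id])

# Run a moderation check against the configured safety backend.
verdict = client.moderations.create(input="Is this prompt acceptable?")
print(store.id, verdict.results[0].flagged)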
Full API reference

Server or library. Your call.

Deploy OGX as an HTTP server for production — any language, any client, standard API. Or import it directly as a Python library for scripts, notebooks, and rapid prototyping with zero network overhead.

Same capabilities either way. Start with the library, graduate to the server when you need multi-language access or independent scaling.

Server: POST /v1/responses (any language)
Library: client.responses.create(...) (zero overhead)
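
In library mode the call shape is the same, just in-process. A minimal sketch, assuming OGX exposes a Python client class for the embedded stack; the import path, constructor argument, and response attribute below are illustrative, not the confirmed API.

# Illustrative only: the import path and class name are assumptions, not the confirmed OGX API.
from ogx import LibraryClient

client = LibraryClient(config="starter")  # hypothetical; mirrors the "starter" distribution above
response = client.responses.create(
    model="llama-3.3-70b",
    input="Summarize this repository",
)
print(response.output_text)  # assuming an OpenAI-style Response object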

23 inference providers. 13 vector stores. 7 safety backends.

Develop locally with Ollama. Deploy to production with vLLM. Wrap Bedrock or Vertex without lock-in. Same API surface, different backend.
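
Switching backends does not change application code; only the model identifier registered by your configured provider changes. A sketch with illustrative identifiers:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8321/v1", api_key="fake")

# Same call against different configured backends; the model identifiers below are
# illustrative and depend on which providers your OGX distribution registers.
for model in ("ollama/llama3.2:3b", "vllm/llama-3.3-70b"):
    reply = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": "ping"}],
    )
    print(model, "->", reply.choices[0].message.content)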

All providers

How it works

Your application talks to one process — either an HTTP server or an in-process library client. That process routes to pluggable providers for inference, vector storage, files, moderation, and tools. The composition happens at the OGX level, not in your application code.

OGX Architecture

Open source

Apache 2.0 licensed. Contributions welcome.