
OGX 1.0: The Open Agentic API Server is Production-Ready

5 min read
OGX Team
Core Team

Two weeks ago, we told you the name changed. Today, we're telling you it's done.

OGX 1.0 is a server that replaces the OpenAI API with something you own. Point your existing OpenAI, Anthropic, or Google SDK at it. Run any model on any infrastructure. Get server-side agentic orchestration, built-in RAG, MCP tool integration, multi-tenancy, and production observability out of the box. No vendor lock-in. No code changes.

This is not a beta. This is not "production-ready with caveats." This is v1.

The shortest version

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8321/v1", api_key="any")

response = client.responses.create(
    model="gpt-4o",
    input="Find all documents about Q3 revenue and summarize the key trends.",
    tools=[
        {"type": "file_search", "vector_store_ids": ["vs_finance"]},
        {
            "type": "mcp",
            "server_label": "analytics",
            "server_url": "http://analytics:8000/sse",
        },
    ],
)

That's a real agentic workflow. The server searches your vector store, calls your MCP tools, reasons over the results, and returns an answer. One API call. Your code stays simple. Swap gpt-4o for claude-sonnet-4-6 or llama-3.3-70b and nothing else changes.
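Concretely, continuing the snippet above, swapping the backend is a one-string change. The Llama identifier below is illustrative; the exact name depends on which provider you have configured:

# Same call, different model; nothing else in the request changes.
response = client.responses.create(
    model="llama-3.3-70b",  # illustrative identifier for a locally served model
    input="Find all documents about Q3 revenue and summarize the key trends.",
    tools=[{"type": "file_search", "vector_store_ids": ["vs_finance"]}],
)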

What got us here

OGX started as Llama Stack, an API standardization effort built around Llama models. It grew into something different: a server-side agentic loop that speaks the native API of every major frontier lab. Along the way, we made hard choices that v1 reflects.

We killed our own APIs. The fine-tuning API, the agents API, the knowledge_search tool, the meta-reference naming, the TGI and HuggingFace providers. Gone. Every one of them was replaced with something that aligns to existing industry standards or simplifies the system. The agents API became the Responses API. knowledge_search became file_search. We adopted OpenAI's terminology not because OpenAI is the standard, but because most developers already know it. Meeting developers where they are beats inventing new names.

We earned our compliance scores. OGX passes the Open Responses conformance test suite at 100%. The OpenAI API conformance score sits above 91%. These are not aspirational targets. They are tested on every commit.

We shipped the hard infrastructure. Multi-tenancy with attribute-based access control. Structured observability with per-request metrics, token throughput tracking, and model-level latency. A gateway-first architecture that delegates rate limiting and CORS to your infrastructure layer where it belongs. These are the features that separate a project from a product.

239 contributors across the project's lifetime. 23 inference providers. 21 vector store backends. And a rename that touched 1,696 files because we decided the name should match the mission.

Three SDKs, one server

Most "OpenAI-compatible" servers give you /v1/chat/completions and call it a day. OGX implements three API surfaces natively:

# OpenAI SDK
from openai import OpenAI

openai_client = OpenAI(base_url="http://localhost:8321/v1", api_key="any")

# Anthropic SDK
from anthropic import Anthropic

anthropic_client = Anthropic(base_url="http://localhost:8321/v1", api_key="any")

# Google GenAI SDK
from google import genai
from google.genai import types

google_client = genai.Client(
    api_key="any",
    http_options=types.HttpOptions(
        base_url="http://localhost:8321",
        api_version="v1alpha",
    ),
)

All three hit the same inference providers. The same OGX server can serve a team using the OpenAI SDK, another team using Anthropic's, and a third using Google's. Simultaneously. Against the same model running on the same GPU.

This decouples two decisions that used to be welded together: which SDK your team prefers and which model you deploy. Use the Anthropic SDK with Ollama. Use the Google SDK with vLLM. Use the OpenAI SDK with Bedrock. The server translates. Your code doesn't change.
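As a sketch of that decoupling, here is the Anthropic SDK driving a locally served Llama model through OGX. The model identifier is illustrative and depends on your provider configuration:

from anthropic import Anthropic

# Anthropic SDK pointed at OGX; the model is served by a local provider
# (e.g. Ollama or vLLM), not by Anthropic.
client = Anthropic(base_url="http://localhost:8321/v1", api_key="any")

message = client.messages.create(
    model="llama-3.3-70b",  # illustrative; use whatever your provider registers
    max_tokens=512,
    messages=[{"role": "user", "content": "Summarize our Q3 revenue trends."}],
)
print(message.content[0].text)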

Coding assistants work too

OGX is not just for app SDKs. You can also point coding assistants at the same OGX server:

# Claude Code -> Anthropic Messages API via OGX
export ANTHROPIC_BASE_URL="http://localhost:8321"
export ANTHROPIC_API_KEY="fake"

# Codex CLI -> OpenAI Responses API via OGX (Codex config.toml)
model = "openai/gpt-4o"
model_provider = "ogx"

[model_providers.ogx]
name = "OpenAI"
base_url = "http://localhost:8321/v1"
wire_api = "responses"
supports_websockets = false

That means Claude Code, Codex CLI, your OpenAI SDK apps, and your Anthropic SDK apps can all share one OGX deployment and one provider layer.

See: Claude Code integration and Codex CLI integration.

The agentic loop, server-side

Before the Responses API, building an agent meant writing a client-side orchestration loop. Call the model. Check if it wants a tool. Execute the tool. Send results back. Repeat. Every application reimplemented this. Every implementation had its own bugs.

OGX moves that loop to the server. You send a question and a set of tools. The server handles planning, tool execution, and synthesis internally. Your client gets a final answer.
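For contrast, here is roughly the loop you used to own on the client, sketched with plain chat completions and a hypothetical local tool (the tool and its revenue figure are made up for illustration):

import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8321/v1", api_key="any")

# Hypothetical local tool the model can call.
def get_quarterly_revenue(quarter: str) -> str:
    return json.dumps({"quarter": quarter, "revenue_usd": 1_250_000})

tools = [{
    "type": "function",
    "function": {
        "name": "get_quarterly_revenue",
        "description": "Return revenue figures for a quarter, e.g. 'Q3'.",
        "parameters": {
            "type": "object",
            "properties": {"quarter": {"type": "string"}},
            "required": ["quarter"],
        },
    },
}]

messages = [{"role": "user", "content": "How did Q3 revenue look?"}]
while True:
    resp = client.chat.completions.create(model="gpt-4o", messages=messages, tools=tools)
    msg = resp.choices[0].message
    if not msg.tool_calls:
        print(msg.content)          # final answer; loop is done
        break
    messages.append(msg)            # keep the assistant's tool request in context
    for call in msg.tool_calls:
        args = json.loads(call.function.arguments)
        result = get_quarterly_revenue(**args)
        messages.append({"role": "tool", "tool_call_id": call.id, "content": result})

With the Responses API, that entire loop collapses into the single call shown at the top of this post.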

What the server handles for you:

  • Built-in RAG via file_search. Upload documents, create a vector store, and the server searches, retrieves, and grounds responses automatically. No external RAG pipeline needed (a setup sketch follows this list).
  • MCP integration. Connect any MCP server and the agent discovers and uses its tools. Server-side tool orchestration means your client code doesn't need to know about tool schemas.
  • Multi-step reasoning. The server chains tool calls, not just single-shot inference. Complex workflows that would require dozens of lines of client code happen in one API call.
  • Conversation state. Persistent context across interactions without client-side state management. The server handles it.
  • Reasoning output. Models that expose thinking traces return them as part of the response. You see the reasoning, not just the answer.
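Here is a sketch of that file_search setup through the OpenAI SDK. The file name and store name are placeholders, and depending on your SDK version the vector store methods may live under client.beta instead:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8321/v1", api_key="any")

# Upload a document and index it into a vector store (names are illustrative).
report = client.files.create(file=open("q3_report.pdf", "rb"), purpose="assistants")
store = client.vector_stores.create(name="finance")
client.vector_stores.files.create(vector_store_id=store.id, file_id=report.id)

# Ground a response in the uploaded document via file_search.
response = client.responses.create(
    model="gpt-4o",
    input="Summarize the key revenue trends in the Q3 report.",
    tools=[{"type": "file_search", "vector_store_ids": [store.id]}],
)
print(response.output_text)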

Run anywhere, same API

OGX has a pluggable provider architecture. Develop locally with Ollama, deploy to production with vLLM, connect to a managed service when you need to, and the API never changes.

Commercial APIs: OpenAI, Anthropic, Google (Gemini + Vertex AI), Azure OpenAI

Self-hosted inference: Ollama, vLLM, llama.cpp

Cloud platforms: AWS Bedrock, Databricks, NVIDIA NIM, IBM WatsonX, Oracle OCI

Specialized: Fireworks, Together, Groq, Cerebras, SambaNova, RunPod

Vector stores: FAISS, SQLite-vec, Qdrant, Milvus, Chroma, PGVector, Elasticsearch, Infinispan, and more

File processing: MarkItDown, Docling, pypdf. Upload a PDF, Word doc, or text file and OGX converts it to embeddings for your vector store.

Guardrails: Configure moderation_endpoint on the builtin responses provider, then pass extra_body={"guardrails": True} in Responses requests. OGX calls your moderation service inline during generation and blocks unsafe content when violations are flagged.
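On the request side, that looks like this (a minimal sketch; the model and prompt are placeholders, and the moderation endpoint itself is configured server-side):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8321/v1", api_key="any")

# Guardrails are enabled per-request; OGX calls your configured moderation
# service inline and blocks the response if a violation is flagged.
response = client.responses.create(
    model="gpt-4o",
    input="Draft a reply to this customer complaint.",
    extra_body={"guardrails": True},
)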

Configuration is YAML with environment variable substitution. Conditional provider activation means you can declare twenty providers in your config and only the ones with credentials present will start. Switch from development to production by changing environment variables, not code.

Production infrastructure, not a prototype

v1 ships features that matter when you're running OGX for real:

Multi-tenancy. Attribute-based access control on files, vector stores, conversations, and responses. Tenant isolation is built into the storage layer. Run one OGX server for multiple teams or customers.

Observability. Request counts, latency histograms, error rates, token throughput, model-level performance metrics. All exposed via OpenTelemetry. Plug into Prometheus, Grafana, Jaeger, or whatever your ops team already uses.

Gateway-first architecture. OGX delegates rate limiting, CORS, and TLS termination to your infrastructure gateway (Envoy, Kong, Istio, whatever you run). The server focuses on what it's good at: inference routing, agentic orchestration, and API translation.

Structured logging. Every log line is structured key-value via structlog. Parse it, index it, alert on it. No more regex over log files.

Background tasks. Queue long-running responses and cancel them if you need to. Production workloads need job control, not just fire-and-forget.
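Assuming OGX mirrors the OpenAI background-response surface (the background flag plus retrieve and cancel), job control looks roughly like this:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8321/v1", api_key="any")

# Queue a long-running response instead of blocking on it.
job = client.responses.create(
    model="gpt-4o",
    input="Audit every Q3 document for revenue discrepancies.",
    background=True,
)

# Poll for completion, or cancel if the result is no longer needed.
status = client.responses.retrieve(job.id).status
if status in ("queued", "in_progress"):
    client.responses.cancel(job.id)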

Get started in 60 seconds

uvx --from 'ogx[starter]' ogx run starter

Then point any OpenAI client at http://localhost:8321/v1 and go.

For production deployments, see the distribution documentation. Distributions are pre-built configurations for specific environments (NVIDIA NIM, IBM WatsonX, Oracle OCI), or you can build your own.

The road from here

v1 is a foundation, not a finish line. Here's where we're headed:

  • Deeper multi-SDK coverage. Expanding Anthropic and Google API support beyond basic inference to full tool calling and agentic features.
  • Library mode. Embed OGX in-process for latency-sensitive applications that don't need the HTTP layer.
  • More agentic patterns. Richer orchestration primitives, better conversation management, more built-in tools.
  • Performance. Faster inference routing, more efficient agentic loops, better resource utilization across providers.

This is the beginning

OGX started as a set of API specs around one model family. It became a server that speaks every frontier lab's API, runs any model on any infrastructure, and handles the hard parts of agentic orchestration so your application code stays clean.

v1 means we're confident enough to put a number on it. The APIs are stable. The provider ecosystem is mature. The production infrastructure is real. 239 contributors believed in this enough to ship code.

If you've been waiting for the right time to try OGX, this is it.

Get started | Documentation | GitHub | Discord

--- The OGX Team