RAG Benchmarks

OGX implements an OpenAI-compatible API surface for RAG — Files, Vector Stores, and the Responses API with the file_search tool. This page presents benchmark results comparing OGX's RAG quality against OpenAI SaaS, and describes the evaluation methodology.

Summary of Results

We evaluated OGX against OpenAI across four benchmark suites covering retrieval accuracy, end-to-end answer quality, and multi-turn conversational RAG. OGX was tested in two search modes: vector (pure semantic search) and hybrid (semantic + keyword search with reranking). We also tested with an open-source generation model (Gemma 4 31B-IT via vLLM) to validate OGX's model-swappable architecture.

Retrieval Quality (BEIR)

| Dataset  | Corpus Size | Queries | OpenAI | OGX (vector) | OGX (hybrid) |
|----------|-------------|---------|--------|--------------|--------------|
| nfcorpus | 3,633       | 323     | 0.3156 | 0.3106       | 0.3350       |
| scifact  | 5,183       | 300     | 0.7165 | 0.6943       | 0.7137       |
| arguana  | 8,674       | 1,406   | 0.2960 | 0.3765       | 0.3835       |
| fiqa     | 57,638      | 648     | 0.2862 | 0.2399       | 0.2170       |

Metric: nDCG@10 (normalized Discounted Cumulative Gain at rank 10). Higher is better.
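
For reference, one standard formulation (implementations differ in the gain function: trec_eval-style tools use the linear gain $rel_i$, while others use $2^{rel_i} - 1$):

$$
\mathrm{DCG@10} = \sum_{i=1}^{10} \frac{rel_i}{\log_2(i+1)},
\qquad
\mathrm{nDCG@10} = \frac{\mathrm{DCG@10}}{\mathrm{IDCG@10}}
$$

where $rel_i$ is the graded relevance of the document at rank $i$, and IDCG@10 is the DCG@10 of the ideal (relevance-sorted) ranking.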

End-to-End RAG Quality

| Benchmark    | Type                       | OpenAI (F1) | OGX vector (F1) | OGX hybrid (F1) | OGX hybrid + Gemma 31B (F1) |
|--------------|----------------------------|-------------|-----------------|-----------------|-----------------------------|
| MultiHOP RAG | Multi-hop reasoning        | 0.0114      | 0.0141          | 0.0141          | 0.0207                      |
| Doc2Dial     | Document-grounded dialogue | 0.1337      | 0.0962          | 0.0966          | 0.0634                      |

Metric: Token-level F1 (SQuAD-style). Higher is better. OpenAI and OGX vector/hybrid used GPT-4.1 for generation. The Gemma 31B column uses google/gemma-4-31B-it served via vLLM.

Key Takeaways

  • Hybrid search matches or beats OpenAI on 3 of 4 BEIR datasets (nfcorpus, scifact, arguana), demonstrating that OGX's combination of semantic search, keyword search, and reranking is competitive with OpenAI's proprietary retrieval.
  • OGX outperforms OpenAI on arguana by 29.6% (0.3835 vs 0.2960), a dataset that benefits from keyword matching on argumentative text.
  • fiqa is the outlier: OpenAI leads by 19.3% on this financial QA dataset (0.2862 vs 0.2399), likely due to differences in embedding models and chunking strategies for longer financial documents.
  • End-to-end RAG scores are low across all backends, reflecting the difficulty of these benchmarks (multi-hop reasoning, document-grounded dialogue) rather than a RAG-specific weakness.
  • Open-source models plug in without code changes. Gemma 4 31B-IT, served via vLLM, produced coherent answers across both Doc2Dial (1,203 queries, zero empty responses) and MultiHOP RAG (2,556 queries). On MultiHOP, Gemma 31B outperforms GPT-4.1 by +47% F1 — the only benchmark where the open-source model beats the proprietary one.

Additional Retrieval Metrics

Recall@10

| Dataset  | OpenAI | OGX vector | OGX hybrid |
|----------|--------|------------|------------|
| nfcorpus | 0.1469 | 0.1482     | 0.1646     |
| scifact  | 0.8067 | 0.8369     | 0.8362     |
| arguana  | 0.6764 | 0.7610     | 0.7781     |
| fiqa     | 0.3117 | 0.2843     | 0.2681     |

MAP@10

| Dataset  | OpenAI | OGX vector | OGX hybrid |
|----------|--------|------------|------------|
| nfcorpus | 0.1208 | 0.1153     | 0.1286     |
| scifact  | 0.6818 | 0.6442     | 0.6697     |
| arguana  | 0.1801 | 0.2542     | 0.2578     |
| fiqa     | 0.2319 | 0.1828     | 0.1593     |

Additional End-to-End Metrics

MultiHOP RAG

Multi-hop reasoning over 609 news articles, 2,556 queries.

| Metric      | OpenAI | OGX vector | OGX hybrid | OGX hybrid + Gemma 31B |
|-------------|--------|------------|------------|------------------------|
| F1          | 0.0114 | 0.0141     | 0.0141     | 0.0207                 |
| Exact Match | 0.0    | 0.0        | 0.0        | 0.0004                 |
| ROUGE-L     | 0.0116 | 0.0147     | 0.0147     | 0.0203                 |

Gemma 31B outperforms GPT-4.1 on MultiHOP by +47% F1, suggesting its more verbose, synthesized responses better capture multi-hop reasoning across documents.

Doc2Dial

Document-grounded dialogue: 488 documents, 200 conversations, 1,203 total turns.

| Metric      | OpenAI | OGX vector | OGX hybrid | OGX hybrid + Gemma 31B |
|-------------|--------|------------|------------|------------------------|
| F1          | 0.1337 | 0.0962     | 0.0966     | 0.0634                 |
| Exact Match | 0.0    | 0.0        | 0.0        | 0.0                    |
| ROUGE-L     | 0.1136 | 0.0790     | 0.0794     | 0.0513                 |

Gemma 31B's lower F1/ROUGE-L scores are driven by response verbosity (~2,500 chars avg vs ~95 char ground truths), not retrieval failure. The model produced zero empty responses across all 1,203 queries.

Analysis

Where OGX wins

  • arguana (+29.6% nDCG@10): Argumentative text benefits from hybrid search — keyword matching catches specific argument patterns that pure semantic search misses.
  • nfcorpus (+6.1% nDCG@10): Biomedical domain similarly benefits from hybrid search, where exact term matching (drug names, conditions) complements semantic similarity.
  • scifact: Effectively tied (0.7165 vs 0.7137, within 0.4%).

Where OpenAI wins

  • fiqa (+19.3% nDCG@10): The largest corpus (57K docs) with financial domain text. OpenAI's proprietary embedding model likely handles financial terminology better than nomic-embed-text-v1.5, which is a general-purpose model.
  • Doc2Dial (+39% F1): Dialogue grounding depends on retrieving short, precise passages, and OpenAI's retrieval system handles this better. This is the biggest quality gap.

Vector vs Hybrid on OGX

| Dataset  | Vector nDCG@10 | Hybrid nDCG@10 | Winner          |
|----------|----------------|----------------|-----------------|
| nfcorpus | 0.3106         | 0.3350         | Hybrid (+7.9%)  |
| scifact  | 0.6943         | 0.7137         | Hybrid (+2.8%)  |
| arguana  | 0.3765         | 0.3835         | Hybrid (+1.9%)  |
| fiqa     | 0.2399         | 0.2170         | Vector (+10.6%) |

Hybrid search outperforms vector on 3 of 4 BEIR datasets. The exception is fiqa, where keyword search may add noise for financial opinion queries.

Open-source model (Gemma 31B)

Gemma 4 31B-IT was served via vLLM and connected to OGX as a remote::openai inference provider, using the same retrieval pipeline as the GPT-4.1 runs. On MultiHOP RAG, Gemma 31B outperforms GPT-4.1 by +47% F1 (0.0207 vs 0.0141), suggesting its more verbose, synthesized responses better capture multi-hop reasoning. On Doc2Dial, lower F1/ROUGE-L scores vs GPT-4.1 are driven by response verbosity (avg ~2,500 chars vs ~95 char ground truths), not retrieval failure — the model produced zero empty responses across all 1,203 queries. This demonstrates OGX's model-swappable architecture: open-source models can be plugged in without any code changes.
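
vLLM exposes an OpenAI-compatible endpoint (for example, started with vllm serve), so the provider swap is purely configuration. A minimal sketch of what a generation call against that endpoint looks like; the URL, port, and prompt are placeholders:

from openai import OpenAI

# vLLM's server speaks the same chat/completions protocol as OpenAI,
# so identical client code works with only a different base_url.
vllm = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
out = vllm.chat.completions.create(
    model="google/gemma-4-31B-it",
    messages=[{"role": "user", "content": "Summarize the retrieved passages."}],
)
print(out.choices[0].message.content)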

Generation quality observations

  • All end-to-end benchmarks show low absolute scores (F1 < 0.15), which is consistent with prior work on these datasets.
  • Exact Match is 0.0 across all backends — the model generates verbose answers while ground truths are often short extractive spans.
  • For GPT-4.1 runs, answer quality differences reflect retrieval quality differences, not generation differences. The Gemma 31B results show that model choice matters: verbosity hurts on Doc2Dial (short ground truths) but helps on MultiHOP (multi-document synthesis).

Methodology

API Surface Tested

The benchmark suite exercises four layers of the OpenAI-compatible API:

  1. Files API (POST /v1/files) — Upload documents as individual files
  2. Vector Stores API (POST /v1/vector_stores, POST /v1/vector_stores/{id}/files) — Create vector stores and attach files for automatic chunking and embedding
  3. Vector Stores Search API (POST /v1/vector_stores/{id}/search) — Direct retrieval evaluation (BEIR)
  4. Responses API (POST /v1/responses with file_search tool) — End-to-end RAG with automatic retrieval and generation (MultiHOP, Doc2Dial)

The same benchmark code runs against both OpenAI and OGX — the only difference is the --base-url flag.
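
For illustration, a minimal sketch of all four layers using the official openai Python client (assuming a recent SDK where vector_stores and responses are top-level); the file name, store name, and queries are placeholders:

from openai import OpenAI

# Point at OpenAI (https://api.openai.com/v1) or OGX (http://localhost:8321/v1).
client = OpenAI(base_url="http://localhost:8321/v1")

# 1. Files API: upload a document.
uploaded = client.files.create(file=open("doc.txt", "rb"), purpose="assistants")

# 2. Vector Stores API: create a store and attach the file
#    (chunking and embedding happen server-side).
store = client.vector_stores.create(name="benchmark-corpus")
client.vector_stores.files.create(vector_store_id=store.id, file_id=uploaded.id)

# 3. Vector Stores Search API: direct retrieval (BEIR).
hits = client.vector_stores.search(vector_store_id=store.id, query="example query")

# 4. Responses API with file_search: end-to-end RAG (MultiHOP, Doc2Dial).
response = client.responses.create(
    model="gpt-4.1",
    input="What does the corpus say about X?",
    tools=[{"type": "file_search", "vector_store_ids": [store.id]}],
)
print(response.output_text)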

OGX Configuration

| Component        | Configuration |
|------------------|---------------|
| Embedding model  | nomic-ai/nomic-embed-text-v1.5 (sentence-transformers) |
| Reranker model   | Qwen/Qwen3-Reranker-0.6B (transformers) |
| Vector database  | Milvus (standalone, remote) |
| Chunk size       | 512 tokens |
| Chunk overlap    | 128 tokens |
| Hybrid search    | RRF fusion (impact factor 60.0) with reranker |
| Generation model | GPT-4.1 (via remote::openai provider); google/gemma-4-31B-it (via vLLM) |
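
Hybrid mode fuses the semantic and keyword rankings with Reciprocal Rank Fusion before reranking. A minimal sketch of RRF, assuming the impact factor in the table is the standard RRF k constant (OGX's actual fusion code may differ):

def rrf_fuse(rankings: list[list[str]], k: float = 60.0) -> list[str]:
    """Fuse ranked lists of document IDs with Reciprocal Rank Fusion.

    Each document scores sum(1 / (k + rank)) across the lists it appears
    in, with 1-based ranks; higher fused scores rank first.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Fuse a semantic ranking with a keyword ranking.
print(rrf_fuse([["d3", "d1", "d7"], ["d1", "d9", "d3"]]))  # d1 and d3 rise to the top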

Benchmarks

BEIR (Retrieval-Only)

BEIR is a standard information retrieval benchmark. We evaluate on four datasets spanning biomedical, scientific, argumentative, and financial domains:

  • nfcorpus — Biomedical information retrieval (3,633 documents, 323 queries)
  • scifact — Scientific fact verification (5,183 documents, 300 queries)
  • arguana — Counterargument retrieval (8,674 documents, 1,406 queries)
  • fiqa — Financial opinion QA (57,638 documents, 648 queries)

BEIR benchmarks are retrieval-only — no LLM is involved. Documents are uploaded via the Files API, indexed in a Vector Store, and queries are evaluated using the Vector Stores Search API. Results are scored with pytrec_eval against BEIR's ground-truth relevance judgments.
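
A condensed sketch of the scoring step, assuming qrels and the retrieval run are already in pytrec_eval's nested-dict format (query ID to document ID to relevance judgment or score):

import pytrec_eval

# Ground-truth judgments and the system's retrieval scores (toy data).
qrels = {"q1": {"doc1": 1, "doc2": 0}}
run = {"q1": {"doc1": 0.92, "doc2": 0.35, "doc9": 0.10}}

evaluator = pytrec_eval.RelevanceEvaluator(qrels, {"ndcg_cut.10", "recall.10", "map_cut.10"})
per_query = evaluator.evaluate(run)

# Average each metric over queries.
for measure in ("ndcg_cut_10", "recall_10", "map_cut_10"):
    mean = sum(q[measure] for q in per_query.values()) / len(per_query)
    print(measure, round(mean, 4))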

MultiHOP RAG (End-to-End)

MultiHOP RAG tests multi-hop reasoning over news articles. The system must retrieve evidence from multiple documents and synthesize an answer. Queries are sent to the Responses API with the file_search tool, and answers are evaluated with Exact Match, token-level F1, and ROUGE-L.

Doc2Dial (Document-Grounded Dialogue)

Doc2Dial evaluates document-grounded dialogue across government and social service domains. Each conversation is a multi-turn exchange between a user and an agent, grounded in a specific document. Conversations are threaded using previous_response_id to maintain context across turns, matching how production chat applications work.
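
A minimal sketch of that threading pattern; the vector store ID and user turns are placeholders:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8321/v1")
tools = [{"type": "file_search", "vector_store_ids": ["vs_doc2dial"]}]

previous_id = None
for user_turn in ["How do I renew my license?", "Which documents do I need?"]:
    response = client.responses.create(
        model="gpt-4.1",
        input=user_turn,
        tools=tools,
        previous_response_id=previous_id,  # carries prior turns forward
    )
    previous_id = response.id
    print(response.output_text)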

Metrics

| Metric      | Used In            | Description |
|-------------|--------------------|-------------|
| nDCG@10     | BEIR               | Normalized Discounted Cumulative Gain at rank 10; measures ranking quality |
| Recall@10   | BEIR               | Fraction of relevant documents retrieved in the top 10 |
| MAP@10      | BEIR               | Mean Average Precision at rank 10 |
| Exact Match | MultiHOP, Doc2Dial | Whether the prediction exactly matches the ground truth |
| F1          | MultiHOP, Doc2Dial | Token-level F1 (SQuAD-style precision/recall on answer tokens) |
| ROUGE-L     | MultiHOP, Doc2Dial | Longest common subsequence overlap between prediction and ground truth |
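
For reference, a minimal sketch of SQuAD-style token F1 (whitespace tokenization only; the official SQuAD scorer additionally lowercases and strips punctuation and articles):

from collections import Counter

def token_f1(prediction: str, ground_truth: str) -> float:
    """Harmonic mean of token precision and recall over shared tokens."""
    pred_tokens = prediction.split()
    gold_tokens = ground_truth.split()
    num_same = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

# Verbose predictions keep recall but dilute precision, capping F1.
print(token_f1("the renewal fee is 40 dollars in total", "40 dollars"))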

Running the Benchmarks

The benchmark suite lives in benchmarking/rag/. See the README for full setup instructions.

Quick Start

cd benchmarking/rag
uv venv && source .venv/bin/activate
uv pip install -r requirements.txt

# Copy and configure environment
cp .env.example .env
# Edit .env with your OPENAI_API_KEY

Run Against OpenAI

python run_benchmark.py --benchmark beir --base-url https://api.openai.com/v1
python run_benchmark.py --benchmark multihop --base-url https://api.openai.com/v1
python run_benchmark.py --benchmark doc2dial --base-url https://api.openai.com/v1

Run Against OGX

# Start OGX with Milvus backend
bash start_stack.sh

# Vector search mode
python run_benchmark.py --benchmark beir --base-url http://localhost:8321/v1

# Hybrid search mode (recommended)
python run_benchmark.py --benchmark beir --base-url http://localhost:8321/v1 --search-mode hybrid

Compare Results

After running benchmarks against multiple backends:

python compare_results.py              # Table output
python compare_results.py --format csv # CSV for spreadsheets

Extending the Benchmarks

The suite is designed to be extended with new benchmarks. Each benchmark implements the BenchmarkRunner interface:

from benchmarks.base import BenchmarkRunner

class MyBenchmark(BenchmarkRunner):
    name = "my_benchmark"

    def download(self) -> None:
        """Download or load the dataset."""
        ...

    def load_data(self) -> None:
        """Parse the dataset into corpus, queries, and ground truths."""
        ...

    def ingest(self) -> None:
        """Upload corpus to Files API and create a Vector Store."""
        ...

    def evaluate(self) -> dict:
        """Run queries and compute metrics."""
        ...

Register the new benchmark in run_benchmark.py and it will be available via --benchmark my_benchmark.