RAG Benchmarks
OGX implements an OpenAI-compatible API surface for RAG — Files, Vector Stores, and the Responses API with the file_search tool. This page presents benchmark results comparing OGX's RAG quality against OpenAI SaaS, and describes the evaluation methodology.
Summary of Results
We evaluated OGX against OpenAI across four benchmark suites covering retrieval accuracy, end-to-end answer quality, and multi-turn conversational RAG. OGX was tested in two search modes: vector (pure semantic search) and hybrid (semantic + keyword search with reranking). We also tested with an open-source generation model (Gemma 4 31B-IT via vLLM) to validate OGX's model-swappable architecture.
Retrieval Quality (BEIR)
| Dataset | Corpus Size | Queries | OpenAI | OGX (vector) | OGX (hybrid) |
|---|---|---|---|---|---|
| nfcorpus | 3,633 | 323 | 0.3156 | 0.3106 | 0.3350 |
| scifact | 5,183 | 300 | 0.7165 | 0.6943 | 0.7137 |
| arguana | 8,674 | 1,406 | 0.2960 | 0.3765 | 0.3835 |
| fiqa | 57,638 | 648 | 0.2862 | 0.2399 | 0.2170 |
Metric: nDCG@10 (normalized Discounted Cumulative Gain at rank 10). Higher is better.
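For intuition, here is a minimal sketch of nDCG@10 under binary relevance (the actual scoring uses pytrec_eval against BEIR's graded relevance judgments, so this is a simplification for illustration only):

```python
import math

def ndcg_at_10(ranked_ids, relevant_ids, k=10):
    """nDCG@k with binary relevance: DCG of the returned ranking
    divided by the DCG of an ideal ranking of the relevant docs."""
    dcg = sum(
        1.0 / math.log2(rank + 2)          # gain discounted by log of position
        for rank, doc_id in enumerate(ranked_ids[:k])
        if doc_id in relevant_ids
    )
    ideal = sum(1.0 / math.log2(rank + 2) for rank in range(min(len(relevant_ids), k)))
    return dcg / ideal if ideal else 0.0
```

Ranking every relevant document first scores 1.0; pushing a relevant document down the list discounts its contribution logarithmically.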
End-to-End RAG Quality
| Benchmark | Type | OpenAI (F1) | OGX vector (F1) | OGX hybrid (F1) | OGX hybrid + Gemma 31B (F1) |
|---|---|---|---|---|---|
| MultiHOP RAG | Multi-hop reasoning | 0.0114 | 0.0141 | 0.0141 | 0.0207 |
| Doc2Dial | Document-grounded dialogue | 0.1337 | 0.0962 | 0.0966 | 0.0634 |
Metric: Token-level F1 (SQuAD-style). Higher is better. OpenAI and OGX vector/hybrid used GPT-4.1 for generation. The Gemma 31B column uses google/gemma-4-31B-it served via vLLM.
Key Takeaways
- Hybrid search matches or beats OpenAI on 3 of 4 BEIR datasets (nfcorpus, scifact, arguana), demonstrating that OGX's combination of semantic search, keyword search, and reranking is competitive with OpenAI's proprietary retrieval.
- OGX outperforms OpenAI on arguana by 29% (0.3835 vs 0.2960), a dataset that benefits from keyword matching on argumentative text.
- fiqa is the one clear gap — OpenAI leads by 19% on this financial QA dataset (0.2862 vs 0.2399), likely due to differences in embedding models and chunking strategies for longer financial documents.
- End-to-end RAG scores are low across all backends, reflecting the difficulty of these benchmarks (multi-hop reasoning, document-grounded dialogue) rather than a RAG-specific weakness.
- Open-source models plug in without code changes. Gemma 4 31B-IT, served via vLLM, produced coherent answers across both Doc2Dial (1,203 queries, zero empty responses) and MultiHOP RAG (2,556 queries). On MultiHOP, Gemma 31B outperforms GPT-4.1 by +47% F1 — the only benchmark where the open-source model beats the proprietary one.
Additional Retrieval Metrics
Recall@10
| Dataset | OpenAI | OGX vector | OGX hybrid |
|---|---|---|---|
| nfcorpus | 0.1469 | 0.1482 | 0.1646 |
| scifact | 0.8067 | 0.8369 | 0.8362 |
| arguana | 0.6764 | 0.7610 | 0.7781 |
| fiqa | 0.3117 | 0.2843 | 0.2681 |
MAP@10
| Dataset | OpenAI | OGX vector | OGX hybrid |
|---|---|---|---|
| nfcorpus | 0.1208 | 0.1153 | 0.1286 |
| scifact | 0.6818 | 0.6442 | 0.6697 |
| arguana | 0.1801 | 0.2542 | 0.2578 |
| fiqa | 0.2319 | 0.1828 | 0.1593 |
Additional End-to-End Metrics
MultiHOP RAG
Multi-hop reasoning over 609 news articles, 2,556 queries.
| Metric | OpenAI | OGX vector | OGX hybrid | OGX hybrid + Gemma 31B |
|---|---|---|---|---|
| F1 | 0.0114 | 0.0141 | 0.0141 | 0.0207 |
| Exact Match | 0.0 | 0.0 | 0.0 | 0.0004 |
| ROUGE-L | 0.0116 | 0.0147 | 0.0147 | 0.0203 |
Gemma 31B outperforms GPT-4.1 on MultiHOP by +47% F1, suggesting its more verbose, synthesized responses better capture multi-hop reasoning across documents.
Doc2Dial
Document-grounded dialogue: 488 documents, 200 conversations, 1,203 total turns.
| Metric | OpenAI | OGX vector | OGX hybrid | OGX hybrid + Gemma 31B |
|---|---|---|---|---|
| F1 | 0.1337 | 0.0962 | 0.0966 | 0.0634 |
| Exact Match | 0.0 | 0.0 | 0.0 | 0.0 |
| ROUGE-L | 0.1136 | 0.0790 | 0.0794 | 0.0513 |
Gemma 31B's lower F1/ROUGE-L scores are driven by response verbosity (~2,500 chars avg vs ~95 char ground truths), not retrieval failure. The model produced zero empty responses across all 1,203 queries.
Analysis
Where OGX wins
- arguana (+29.6% nDCG@10): Argumentative text benefits from hybrid search — keyword matching catches specific argument patterns that pure semantic search misses.
- nfcorpus (+6.1% nDCG@10): Biomedical domain similarly benefits from hybrid search, where exact term matching (drug names, conditions) complements semantic similarity.
- scifact: Effectively tied (0.7165 vs 0.7137, within 0.4%).
Where OpenAI wins
- fiqa (+19.3% nDCG@10): The largest corpus (57K docs) with financial domain text. OpenAI's proprietary embedding model likely handles financial terminology better than nomic-embed-text-v1.5, which is a general-purpose model.
- Doc2Dial (+39% F1): Precise passage retrieval for dialogue grounding benefits from OpenAI's retrieval system. This is the biggest quality gap.
Vector vs Hybrid on OGX
| Dataset | Vector nDCG@10 | Hybrid nDCG@10 | Winner |
|---|---|---|---|
| nfcorpus | 0.3106 | 0.3350 | Hybrid (+7.9%) |
| scifact | 0.6943 | 0.7137 | Hybrid (+2.8%) |
| arguana | 0.3765 | 0.3835 | Hybrid (+1.9%) |
| fiqa | 0.2399 | 0.2170 | Vector (+10.6%) |
Hybrid search outperforms vector on 3 of 4 BEIR datasets. The exception is fiqa, where keyword search may add noise for financial opinion queries.
Open-source model (Gemma 31B)
Gemma 4 31B-IT was served via vLLM and connected to OGX as a remote::openai inference provider, using the same retrieval pipeline as the GPT-4.1 runs. On MultiHOP RAG, Gemma 31B outperforms GPT-4.1 by +47% F1 (0.0207 vs 0.0141), suggesting its more verbose, synthesized responses better capture multi-hop reasoning. On Doc2Dial, lower F1/ROUGE-L scores vs GPT-4.1 are driven by response verbosity (avg ~2,500 chars vs ~95 char ground truths), not retrieval failure — the model produced zero empty responses across all 1,203 queries. This demonstrates OGX's model-swappable architecture: open-source models can be plugged in without any code changes.
Generation quality observations
- All end-to-end benchmarks show low absolute scores (F1 < 0.15), which is consistent with prior work on these datasets.
- Exact Match is 0.0 across all backends — the model generates verbose answers while ground truths are often short extractive spans.
- For GPT-4.1 runs, answer quality differences reflect retrieval quality differences, not generation differences. The Gemma 31B results show that model choice matters: verbosity hurts on Doc2Dial (short ground truths) but helps on MultiHOP (multi-document synthesis).
Methodology
API Surface Tested
The benchmark suite exercises three layers of the OpenAI-compatible API:
- Files API (`POST /v1/files`) — Upload documents as individual files
- Vector Stores API (`POST /v1/vector_stores`, `POST /v1/vector_stores/{id}/files`) — Create vector stores and attach files for automatic chunking and embedding
- Vector Stores Search API (`POST /v1/vector_stores/{id}/search`) — Direct retrieval evaluation (BEIR)
- Responses API (`POST /v1/responses` with `file_search` tool) — End-to-end RAG with automatic retrieval and generation (MultiHOP, Doc2Dial)
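Concretely, a single end-to-end query is one Responses API call with the `file_search` tool attached. The request body below is illustrative; the model name, prompt, and vector store ID are placeholders:

```json
{
  "model": "gpt-4.1",
  "input": "What are the treatment options discussed in the corpus?",
  "tools": [
    { "type": "file_search", "vector_store_ids": ["vs_example123"] }
  ]
}
```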
The same benchmark code runs against both OpenAI and OGX — the only difference is the `--base-url` flag.
OGX Configuration
| Component | Configuration |
|---|---|
| Embedding model | nomic-ai/nomic-embed-text-v1.5 (sentence-transformers) |
| Reranker model | Qwen/Qwen3-Reranker-0.6B (transformers) |
| Vector database | Milvus (standalone, remote) |
| Chunk size | 512 tokens |
| Chunk overlap | 128 tokens |
| Hybrid search | RRF fusion (impact factor 60.0) with reranker |
| Generation model | GPT-4.1 (via remote::openai provider), google/gemma-4-31B-it (via vLLM) |
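The hybrid mode fuses the semantic and keyword result lists before reranking. Assuming the "impact factor 60.0" plays the role of the usual RRF constant k (the exact fusion code in OGX may differ), reciprocal rank fusion can be sketched as:

```python
def rrf_fuse(rankings, k=60.0):
    """Reciprocal Rank Fusion: each document scores sum(1 / (k + rank))
    over every ranked list it appears in; higher total is better."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# A document ranked highly by both semantic and keyword search rises to the top
semantic = ["d3", "d1", "d7"]
keyword = ["d3", "d7", "d2"]
fused = rrf_fuse([semantic, keyword])
```

Because RRF only uses ranks, it fuses lists whose raw scores (cosine similarity vs BM25-style keyword scores) are not directly comparable; the reranker then rescores the fused candidates.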
Benchmarks
BEIR (Retrieval-Only)
BEIR is a standard information retrieval benchmark. We evaluate on four datasets spanning biomedical, scientific, argumentative, and financial domains:
- nfcorpus — Biomedical information retrieval (3,633 documents, 323 queries)
- scifact — Scientific fact verification (5,183 documents, 300 queries)
- arguana — Counterargument retrieval (8,674 documents, 1,406 queries)
- fiqa — Financial opinion QA (57,638 documents, 648 queries)
BEIR benchmarks are retrieval-only — no LLM is involved. Documents are uploaded via the Files API, indexed in a Vector Store, and queries are evaluated using the Vector Stores Search API. Results are scored with pytrec_eval against BEIR's ground-truth relevance judgments.
MultiHOP RAG (End-to-End)
MultiHOP RAG tests multi-hop reasoning over news articles. The system must retrieve evidence from multiple documents and synthesize an answer. Queries are sent to the Responses API with the file_search tool, and answers are evaluated with Exact Match, token-level F1, and ROUGE-L.
Doc2Dial (Document-Grounded Dialogue)
Doc2Dial evaluates document-grounded dialogue across government and social service domains. Each conversation is a multi-turn exchange between a user and an agent, grounded in a specific document. Conversations are threaded using previous_response_id to maintain context across turns, matching how production chat applications work.
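The per-turn request shape can be sketched as below — a hypothetical helper, not the benchmark's actual code; it only illustrates how `previous_response_id` chains turns:

```python
def build_turn(model, user_message, vector_store_id, previous_response_id=None):
    """Request body for one dialogue turn; later turns pass the previous
    response's ID so the server threads the conversation."""
    body = {
        "model": model,
        "input": user_message,
        "tools": [{"type": "file_search", "vector_store_ids": [vector_store_id]}],
    }
    if previous_response_id is not None:
        body["previous_response_id"] = previous_response_id
    return body

# First turn opens the thread; the second references the first response's ID
turn1 = build_turn("gpt-4.1", "How do I renew my license?", "vs_abc")
turn2 = build_turn("gpt-4.1", "What documents do I need?", "vs_abc",
                   previous_response_id="resp_1")
```

Threading server-side means the client never replays the transcript — each request carries only the new user message plus the previous response ID.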
Metrics
| Metric | Used In | Description |
|---|---|---|
| nDCG@10 | BEIR | Normalized Discounted Cumulative Gain at rank 10 — measures ranking quality |
| Recall@10 | BEIR | Fraction of relevant documents retrieved in top 10 |
| MAP@10 | BEIR | Mean Average Precision at rank 10 |
| Exact Match | MultiHOP, Doc2Dial | Whether the prediction exactly matches the ground truth |
| F1 | MultiHOP, Doc2Dial | Token-level F1 (SQuAD-style precision/recall on answer tokens) |
| ROUGE-L | MultiHOP, Doc2Dial | Longest common subsequence overlap between prediction and ground truth |
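A minimal sketch of SQuAD-style token-level F1 (the suite's exact normalization rules, e.g. article stripping, may differ):

```python
import re
from collections import Counter

def normalize(text):
    """Lowercase and strip punctuation before tokenizing."""
    return re.sub(r"[^a-z0-9 ]", " ", text.lower()).split()

def token_f1(prediction, ground_truth):
    """Harmonic mean of token precision and recall against the ground truth."""
    pred, truth = normalize(prediction), normalize(ground_truth)
    overlap = sum((Counter(pred) & Counter(truth)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(truth)
    return 2 * precision * recall / (precision + recall)
```

This makes the verbosity penalty concrete: a long answer that contains the short ground-truth span gets perfect recall but diluted precision, which is exactly the pattern seen in the Gemma 31B Doc2Dial results.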
Running the Benchmarks
The benchmark suite lives in `benchmarking/rag/`. See the README for full setup instructions.
Quick Start
```shell
cd benchmarking/rag
uv venv && source .venv/bin/activate
uv pip install -r requirements.txt

# Copy and configure environment
cp .env.example .env
# Edit .env with your OPENAI_API_KEY
```
Run Against OpenAI
```shell
python run_benchmark.py --benchmark beir --base-url https://api.openai.com/v1
python run_benchmark.py --benchmark multihop --base-url https://api.openai.com/v1
python run_benchmark.py --benchmark doc2dial --base-url https://api.openai.com/v1
```
Run Against OGX
```shell
# Start OGX with Milvus backend
bash start_stack.sh

# Vector search mode
python run_benchmark.py --benchmark beir --base-url http://localhost:8321/v1

# Hybrid search mode (recommended)
python run_benchmark.py --benchmark beir --base-url http://localhost:8321/v1 --search-mode hybrid
```
Compare Results
After running benchmarks against multiple backends:
```shell
python compare_results.py               # Table output
python compare_results.py --format csv  # CSV for spreadsheets
```
Extending the Benchmarks
The suite is designed to be extended with new benchmarks. Each benchmark implements the BenchmarkRunner interface:
```python
from benchmarks.base import BenchmarkRunner

class MyBenchmark(BenchmarkRunner):
    name = "my_benchmark"

    def download(self) -> None:
        """Download or load the dataset."""
        ...

    def load_data(self) -> None:
        """Parse the dataset into corpus, queries, and ground truths."""
        ...

    def ingest(self) -> None:
        """Upload corpus to Files API and create a Vector Store."""
        ...

    def evaluate(self) -> dict:
        """Run queries and compute metrics."""
        ...
```
Register the new benchmark in `run_benchmark.py` and it will be available via `--benchmark my_benchmark`.