RAG Benchmarks

OGX implements an OpenAI-compatible API surface for RAG — Files, Vector Stores, and the Responses API with the file_search tool. This page presents benchmark results comparing OGX's RAG quality against OpenAI SaaS, and describes the evaluation methodology.

Summary of Results

We evaluated OGX against OpenAI across four benchmark suites covering retrieval accuracy, end-to-end answer quality, and multi-turn conversational RAG. OGX was tested in two search modes: vector (pure semantic search) and hybrid (semantic + keyword search with reranking). We also tested with an open-source generation model (Gemma 4 31B-IT via vLLM) to validate OGX's model-swappable architecture.

Retrieval Quality (BEIR)

| Dataset  | Corpus Size | Queries | OpenAI | OGX (vector) | OGX (hybrid) |
|----------|-------------|---------|--------|--------------|--------------|
| nfcorpus | 3,633       | 323     | 0.3156 | 0.3106       | 0.3350       |
| scifact  | 5,183       | 300     | 0.7165 | 0.6943       | 0.7137       |
| arguana  | 8,674       | 1,406   | 0.2960 | 0.3765       | 0.3835       |
| fiqa     | 57,638      | 648     | 0.2862 | 0.2399       | 0.2170       |

Metric: nDCG@10 (normalized Discounted Cumulative Gain at rank 10). Higher is better.
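
For reference, one standard formulation (implementations differ in the gain function: trec_eval-style tools use the linear gain $rel_i$, while others use $2^{rel_i} - 1$):

$$
\mathrm{DCG@10} = \sum_{i=1}^{10} \frac{rel_i}{\log_2(i+1)},
\qquad
\mathrm{nDCG@10} = \frac{\mathrm{DCG@10}}{\mathrm{IDCG@10}}
$$

where $rel_i$ is the graded relevance of the document at rank $i$, and IDCG@10 is the DCG@10 of the ideal (relevance-sorted) ranking.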

End-to-End RAG Quality

| Benchmark    | Type                       | OpenAI (F1) | OGX vector (F1) | OGX hybrid (F1) | OGX hybrid + Gemma 31B (F1) |
|--------------|----------------------------|-------------|-----------------|-----------------|-----------------------------|
| MultiHOP RAG | Multi-hop reasoning        | 0.0114      | 0.0141          | 0.0141          | 0.0207                      |
| Doc2Dial     | Document-grounded dialogue | 0.1337      | 0.0962          | 0.0966          | 0.0634                      |

Metric: Token-level F1 (SQuAD-style). Higher is better. OpenAI and OGX vector/hybrid used GPT-4.1 for generation. The Gemma 31B column uses google/gemma-4-31B-it served via vLLM.

Key Takeaways

  • Hybrid search matches or beats OpenAI on 3 of 4 BEIR datasets (nfcorpus, scifact, arguana), demonstrating that OGX's combination of semantic search, keyword search, and reranking is competitive with OpenAI's proprietary retrieval.
  • OGX outperforms OpenAI on arguana by 29.6% (0.3835 vs 0.2960), a dataset that benefits from keyword matching on argumentative text.
  • fiqa is the outlier: OpenAI leads by 19.3% on this financial QA dataset (0.2862 vs 0.2399), likely due to differences in embedding models and chunking strategies for longer financial documents.
  • End-to-end RAG scores are low across all backends, reflecting the difficulty of these benchmarks (multi-hop reasoning, document-grounded dialogue) rather than a RAG-specific weakness.
  • Open-source models plug in without code changes. Gemma 4 31B-IT, served via vLLM, produced coherent answers across both Doc2Dial (1,203 queries, zero empty responses) and MultiHOP RAG (2,556 queries). On MultiHOP, Gemma 31B outperforms GPT-4.1 by +47% F1 — the only benchmark where the open-source model beats the proprietary one.

Additional Retrieval Metrics

Recall@10

| Dataset  | OpenAI | OGX vector | OGX hybrid |
|----------|--------|------------|------------|
| nfcorpus | 0.1469 | 0.1482     | 0.1646     |
| scifact  | 0.8067 | 0.8369     | 0.8362     |
| arguana  | 0.6764 | 0.7610     | 0.7781     |
| fiqa     | 0.3117 | 0.2843     | 0.2681     |

MAP@10

| Dataset  | OpenAI | OGX vector | OGX hybrid |
|----------|--------|------------|------------|
| nfcorpus | 0.1208 | 0.1153     | 0.1286     |
| scifact  | 0.6818 | 0.6442     | 0.6697     |
| arguana  | 0.1801 | 0.2542     | 0.2578     |
| fiqa     | 0.2319 | 0.1828     | 0.1593     |

Additional End-to-End Metrics

MultiHOP RAG

Multi-hop reasoning over 609 news articles, 2,556 queries.

| Metric      | OpenAI | OGX vector | OGX hybrid | OGX hybrid + Gemma 31B |
|-------------|--------|------------|------------|------------------------|
| F1          | 0.0114 | 0.0141     | 0.0141     | 0.0207                 |
| Exact Match | 0.0    | 0.0        | 0.0        | 0.0004                 |
| ROUGE-L     | 0.0116 | 0.0147     | 0.0147     | 0.0203                 |

Gemma 31B outperforms GPT-4.1 on MultiHOP by +47% F1, suggesting its more verbose, synthesized responses better capture multi-hop reasoning across documents.

Doc2Dial

Document-grounded dialogue: 488 documents, 200 conversations, 1,203 total turns.

| Metric      | OpenAI | OGX vector | OGX hybrid | OGX hybrid + Gemma 31B |
|-------------|--------|------------|------------|------------------------|
| F1          | 0.1337 | 0.0962     | 0.0966     | 0.0634                 |
| Exact Match | 0.0    | 0.0        | 0.0        | 0.0                    |
| ROUGE-L     | 0.1136 | 0.0790     | 0.0794     | 0.0513                 |

Gemma 31B's lower F1/ROUGE-L scores are driven by response verbosity (~2,500 chars avg vs ~95 char ground truths), not retrieval failure. The model produced zero empty responses across all 1,203 queries.

Analysis

Where OGX wins

  • arguana (+29.6% nDCG@10): Argumentative text benefits from hybrid search — keyword matching catches specific argument patterns that pure semantic search misses.
  • nfcorpus (+6.1% nDCG@10): Biomedical domain similarly benefits from hybrid search, where exact term matching (drug names, conditions) complements semantic similarity.
  • scifact: Effectively tied (0.7165 vs 0.7137, within 0.4%).

Where OpenAI wins

  • fiqa (+19.3% nDCG@10): The largest corpus (57K docs) with financial domain text. OpenAI's proprietary embedding model likely handles financial terminology better than nomic-embed-text-v1.5, which is a general-purpose model.
  • Doc2Dial (+39% F1): Dialogue grounding depends on retrieving short, precise passages, and OpenAI's retrieval system handles this better. This is the biggest quality gap.

Vector vs Hybrid on OGX

| Dataset  | Vector nDCG@10 | Hybrid nDCG@10 | Winner          |
|----------|----------------|----------------|-----------------|
| nfcorpus | 0.3106         | 0.3350         | Hybrid (+7.9%)  |
| scifact  | 0.6943         | 0.7137         | Hybrid (+2.8%)  |
| arguana  | 0.3765         | 0.3835         | Hybrid (+1.9%)  |
| fiqa     | 0.2399         | 0.2170         | Vector (+10.6%) |

Hybrid search outperforms vector on 3 of 4 BEIR datasets. The exception is fiqa, where keyword search may add noise for financial opinion queries.

Open-source model (Gemma 31B)

Gemma 4 31B-IT was served via vLLM and connected to OGX as a remote::openai inference provider, using the same retrieval pipeline as the GPT-4.1 runs. On MultiHOP RAG, Gemma 31B outperforms GPT-4.1 by +47% F1 (0.0207 vs 0.0141), suggesting its more verbose, synthesized responses better capture multi-hop reasoning. On Doc2Dial, lower F1/ROUGE-L scores vs GPT-4.1 are driven by response verbosity (avg ~2,500 chars vs ~95 char ground truths), not retrieval failure — the model produced zero empty responses across all 1,203 queries. This demonstrates OGX's model-swappable architecture: open-source models can be plugged in without any code changes.
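
vLLM exposes an OpenAI-compatible endpoint (for example, started with vllm serve), so the provider swap is purely configuration. A minimal sketch of what a generation call against that endpoint looks like; the URL, port, and prompt are placeholders:

from openai import OpenAI

# vLLM's server speaks the same chat/completions protocol as OpenAI,
# so identical client code works with only a different base_url.
vllm = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
out = vllm.chat.completions.create(
    model="google/gemma-4-31B-it",
    messages=[{"role": "user", "content": "Summarize the retrieved passages."}],
)
print(out.choices[0].message.content)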

Generation quality observations

  • All end-to-end benchmarks show low absolute scores (F1 < 0.15), which is consistent with prior work on these datasets.
  • Exact Match is 0.0 across all backends — the model generates verbose answers while ground truths are often short extractive spans.
  • For GPT-4.1 runs, answer quality differences reflect retrieval quality differences, not generation differences. The Gemma 31B results show that model choice matters: verbosity hurts on Doc2Dial (short ground truths) but helps on MultiHOP (multi-document synthesis).

Methodology

API Surface Tested

The benchmark suite exercises four layers of the OpenAI-compatible API:

  1. Files API (POST /v1/files) — Upload documents as individual files
  2. Vector Stores API (POST /v1/vector_stores, POST /v1/vector_stores/{id}/files) — Create vector stores and attach files for automatic chunking and embedding
  3. Vector Stores Search API (POST /v1/vector_stores/{id}/search) — Direct retrieval evaluation (BEIR)
  4. Responses API (POST /v1/responses with file_search tool) — End-to-end RAG with automatic retrieval and generation (MultiHOP, Doc2Dial)

The same benchmark code runs against both OpenAI and OGX — the only difference is the --base-url flag.
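
For illustration, a minimal sketch of all four layers using the official openai Python client (assuming a recent SDK where vector_stores and responses are top-level); the file name, store name, and queries are placeholders:

from openai import OpenAI

# Point at OpenAI (https://api.openai.com/v1) or OGX (http://localhost:8321/v1).
client = OpenAI(base_url="http://localhost:8321/v1")

# 1. Files API: upload a document.
uploaded = client.files.create(file=open("doc.txt", "rb"), purpose="assistants")

# 2. Vector Stores API: create a store and attach the file
#    (chunking and embedding happen server-side).
store = client.vector_stores.create(name="benchmark-corpus")
client.vector_stores.files.create(vector_store_id=store.id, file_id=uploaded.id)

# 3. Vector Stores Search API: direct retrieval (BEIR).
hits = client.vector_stores.search(vector_store_id=store.id, query="example query")

# 4. Responses API with file_search: end-to-end RAG (MultiHOP, Doc2Dial).
response = client.responses.create(
    model="gpt-4.1",
    input="What does the corpus say about X?",
    tools=[{"type": "file_search", "vector_store_ids": [store.id]}],
)
print(response.output_text)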

OGX Configuration

| Component        | Configuration |
|------------------|---------------|
| Embedding model  | nomic-ai/nomic-embed-text-v1.5 (sentence-transformers) |
| Reranker model   | Qwen/Qwen3-Reranker-0.6B (transformers) |
| Vector database  | Milvus (standalone, remote) |
| Chunk size       | 512 tokens |
| Chunk overlap    | 128 tokens |
| Hybrid search    | RRF fusion (impact factor 60.0) with reranker |
| Generation model | GPT-4.1 (via remote::openai provider); google/gemma-4-31B-it (via vLLM) |
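
Hybrid mode fuses the semantic and keyword rankings with Reciprocal Rank Fusion before reranking. A minimal sketch of RRF, assuming the impact factor in the table is the standard RRF k constant (OGX's actual fusion code may differ):

def rrf_fuse(rankings: list[list[str]], k: float = 60.0) -> list[str]:
    """Fuse ranked lists of document IDs with Reciprocal Rank Fusion.

    Each document scores sum(1 / (k + rank)) across the lists it appears
    in, with 1-based ranks; higher fused scores rank first.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Fuse a semantic ranking with a keyword ranking.
print(rrf_fuse([["d3", "d1", "d7"], ["d1", "d9", "d3"]]))  # d1 and d3 rise to the top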

Benchmarks

BEIR (Retrieval-Only)

BEIR is a standard information retrieval benchmark. We evaluate on four datasets spanning biomedical, scientific, argumentative, and financial domains:

  • nfcorpus — Biomedical information retrieval (3,633 documents, 323 queries)
  • scifact — Scientific fact verification (5,183 documents, 300 queries)
  • arguana — Counterargument retrieval (8,674 documents, 1,406 queries)
  • fiqa — Financial opinion QA (57,638 documents, 648 queries)

BEIR benchmarks are retrieval-only — no LLM is involved. Documents are uploaded via the Files API, indexed in a Vector Store, and queries are evaluated using the Vector Stores Search API. Results are scored with pytrec_eval against BEIR's ground-truth relevance judgments.
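
A condensed sketch of the scoring step, assuming qrels and the retrieval run are already in pytrec_eval's nested-dict format (query ID to document ID to relevance judgment or score):

import pytrec_eval

# Ground-truth judgments and the system's retrieval scores (toy data).
qrels = {"q1": {"doc1": 1, "doc2": 0}}
run = {"q1": {"doc1": 0.92, "doc2": 0.35, "doc9": 0.10}}

evaluator = pytrec_eval.RelevanceEvaluator(qrels, {"ndcg_cut.10", "recall.10", "map_cut.10"})
per_query = evaluator.evaluate(run)

# Average each metric over queries.
for measure in ("ndcg_cut_10", "recall_10", "map_cut_10"):
    mean = sum(q[measure] for q in per_query.values()) / len(per_query)
    print(measure, round(mean, 4))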

MultiHOP RAG (End-to-End)

MultiHOP RAG tests multi-hop reasoning over news articles. The system must retrieve evidence from multiple documents and synthesize an answer. Queries are sent to the Responses API with the file_search tool, and answers are evaluated with Exact Match, token-level F1, and ROUGE-L.

Doc2Dial (Document-Grounded Dialogue)

Doc2Dial evaluates document-grounded dialogue across government and social service domains. Each conversation is a multi-turn exchange between a user and an agent, grounded in a specific document. Conversations are threaded using previous_response_id to maintain context across turns, matching how production chat applications work.
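
A minimal sketch of that threading pattern; the vector store ID and user turns are placeholders:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8321/v1")
tools = [{"type": "file_search", "vector_store_ids": ["vs_doc2dial"]}]

previous_id = None
for user_turn in ["How do I renew my license?", "Which documents do I need?"]:
    response = client.responses.create(
        model="gpt-4.1",
        input=user_turn,
        tools=tools,
        previous_response_id=previous_id,  # carries prior turns forward
    )
    previous_id = response.id
    print(response.output_text)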

Metrics

| Metric      | Used In            | Description |
|-------------|--------------------|-------------|
| nDCG@10     | BEIR               | Normalized Discounted Cumulative Gain at rank 10; measures ranking quality |
| Recall@10   | BEIR               | Fraction of relevant documents retrieved in the top 10 |
| MAP@10      | BEIR               | Mean Average Precision at rank 10 |
| Exact Match | MultiHOP, Doc2Dial | Whether the prediction exactly matches the ground truth |
| F1          | MultiHOP, Doc2Dial | Token-level F1 (SQuAD-style precision/recall on answer tokens) |
| ROUGE-L     | MultiHOP, Doc2Dial | Longest common subsequence overlap between prediction and ground truth |
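
For reference, a minimal sketch of SQuAD-style token F1 (whitespace tokenization only; the official SQuAD scorer additionally lowercases and strips punctuation and articles):

from collections import Counter

def token_f1(prediction: str, ground_truth: str) -> float:
    """Harmonic mean of token precision and recall over shared tokens."""
    pred_tokens = prediction.split()
    gold_tokens = ground_truth.split()
    num_same = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

# Verbose predictions keep recall but dilute precision, capping F1.
print(token_f1("the renewal fee is 40 dollars in total", "40 dollars"))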

Running the Benchmarks

The benchmark suite lives in benchmarking/rag/. See the README for full setup instructions.

Quick Start

cd benchmarking/rag
uv venv && source .venv/bin/activate
uv pip install -r requirements.txt

# Copy and configure environment
cp .env.example .env
# Edit .env with your OPENAI_API_KEY

Run Against OpenAI

python run_benchmark.py --benchmark beir --base-url https://api.openai.com/v1
python run_benchmark.py --benchmark multihop --base-url https://api.openai.com/v1
python run_benchmark.py --benchmark doc2dial --base-url https://api.openai.com/v1

Run Against OGX

# Start OGX with Milvus backend
bash start_stack.sh

# Vector search mode
python run_benchmark.py --benchmark beir --base-url http://localhost:8321/v1

# Hybrid search mode (recommended)
python run_benchmark.py --benchmark beir --base-url http://localhost:8321/v1 --search-mode hybrid

Compare Results

After running benchmarks against multiple backends:

python compare_results.py              # Table output
python compare_results.py --format csv # CSV for spreadsheets

Extending the Benchmarks

The suite is designed to be extended with new benchmarks. Each benchmark implements the BenchmarkRunner interface:

from benchmarks.base import BenchmarkRunner

class MyBenchmark(BenchmarkRunner):
    name = "my_benchmark"

    def download(self) -> None:
        """Download or load the dataset."""
        ...

    def load_data(self) -> None:
        """Parse the dataset into corpus, queries, and ground truths."""
        ...

    def ingest(self) -> None:
        """Upload corpus to Files API and create a Vector Store."""
        ...

    def evaluate(self) -> dict:
        """Run queries and compute metrics."""
        ...

Register the new benchmark in run_benchmark.py and it will be available via --benchmark my_benchmark.