Benchmarking Contextual RAG Agents – The Technology that Powers the Contextual AI Platform
Enterprises are racing to harness generative AI to automate complex workflows, boost employee productivity, and deliver transformative value to customers. Yet achieving these outcomes demands far more than deploying a powerful LLM or using basic retrieval-augmented generation (RAG). Enterprise environments present unique challenges that cannot be addressed by cobbling together isolated components – every part of the pipeline must perform reliably amid complex, rapidly evolving business conditions. This is where our approach of building systems over models becomes indispensable: by optimizing the entire agentic system end-to-end, we deliver production-grade performance across the diverse landscape of enterprise needs.
Our Contextual RAG Agents, as demonstrated above, embody this systems-first mindset through breakthrough advances in RAG 2.0. We developed a unified pipeline that harmonizes document understanding, sophisticated retrieval, and grounded language modeling. While each component achieves state-of-the-art performance individually – on BIRD for structured reasoning, RAG-QA Arena for end-to-end RAG, OmniDocBench for document understanding, and BEIR for retrieval – the real breakthrough emerges from their seamless integration. Improvements cascade through the pipeline, enabling robust performance even in challenging real-world conditions. While this blog post focuses on evaluating our out-of-the-box agent, the Contextual AI Platform enables even greater performance gains through advanced tuning and alignment capabilities that specialize the Contextual RAG Agent for enterprise-specific needs.
The complexity of enterprise AI deployments demands a comprehensive approach to evaluation – one that goes beyond isolated model metrics to assess real-world performance across every critical component. That’s why we measure everything from end-to-end RAG performance and multi-modal document understanding to structured data retrieval and grounded language modeling. By benchmarking both at the component level and in integrated workflows across diverse customer domains, we capture the challenges enterprises actually face, from handling complex documents to ensuring faithful, reliable responses. In doing so, we validate our systems-first principle: it’s not about any single component’s performance, but about delivering consistent, production-ready solutions that drive tangible business value.
End-to-End Evaluation
The true test of our systems-first approach lies in comprehensive, end-to-end evaluation. Through the RAG-QA Arena benchmark, we evaluate our Contextual RAG Agent against industry-leading baselines. RAG-QA Arena provides a comprehensive test environment with diverse domain corpora, human-annotated responses, and realistic query distributions.
Our evaluation compares against strong baselines combining Cohere’s most accurate retrieval models (embed-english-v3 + rerank-v3.5) with state-of-the-art language models including Claude-3.5, GPT-4o, and Llama-3.1-Instruct. The results are definitive: our system achieves a win rate of 71.2%, a 4.4-point improvement over the strongest baseline (Cohere + Claude-3.5-Sonnet at 66.8%). This improvement demonstrates the power of our advanced retrieval and reranking capabilities combined with grounded language modeling – our system excels at finding relevant information and reasoning over it to provide accurate, helpful responses.
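To make the comparison concrete, below is a minimal sketch of how a pairwise win rate of this kind can be computed once an LLM judge has compared each system response against the human reference (the W + 0.5T convention is described in footnote 1). The function and the toy verdict counts are illustrative, not our evaluation harness.

```python
from collections import Counter

def pairwise_win_rate(judgements):
    """Compute win rate with ties counted as half a win (W + 0.5T).

    `judgements` is a list of verdicts from an LLM judge comparing a
    system response against the human reference: "win", "tie", or "loss".
    """
    counts = Counter(judgements)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no judgements provided")
    return (counts["win"] + 0.5 * counts["tie"]) / total

# Toy example: 60 wins, 20 ties, 20 losses -> 0.70 win rate
print(pairwise_win_rate(["win"] * 60 + ["tie"] * 20 + ["loss"] * 20))
```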
Document Understanding
Document understanding forms the cornerstone of any advanced AI system working with unstructured data. A system’s effectiveness fundamentally depends on its ability to both parse and comprehend input documents – high reliability at this stage reduces hallucinations across the full system. We’ve achieved a breakthrough in this crucial capability, developing exceptional proficiency in processing complex, multimodal documents.
For document parsing, a comprehensive evaluation on OmniDocBench demonstrates significant performance advantages across every core dimension. This benchmark comprises a challenging collection of documents, including highly dense technical text, varied layouts, and large, complex tables that must be inferred from raw images. Contextual’s Document Understanding system achieves an average score of 87.0, outperforming the next-best commercial solution (LlamaParse Premium) by 4.6% on average. This advantage holds across every dimension: precise text extraction, reading-order preservation, and table structure comprehension.
Going beyond raw document parsing to comprehension, we’ve engineered our system for optimal retrieval effectiveness through sophisticated architectural innovations like section hierarchy understanding and multimodal indexing. Our document understanding system can masterfully comprehend the full spectrum of document elements, from intricate schematics and technical diagrams to nested tables and dynamic charts, regardless of document length and context window budgets. This ensures downstream RAG agents receive properly contextualized information, maximizing the performance of the entire system.
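As a rough illustration of what “properly contextualized information” can look like downstream, the sketch below attaches section-hierarchy and modality metadata to each chunk at indexing time, so a retrieved passage carries its place in the document outline into the context window. The data structures and field names are hypothetical, not the platform’s actual schema.

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    """A retrievable unit that carries its position in the document outline."""
    text: str
    section_path: list[str]   # e.g. ["Annual Report 2024", "Financial Highlights"]
    modality: str = "text"    # "text", "table", "figure", ...

def contextualize(chunk: Chunk) -> str:
    """Prepend the section breadcrumb so the chunk is self-describing
    when it lands in the language model's context window."""
    breadcrumb = " > ".join(chunk.section_path)
    return f"[{breadcrumb}] ({chunk.modality})\n{chunk.text}"

chunk = Chunk(
    text="Quarterly revenue grew 12% year over year ...",
    section_path=["Annual Report 2024", "Financial Highlights"],
)
print(contextualize(chunk))
```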
Structured and Unstructured Data Retrieval
For AI to unlock transformative value in enterprise environments, agents must masterfully extract insights from both structured and unstructured data sources. While unstructured content represents over 80% of enterprise information, mission-critical data often resides exclusively in structured databases – from real-time inventory positions to compliance-mandated financial metrics. This duality creates a clear imperative: enterprise AI must excel across both domains simultaneously.
Contextual RAG Agent delivers a breakthrough, unified approach that seamlessly bridges this divide. Powered by advanced document understanding technology, our agent processes the full spectrum of enterprise content – from complex PDFs and technical documentation to relational databases and specialized data formats. Our sophisticated capabilities ensure robust extraction and retrieval of all content elements, from narrative text and tabular data to embedded diagrams and statistical visualizations, establishing comprehensive information access across the enterprise data landscape.
For unstructured data processing, we’ve engineered a mixture-of-retrievers system enhanced by our state-of-the-art reranker. The reranker plays a crucial role in determining which knowledge chunks enter the language model’s limited context window, driving end-to-end accuracy beyond what’s possible through simple candidate set expansion. On the industry-standard BEIR benchmark, our reranker achieves a score of 61.2, outperforming the next best solution (Voyage-v2 at 58.3) by 2.9% across 14 diverse datasets.
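The retrieve-then-rerank pattern described above can be sketched as follows: candidates from several retrievers are pooled and deduplicated, then a reranker decides which chunks enter the limited context window. The retriever and reranker objects here are stand-ins rather than our production components; the pool of 100 candidates and top-8 reranked chunks mirror the evaluation configuration noted in footnote 1.

```python
def retrieve_and_rerank(query, retrievers, reranker, pool_size=100, top_k=8):
    """Pool candidates from a mixture of retrievers, then let the
    reranker pick the chunks that enter the context window."""
    # 1. Gather candidates from each retriever (e.g. dense, sparse, structured).
    candidates, seen = [], set()
    for retriever in retrievers:
        for chunk in retriever.search(query, limit=pool_size):
            if chunk.id not in seen:
                seen.add(chunk.id)
                candidates.append(chunk)

    # 2. Score every (query, chunk) pair with the reranker and keep the best.
    scored = [(reranker.score(query, chunk.text), chunk) for chunk in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:top_k]]
```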
For structured data access, we’ve developed sophisticated code generation capabilities that leverage inference-time computation for precise SQL query formulation. This advancement goes beyond simple query writing – it represents a deep understanding of data schemas and relationships that enables exact information retrieval. The system sets new standards on the BIRD benchmark with 73.5% execution accuracy, demonstrating unparalleled ability to handle complex enterprise data structures.
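Execution accuracy, the metric BIRD reports, counts a generated query as correct when running it returns the same result set as the reference query. The sketch below shows that check against an in-memory SQLite database; the toy schema and queries are illustrative, and the model call that would produce the predicted SQL is omitted.

```python
import sqlite3

def execution_match(conn, predicted_sql: str, gold_sql: str) -> bool:
    """A query counts as correct if it returns the same rows as the gold
    query (order-insensitive) when executed against the database."""
    predicted = conn.execute(predicted_sql).fetchall()
    gold = conn.execute(gold_sql).fetchall()
    return sorted(predicted) == sorted(gold)

# Toy database standing in for an enterprise schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE inventory (sku TEXT, qty INTEGER)")
conn.executemany("INSERT INTO inventory VALUES (?, ?)", [("A1", 3), ("B2", 0)])

predicted = "SELECT sku FROM inventory WHERE qty > 0"
gold = "SELECT sku FROM inventory WHERE qty >= 1"
print(execution_match(conn, predicted, gold))  # True: same result set
```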
Grounded Language Modeling
In enterprise environments, reliability extends far beyond simply providing good answers – it requires responses to be grounded in trusted knowledge. Built with Meta’s Llama 3.3, our grounded language model (GLM) represents a breakthrough in this direction, engineered specifically to prioritize faithfulness to in-context retrievals over parametric knowledge. This architectural choice isn’t merely technical; it’s essential for systems that enterprises rely on for mission-critical workflows.
Enterprise AI requires responses that are verifiably true and reliable, not based on information implicitly learned during pre-training. While conventional models often generate confident but potentially incorrect answers drawn from their training data, our GLM is engineered to provide responses only when they can be substantiated. It clearly acknowledges uncertainty with “I don’t know” responses when needed – preventing the cascade of costly mistakes that can stem from false confidence. Evaluations against leading foundation model baselines on our proprietary customer benchmarks demonstrate GLM’s consistent and significant improvements in grounded reasoning across diverse enterprise contexts.
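The behavioral contract described above (answer only from retrieved evidence, otherwise abstain) can be pictured with a small prompting wrapper like the one below. The prompt wording and the `call_llm` callable are illustrative assumptions, not the GLM’s actual interface.

```python
GROUNDED_SYSTEM_PROMPT = (
    "Answer strictly from the provided sources. "
    "If the sources do not contain the answer, reply exactly: I don't know."
)

def grounded_answer(question: str, sources: list[str], call_llm) -> str:
    """Constrain generation to the retrieved evidence and allow abstention."""
    context = "\n\n".join(f"[Source {i + 1}] {s}" for i, s in enumerate(sources))
    prompt = f"{context}\n\nQuestion: {question}"
    answer = call_llm(system=GROUNDED_SYSTEM_PROMPT, user=prompt)
    # Downstream checks (see the groundedness metric in footnote 2) can verify
    # that every claim in `answer` is supported by `sources`.
    return answer
```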
Reliable Evaluation
Beyond the agent and its components, building robust, production-ready AI agents also requires continually pushing the boundaries of how we evaluate their capabilities. While our benchmarks for document understanding, retrieval, and grounded language modeling capture the core enterprise challenges, certain dimensions—such as fine-grained correctness, stylistic fidelity, or domain-specific constraints—remain difficult to measure. Through LMUnit and our natural language unit test paradigm, we’re establishing new standards for AI agent validation in mission-critical environments. While we leverage LMUnit extensively in our internal testing, the Contextual RAG Agent results presented here rely on industry-standard benchmarks and traditional evaluation methodologies that enable transparent comparisons against state-of-the-art systems in the field. LMUnit achieves state-of-the-art performance on FLASK and BigGenBench while matching or exceeding GPT-4o and Claude-3.5-Sonnet across RewardBench, InfoBench, and LFQA evaluations, demonstrating our commitment to rigorous, human-aligned AI evaluation.
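Natural language unit tests can be thought of as ordinary unit tests whose assertions are written in prose and scored by an evaluator model. The sketch below assumes a hypothetical `lmunit_score` callable and illustrative test statements; it is not the LMUnit API.

```python
UNIT_TESTS = [
    "The response cites at least one retrieved source.",
    "The response does not speculate beyond the provided documents.",
    "The response answers the question in under 100 words.",
]

def run_unit_tests(response: str, unit_tests, lmunit_score, threshold=4.0):
    """Score a response against each natural-language criterion (e.g. on a
    1-5 scale) and report which tests pass the chosen threshold."""
    results = {}
    for test in unit_tests:
        score = lmunit_score(response=response, unit_test=test)
        results[test] = (score, score >= threshold)
    return results
```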
Towards enterprise-grade RAG Agents
Our work demonstrates how building systems with robust core capabilities—from document understanding to grounded language modeling—creates the foundation for reliable enterprise RAG agents. Built entirely in-house, the Contextual RAG Agent deploys within customer VPCs to ensure ironclad security while maintaining superior performance. Through our sophisticated specialization pipelines that simulate diverse enterprise environments and provide continuous feedback, we enable the agent to evolve and adapt with enterprise requirements.
This unified approach—combining security, adaptability, and rigorous evaluation—represents our initial step into agentic RAG systems, demonstrating our commitment to building AI agents that not only excel in benchmarks but also deliver transformative value in complex enterprise operations.
Ready to experience the power of next-generation RAG agents? The Contextual AI Platform, featuring our Contextual RAG Agent with state-of-the-art document understanding, retrieval, and reasoning capabilities across both structured and unstructured data—with options to specialize further for your domain—is now generally available through our comprehensive API suite. Join the growing community of enterprises leveraging our breakthrough technology today — request access here and start building with production-grade AI.
1 Our most significant increase over the paper’s baselines comes from the use of modern retrieve + rerank systems, as opposed to the paper’s retrieval-only baselines (using ColBERT-v2). We evaluate and report performance on RAG-QA Arena using the open-source code from AWS, and report results that are comparable to Table 3 of the benchmark paper, with the minor difference that we report win rate (W + 0.5T) against the LFRQA human reference, rather than W and W+T separately. We use the same configuration (100 retrievals, top-8 reranked docs, identical system prompt) across all evaluations and maintain consistency with the paper in ensuring that generated responses are concise (~60 words), in part due to the known length biases of LLM-as-a-judge metrics. Although the figure only includes system-level comparisons for brevity, our ablations show that both our retrieval and generation components independently outperform the alternatives.
2 Our groundedness metric is a multi-step pipeline that first decomposes each model-generated response into atomic claims, then independently assesses the correctness of each claim. Internal evaluations confirm this approach aligns well with human annotations and demonstrates high precision/recall in detecting hallucinations.
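A minimal sketch of that two-step pipeline, with `decompose_claims` and `is_supported` as stand-ins for the model-backed components:

```python
def groundedness(response: str, evidence: list[str],
                 decompose_claims, is_supported) -> float:
    """Fraction of atomic claims in the response that are supported
    by the retrieved evidence (1.0 = fully grounded)."""
    claims = decompose_claims(response)   # e.g. an LLM splits the response into atomic facts
    if not claims:
        return 1.0                        # nothing asserted, nothing to ground
    supported = sum(1 for claim in claims if is_supported(claim, evidence))
    return supported / len(claims)
```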