The Retail Media AI Benchmark: Accuracy Separates the Real from the Hype

Reading time: 5 minutes

We tested 4 AI agents on 91 real Amazon ad workflows. The Xnurta Insight Agent hit 86.1% data accuracy. The strongest foundation model: 35.3%. Here's what that gap means for your campaigns.

Executive Summary

In the last 18 months, nearly every retail media platform has launched an "AI agent." For brands and agencies, this flood of new tools has created a frustrating problem: the marketing copy is nearly identical from vendor to vendor, so it says little about objective quality.

The dominant public benchmarks for large language models, such as MMLU and HumanEval, measure closed-form knowledge against textbook ground truths. They do not answer the critical question for retail media buyers: can this model sit in front of a live advertiser account and produce usable, accurate answers? To bring rigor to an industry drowning in hype, we developed a benchmark that evaluates four agents (Xnurta Insight Agent, Claude Opus 4.7, Gemini 3.1 Pro, and GPT 5.5) on real-world Amazon advertising workflows.

The headline finding: on the quality of what they write, the best foundation model, Claude Opus 4.7, is effectively tied with the domain-native agent (quality average 4.46 vs. 4.45). On the accuracy of the underlying numbers, however, foundation models collapse. The Xnurta Insight Agent answered 86.1% of data accuracy questions correctly, compared to just 35.3% for the strongest foundation model. The data tells a decisive story: broad LLMs can write plausible analysis, but in this benchmark only the domain-native agent could be trusted to get the numbers right.

Methodology & The Dual-Grader Design

The benchmark uses 91 questions drawn from live Amazon advertiser workflows, ranging from portfolio analysis to budget pacing.

Performance was measured across five dimensions:

G1 (Response Fit): Did the agent follow the formatting and scope constraints?
G2 (Analysis Scope): Did the agent reference the correct data blocks, time windows, and baselines?
G3 (Objective Correctness): Are the facts, formulas, and numbers verifiable against a frozen gold-standard data table?
G4 (Reasoning Quality): Is the evidence-to-conclusion chain credible and logically sound?
G5 (Recommendation Quality): Are the recommendations specific, parameterized, and reversible?

To ensure grading rigor, we used a dual-grader design. Claude Opus 4.7 acted as an LLM judge to provide a baseline score, while a panel of 9 human raters (all with 5+ years of Amazon ads experience) cross-graded the subjective dimensions (G2, G4, G5). The final scores blend both human and AI evaluations to reduce individual rater bias.
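The article does not say how the human and LLM-judge scores are blended, so the following is only a minimal sketch of one plausible scheme: average the human panel's scores for a dimension, then take a weighted mean with the LLM judge's score. The function name `blend_dimension`, the `human_weight` parameter, and its 0.5 default are all assumptions made for illustration, not the benchmark's actual procedure.

```python
from statistics import mean


def blend_dimension(llm_score: float,
                    human_scores: list[float],
                    human_weight: float = 0.5) -> float:
    """Blend one LLM-judge score with the mean of the human panel's scores.

    ASSUMPTION: the source does not state the blending rule. A weighted
    mean with human_weight=0.5 is a hypothetical choice for illustration.
    """
    if not 0.0 <= human_weight <= 1.0:
        raise ValueError("human_weight must be between 0 and 1")
    if not human_scores:
        # Dimensions without human cross-grading fall back to the LLM judge.
        return llm_score
    panel_mean = mean(human_scores)
    return human_weight * panel_mean + (1.0 - human_weight) * llm_score
```

Averaging across the panel before blending means a single unusually harsh or generous rater moves the final score only slightly, which is the kind of bias reduction the dual-grader design is aiming for.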

The Accuracy Multiplier (Why G3 is Split Out)

In this benchmark, objective accuracy (G3) is not averaged in with the other dimensions. Instead, each agent's G3 accuracy rate multiplies the average of its other four scores.

Formula: Composite = Q_avg(G1, G2, G4, G5) × G3 accuracy, with G3 accuracy expressed as a fraction (86.1% = 0.861)

This design honors a harsh reality of e-commerce advertising: excellent reasoning applied to the wrong data is worse than useless. It is dangerous.

Benchmark Results

Rank	Agent	Quality avg	G3 accuracy	Composite
1st	Xnurta Insight Agent	4.45	86.1%	3.84
2nd	Claude Opus 4.7	4.46	35.3%	1.57
3rd	Gemini 3.1 Pro	3.54	27.6%	0.98
4th	GPT 5.5	2.35	20.0%	0.47
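As a quick check, the composite column can be recomputed from the other two columns using the formula given earlier. The sketch below takes its numbers from the table; the names `composite`, `RESULTS`, and `recompute` are illustrative only. Because the published inputs are already rounded, a recomputed value can differ from the published one in the last digit (the Xnurta row recomputes to 3.83 against a published 3.84).

```python
def composite(quality_avg: float, g3_accuracy_pct: float) -> float:
    """Composite = quality average x G3 accuracy (given in percent)."""
    return quality_avg * g3_accuracy_pct / 100.0


# (quality average, G3 accuracy in percent, published composite) per agent,
# copied from the results table.
RESULTS = {
    "Xnurta Insight Agent": (4.45, 86.1, 3.84),
    "Claude Opus 4.7": (4.46, 35.3, 1.57),
    "Gemini 3.1 Pro": (3.54, 27.6, 0.98),
    "GPT 5.5": (2.35, 20.0, 0.47),
}


def recompute(results: dict) -> dict:
    """Recompute each agent's composite from the rounded table inputs."""
    return {
        agent: round(composite(quality, accuracy), 2)
        for agent, (quality, accuracy, _published) in results.items()
    }


for agent, value in recompute(RESULTS).items():
    published = RESULTS[agent][2]
    print(f"{agent}: recomputed {value:.2f}, published {published:.2f}")
```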

The Foundation Model Collapse

Before accuracy is factored in, Claude Opus 4.7 and the Xnurta Insight Agent are effectively tied in Quality Average (4.46 vs. 4.45). Opus excels at articulating uncertainty and providing highly specific recommendation parameters.

However, once G3 accuracy is applied, the picture changes entirely. Xnurta’s composite score is 2.4× that of Opus 4.7, 3.9× that of Gemini 3.1 Pro, and 8.2× that of GPT 5.5.
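These multiples follow directly from the composite column of the results table. A minimal sketch of the arithmetic, with `COMPOSITE` and `ratio_to_leader` as illustrative names:

```python
# Published composite scores, copied from the results table.
COMPOSITE = {
    "Xnurta Insight Agent": 3.84,
    "Claude Opus 4.7": 1.57,
    "Gemini 3.1 Pro": 0.98,
    "GPT 5.5": 0.47,
}


def ratio_to_leader(scores: dict, leader: str) -> dict:
    """Return leader_score / other_score for every other agent, to 1 decimal."""
    top = scores[leader]
    return {
        agent: round(top / score, 1)
        for agent, score in scores.items()
        if agent != leader
    }


print(ratio_to_leader(COMPOSITE, "Xnurta Insight Agent"))
```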

The foundation models fail because they struggle to pull and compute accurate metrics from raw data schemas. Even the strongest foundation model hallucinated or miscalculated data in roughly two-thirds of its responses. GPT 5.5 further struggled with systemic template-spamming, returning generic "Optimization Candidates" regardless of the prompt.

Conclusion & The Path Forward

The qualitative output of modern LLMs is no longer the bottleneck in AI-driven advertising. The true bottleneck is data retrieval and entity resolution. For brands and agencies making rapid financial decisions, the Xnurta Retail Media Insight Benchmark shows that conversational fluency must be backed by a domain-native data architecture.

Establishing an Industry Standard

Buyers deserve clarity, and this benchmark is only the first step. Xnurta is launching an AI Retail Media Council to collaborate with industry leaders, agencies, and brands. Our goal is to evolve this methodology into an open, industry-standard evaluation framework, so that the future of retail media AI is built on verifiable accuracy, not just marketing claims.