Abstract
Researchers developed a framework to measure the operational utility of individual retrieved items in retrieval-augmented generation systems by perturbing evidence and analyzing changes in correctness, grounding faithfulness, and confidence error.
As language models shift from single-shot answer generation toward multi-step reasoning that retrieves and consumes evidence mid-inference, evaluating the role of individual retrieved items becomes more important. Existing RAG evaluation typically targets final-answer quality, citation faithfulness, or answer-level attribution, but none of these directly targets the intervention-based, per-evidence-item utility view we study here. We introduce CUE-R, a lightweight intervention-based framework for measuring per-evidence-item operational utility in single-shot RAG using shallow observable retrieval-use traces. CUE-R perturbs individual evidence items via REMOVE, REPLACE, and DUPLICATE operators, then measures changes along three utility axes (correctness, proxy-based grounding faithfulness, and confidence error) plus a trace-divergence signal. We also outline an operational evidence-role taxonomy for interpreting intervention outcomes. Experiments on HotpotQA and 2WikiMultihopQA with Qwen-3 8B and GPT-5.2 reveal a consistent pattern: REMOVE and REPLACE substantially harm correctness and grounding while producing large trace shifts, whereas DUPLICATE is often answer-redundant yet not fully behaviorally neutral. A zero-retrieval control confirms that these effects arise from degradation of meaningful retrieval. A two-support ablation further shows that multi-hop evidence items can interact non-additively: removing both supports harms performance far more than either single removal. Our results suggest that answer-only evaluation misses important evidence effects and that intervention-based utility analysis is a practical complement for RAG evaluation.
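The REMOVE, REPLACE, and DUPLICATE interventions and the resulting per-item utility deltas described above can be sketched in a few lines. This is an illustrative toy, not the authors' code: the operator functions, the `toy_answer` model, and the 0/1 correctness metric are all assumptions made for the example.

```python
from typing import Callable, List

def remove(evidence: List[str], i: int) -> List[str]:
    """REMOVE operator: drop the i-th evidence item."""
    return evidence[:i] + evidence[i + 1:]

def replace(evidence: List[str], i: int, distractor: str) -> List[str]:
    """REPLACE operator: swap the i-th item for a distractor passage."""
    out = list(evidence)
    out[i] = distractor
    return out

def duplicate(evidence: List[str], i: int) -> List[str]:
    """DUPLICATE operator: repeat the i-th item in place."""
    return evidence[:i + 1] + [evidence[i]] + evidence[i + 1:]

def utility_delta(
    answer_fn: Callable[[List[str]], str],
    metric: Callable[[str], float],
    evidence: List[str],
    perturbed: List[str],
) -> float:
    """Drop in a utility metric (e.g. correctness) caused by an intervention."""
    return metric(answer_fn(evidence)) - metric(answer_fn(perturbed))

# Hypothetical stand-in for a RAG model: correct only if the key fact is retrieved.
def toy_answer(ev: List[str]) -> str:
    return "Paris" if any("capital of France" in e for e in ev) else "unknown"

def correct(ans: str) -> float:
    return 1.0 if ans == "Paris" else 0.0

ev = ["The capital of France is Paris.", "France is in Europe."]
print(utility_delta(toy_answer, correct, ev, remove(ev, 0)))     # 1.0: removing the support breaks the answer
print(utility_delta(toy_answer, correct, ev, duplicate(ev, 1)))  # 0.0: duplication is answer-neutral here
```

In the full framework the same perturb-and-diff loop is applied per evidence item across the other axes (grounding faithfulness, confidence error, trace divergence); this sketch only shows the correctness axis.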
Community
CUE-R introduces an intervention-based framework for evaluating retrieved evidence in RAG systems, going beyond final-answer accuracy to measure what each evidence item actually did once a reasoning model acted on it. We perturb evidence via REMOVE, REPLACE, and DUPLICATE operators and measure the effects on correctness, grounding, confidence error, and trace divergence. Key finding: duplicated evidence is often answer-neutral but behaviorally non-trivial, and multi-hop evidence items interact non-additively. Directly relevant to anyone building or evaluating agentic and reasoning-based LLM systems.
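The non-additive interaction between multi-hop supports can be illustrated with a toy answerer that recovers from either single support but fails only when both are removed. All names, facts, and the answerer below are hypothetical, not the paper's implementation; the point is the shape of the interaction term.

```python
from typing import List

def hop_answer(ev: List[str]) -> int:
    """Toy multi-hop model: either supporting fact alone lets it answer (1)."""
    has_a = any("born in Vienna" in e for e in ev)
    has_b = any("Vienna is in Austria" in e for e in ev)
    return 1 if (has_a or has_b) else 0

s1 = "The composer was born in Vienna."
s2 = "Vienna is in Austria."
full = [s1, s2, "An unrelated passage about Salzburg."]

base      = hop_answer(full)
drop_s1   = hop_answer([e for e in full if e != s1])
drop_s2   = hop_answer([e for e in full if e != s2])
drop_both = hop_answer([e for e in full if e not in (s1, s2)])

# Interaction = joint-removal harm minus the sum of single-removal harms.
# Positive means the supports are redundant individually but critical jointly.
interaction = (base - drop_both) - ((base - drop_s1) + (base - drop_s2))
print(base, drop_s1, drop_s2, drop_both, interaction)  # 1 1 1 0 1
```

A purely additive attribution would assign each support zero utility here, since neither single removal changes the answer; only the two-support ablation exposes their joint necessity, mirroring the pattern reported in the abstract.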
This is an automated message from the Librarian Bot. I found the following papers similar to this paper, recommended by the Semantic Scholar API:
- PAVE: Premise-Aware Validation and Editing for Retrieval-Augmented LLMs (2026)
- Beyond the Parameters: A Technical Survey of Contextual Enrichment in Large Language Models: From In-Context Prompting to Causal Retrieval-Augmented Generation (2026)
- PassiveQA: A Three-Action Framework for Epistemically Calibrated Question Answering via Supervised Finetuning (2026)
- PAR$^2$-RAG: Planned Active Retrieval and Reasoning for Multi-Hop Question Answering (2026)
- Reason and Verify: A Framework for Faithful Retrieval-Augmented Generation (2026)
- MC-Search: Evaluating and Enhancing Multimodal Agentic Search with Structured Long Reasoning Chains (2026)
- Hypothesis-Conditioned Query Rewriting for Decision-Useful Retrieval (2026)