Abstract
Researchers developed a framework to measure the operational utility of individual retrieved items in retrieval-augmented generation systems by perturbing evidence and analyzing changes in correctness, grounding faithfulness, and confidence error.
As language models shift from single-shot answer generation toward multi-step reasoning that retrieves and consumes evidence mid-inference, evaluating the role of individual retrieved items becomes more important. Existing RAG evaluation typically targets final-answer quality, citation faithfulness, or answer-level attribution, but none of these directly targets the intervention-based, per-evidence-item utility view we study here. We introduce CUE-R, a lightweight intervention-based framework for measuring per-evidence-item operational utility in single-shot RAG using shallow observable retrieval-use traces. CUE-R perturbs individual evidence items via REMOVE, REPLACE, and DUPLICATE operators, then measures changes along three utility axes (correctness, proxy-based grounding faithfulness, and confidence error) plus a trace-divergence signal. We also outline an operational evidence-role taxonomy for interpreting intervention outcomes. Experiments on HotpotQA and 2WikiMultihopQA with Qwen-3 8B and GPT-5.2 reveal a consistent pattern: REMOVE and REPLACE substantially harm correctness and grounding while producing large trace shifts, whereas DUPLICATE is often answer-redundant yet not fully behaviorally neutral. A zero-retrieval control confirms that these effects arise from degradation of meaningful retrieval. A two-support ablation further shows that multi-hop evidence items can interact non-additively: removing both supports harms performance far more than either single removal. Our results suggest that answer-only evaluation misses important evidence effects and that intervention-based utility analysis is a practical complement for RAG evaluation.
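The REMOVE, REPLACE, and DUPLICATE interventions and the resulting per-item utility deltas described above can be sketched in a few lines. This is an illustrative toy, not the authors' code: the operator functions, the `toy_answer` model, and the 0/1 correctness metric are all assumptions made for the example.

```python
from typing import Callable, List

def remove(evidence: List[str], i: int) -> List[str]:
    """REMOVE operator: drop the i-th evidence item."""
    return evidence[:i] + evidence[i + 1:]

def replace(evidence: List[str], i: int, distractor: str) -> List[str]:
    """REPLACE operator: swap the i-th item for a distractor passage."""
    out = list(evidence)
    out[i] = distractor
    return out

def duplicate(evidence: List[str], i: int) -> List[str]:
    """DUPLICATE operator: repeat the i-th item in place."""
    return evidence[:i + 1] + [evidence[i]] + evidence[i + 1:]

def utility_delta(
    answer_fn: Callable[[List[str]], str],
    metric: Callable[[str], float],
    evidence: List[str],
    perturbed: List[str],
) -> float:
    """Drop in a utility metric (e.g. correctness) caused by an intervention."""
    return metric(answer_fn(evidence)) - metric(answer_fn(perturbed))

# Hypothetical stand-in for a RAG model: correct only if the key fact is retrieved.
def toy_answer(ev: List[str]) -> str:
    return "Paris" if any("capital of France" in e for e in ev) else "unknown"

def correct(ans: str) -> float:
    return 1.0 if ans == "Paris" else 0.0

ev = ["The capital of France is Paris.", "France is in Europe."]
print(utility_delta(toy_answer, correct, ev, remove(ev, 0)))     # 1.0: removing the support breaks the answer
print(utility_delta(toy_answer, correct, ev, duplicate(ev, 1)))  # 0.0: duplication is answer-neutral here
```

In the full framework the same perturb-and-diff loop is applied per evidence item across the other axes (grounding faithfulness, confidence error, trace divergence); this sketch only shows the correctness axis.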
Community
CUE-R introduces an intervention-based framework for evaluating retrieved evidence in RAG systems, going beyond final-answer accuracy to measure what each evidence item actually did once a reasoning model acted on it. We perturb evidence via REMOVE, REPLACE, and DUPLICATE operators and measure the effects on correctness, grounding, confidence error, and trace divergence. Key finding: duplicated evidence is often answer-neutral but behaviorally non-trivial, and multi-hop evidence items interact non-additively. Directly relevant to anyone building or evaluating agentic and reasoning-based LLM systems.
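The non-additive interaction between multi-hop supports can be illustrated with a toy answerer that recovers from either single support but fails only when both are removed. All names, facts, and the answerer below are hypothetical, not the paper's implementation; the point is the shape of the interaction term.

```python
from typing import List

def hop_answer(ev: List[str]) -> int:
    """Toy multi-hop model: either supporting fact alone lets it answer (1)."""
    has_a = any("born in Vienna" in e for e in ev)
    has_b = any("Vienna is in Austria" in e for e in ev)
    return 1 if (has_a or has_b) else 0

s1 = "The composer was born in Vienna."
s2 = "Vienna is in Austria."
full = [s1, s2, "An unrelated passage about Salzburg."]

base      = hop_answer(full)
drop_s1   = hop_answer([e for e in full if e != s1])
drop_s2   = hop_answer([e for e in full if e != s2])
drop_both = hop_answer([e for e in full if e not in (s1, s2)])

# Interaction = joint-removal harm minus the sum of single-removal harms.
# Positive means the supports are redundant individually but critical jointly.
interaction = (base - drop_both) - ((base - drop_s1) + (base - drop_s2))
print(base, drop_s1, drop_s2, drop_both, interaction)  # 1 1 1 0 1
```

A purely additive attribution would assign each support zero utility here, since neither single removal changes the answer; only the two-support ablation exposes their joint necessity, mirroring the pattern reported in the abstract.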
This is an automated message from the Librarian Bot. I found the following papers similar to this paper, recommended by the Semantic Scholar API:
- PAVE: Premise-Aware Validation and Editing for Retrieval-Augmented LLMs (2026)
- Beyond the Parameters: A Technical Survey of Contextual Enrichment in Large Language Models: From In-Context Prompting to Causal Retrieval-Augmented Generation (2026)
- PassiveQA: A Three-Action Framework for Epistemically Calibrated Question Answering via Supervised Finetuning (2026)
- PAR$^2$-RAG: Planned Active Retrieval and Reasoning for Multi-Hop Question Answering (2026)
- Reason and Verify: A Framework for Faithful Retrieval-Augmented Generation (2026)
- MC-Search: Evaluating and Enhancing Multimodal Agentic Search with Structured Long Reasoning Chains (2026)
- Hypothesis-Conditioned Query Rewriting for Decision-Useful Retrieval (2026)