Abstract
A task-conditioned tool-output pruning model achieves high recall and F1 scores while dramatically reducing input token consumption compared to zero-shot and heuristic baselines.
Coding agents repeatedly consume long tool observations even though only a small fraction of each observation matters for the next step. We study task-conditioned tool-output pruning: given a focused query and one tool output, return the smallest verbatim evidence block the agent should inspect next. We introduce a benchmark of 11,477 examples built from SWE-bench repository interactions and synthetic multi-ecosystem tool outputs, with a manually curated 618-example test set. We fine-tune Qwen 3.5 2B with LoRA and compare it against larger zero-shot models and heuristic pruning baselines. Our model reaches 0.86 recall and 0.80 F1 while removing 92% of input tokens, outperforming zero-shot Qwen 3.5 35B A3B by 11 recall points and all heuristic baselines by a wide margin.
Community
Coding agents can waste most of their context window re-reading noisy tool output. Squeez is a LoRA-tuned Qwen 3.5 2B model that extracts only the relevant lines from raw tool observations (pytest, grep, git log, kubectl, build logs, etc.), removing 92% of tokens while retaining 0.86 recall. We publish the dataset of 11,477 examples spanning 27 tool types. The model works as a CLI pipe (cat output.txt | squeez "find the bug"), a Python library, or a vLLM server, and you can drop it into any coding agent with one line of config. Model, dataset, and code are all Apache 2.0.
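The task contract described above (a focused query plus one raw tool output in, the smallest verbatim evidence block out) can be sketched as a thin wrapper around any of the three deployment modes. This is a minimal sketch: the prompt template, the `generate` callable, and the verbatim check are assumptions for illustration, not the published interface.

```python
def build_prompt(query: str, tool_output: str) -> str:
    # Assumed prompt shape; the released model defines the real template.
    return (
        f"Query: {query}\n\n"
        f"Tool output:\n{tool_output}\n\n"
        "Return the smallest relevant evidence block, copied verbatim."
    )

def is_verbatim(evidence: str, tool_output: str) -> bool:
    # The task requires the evidence to be an exact contiguous excerpt
    # of the raw observation, never a paraphrase.
    block = evidence.strip()
    return bool(block) and block in tool_output

def prune(query: str, tool_output: str, generate) -> str:
    # `generate` is any callable wrapping the model (CLI pipe, Python
    # library, or a vLLM server client); hypothetical here.
    evidence = generate(build_prompt(query, tool_output))
    if not is_verbatim(evidence, tool_output):
        # Fail closed: if the model paraphrased, keep the raw output.
        raise ValueError("model returned non-verbatim evidence")
    return evidence.strip()
```

Checking that the returned block is a substring of the original observation is what lets an agent trust the pruned context: nothing the model emits can be hallucinated evidence.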