arxiv:2604.17308

SkillFlow: Benchmarking Lifelong Skill Discovery and Evolution for Autonomous Agents

Published on Apr 19 · Submitted by Yu Zeng on Apr 21

Abstract

SkillFlow presents a benchmark for evaluating autonomous agents' ability to discover, repair, and maintain skills over time through a structured lifelong learning protocol.

AI-generated summary

As the capability frontier of autonomous agents continues to expand, they are increasingly able to complete specialized tasks through plug-and-play external skills. Yet current benchmarks mostly test whether models can use provided skills, leaving open whether they can discover skills from experience, repair them after failure, and maintain a coherent library over time. We introduce SkillFlow, a benchmark of 166 tasks across 20 families in which task construction within each family follows a Domain-Agnostic Execution Flow (DAEF) that defines an agent workflow framework, allowing these tasks to share a consistent workflow. Agents are evaluated under an Agentic Lifelong Learning protocol in which they begin without skills, solve tasks sequentially within each family, externalize lessons through trajectory- and rubric-driven skill patches, and carry the updated library forward. Experiments reveal a substantial capability gap. For Claude Opus 4.6, lifelong skill evolution improves task success from 62.65% to 71.08% (+8.43 points). However, high skill usage does not necessarily imply high utility: Kimi K2.5 gains only +0.60 points despite 66.87% skill usage, while Qwen-Coder-Next reaches only a 44.58% task completion rate and still regresses relative to the vanilla setting. SkillFlow contributes a structured testbed for this direction and an in-depth empirical analysis of skill discovery, patching, transfer, and their failure modes under lifelong evaluation.
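The reported percentages can be sanity-checked against the benchmark size. The sketch below assumes that every success rate is computed over all 166 tasks, which the abstract does not state explicitly; `implied_count` and `rate` are hypothetical helper names introduced only for this check.

```python
# Assumption: all success rates are fractions of the full 166-task benchmark.
TOTAL_TASKS = 166


def implied_count(rate_percent: float, total: int = TOTAL_TASKS) -> int:
    """Nearest whole number of tasks corresponding to a percentage."""
    return round(rate_percent / 100 * total)


def rate(count: int, total: int = TOTAL_TASKS) -> float:
    """Percentage of `total` represented by `count`, to two decimals."""
    return round(100 * count / total, 2)


# Claude Opus 4.6: vanilla vs. lifelong skill evolution.
vanilla = implied_count(62.65)
evolved = implied_count(71.08)
gain_in_tasks = evolved - vanilla

print(vanilla, evolved, gain_in_tasks)
print(rate(vanilla), rate(evolved), rate(gain_in_tasks))
```

Under that assumption, the +8.43-point gain corresponds to 14 additional solved tasks (104 to 118), and a +0.60-point gain corresponds to a single task.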

Community

Paper author Paper submitter

We propose SkillFlow, a benchmark for studying whether agents can summarize skills from experience and then continuously iterate on and reuse a skill library over time. SkillFlow contains 20 task families, each comprising 8-9 tasks that share the same underlying workflow. Starting from an empty skill library, an agent solves the tasks in each family sequentially in difficulty order, generating or repairing skills after each task for use in subsequent ones. This setup evaluates whether a model can continuously evolve its skill library and effectively reuse it across related tasks.
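The per-family loop described above can be sketched roughly as follows. This is a minimal illustration and not the benchmark's actual harness: `Task`, `SkillLibrary`, `run_family`, `toy_solve`, and `toy_patch` are all hypothetical names, ascending difficulty is an assumption, and the toy agent exists only to make the example runnable.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class Task:
    name: str
    difficulty: int  # assumed: smaller means easier


@dataclass
class SkillLibrary:
    skills: Dict[str, str] = field(default_factory=dict)


# Hypothetical agent interfaces: solve returns (success, trajectory);
# propose_patch returns skills to add or overwrite in the library.
SolveFn = Callable[[Task, SkillLibrary], Tuple[bool, str]]
PatchFn = Callable[[Task, bool, str, SkillLibrary], Dict[str, str]]


def run_family(
    tasks: List[Task], solve: SolveFn, propose_patch: PatchFn
) -> Tuple[List[bool], SkillLibrary]:
    """Run one task family, carrying the skill library forward."""
    library = SkillLibrary()  # the agent begins without skills
    results: List[bool] = []
    for task in sorted(tasks, key=lambda t: t.difficulty):
        success, trajectory = solve(task, library)
        results.append(success)
        # Generate or repair skills after each task, success or failure.
        library.skills.update(propose_patch(task, success, trajectory, library))
    return results, library


def toy_solve(task: Task, library: SkillLibrary) -> Tuple[bool, str]:
    # Toy rule: each stored skill unlocks one more difficulty level.
    ok = task.difficulty <= 1 + len(library.skills)
    return ok, f"attempted {task.name}"


def toy_patch(
    task: Task, success: bool, trajectory: str, library: SkillLibrary
) -> Dict[str, str]:
    # Toy rule: record one lesson per task.
    return {f"skill_{task.name}": trajectory}


family = [Task("c", 3), Task("a", 1), Task("b", 2)]
results, library = run_family(family, toy_solve, toy_patch)
print(results, sorted(library.skills))
```

With the toy agent, later tasks succeed only because skills written after earlier tasks are carried forward, which is the behavior the protocol is designed to measure.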


