SKILL0: In-Context Agentic Reinforcement Learning for Skill Internalization

1. What is it?

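The paper proposes SKILL0, a new reinforcement learning framework that internalizes an LLM agent's "skills" directly into the model's parameters.
It targets three limitations of conventional inference-time skill augmentation (in-context augmentation): retrieval noise, token overhead, and the fact that the model never truly acquires the knowledge it merely follows.
The end goal is zero-shot autonomous behavior at runtime, with no skill retrieval at all.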

2. What makes it stand out compared with prior work?

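Conventional LLM agents extend their capabilities by supplying skills as context at inference time. SKILL0 instead trains the skills into the model's parameters (internalization), aiming to remove these problems at their root.
What is particularly new is the "dynamic curriculum" designed to drive genuine skill acquisition: training starts with the full skill context and gradually withdraws it.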

3. What is the key to the technique?

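**In-Context Agentic Reinforcement Learning (ICARL) framework**: the reinforcement learning foundation for skill internalization.
**Training-time curriculum**: training begins with the full skill context and progressively reduces it, pushing the model to use the skills on its own.
**Skill grouping and visual context**: skills are grouped offline by category and rendered, together with the interaction history, into a compact visual context. This teaches the model how to invoke tools and complete multi-turn tasks.
**Dynamic Curriculum**: evaluates each skill file's on-policy helpfulness and retains only the skills from which the current policy still benefits, within a linearly decaying budget. Eventually the agent operates in a fully zero-shot setting.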
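The Dynamic Curriculum step can be sketched as follows. This is a minimal illustration under stated assumptions, not the SKILL0 implementation: "helpfulness" is assumed to be the difference in mean rollout reward with versus without a skill file in context, the budget is assumed to count skill files and to decay linearly to zero over a fixed number of stages, and the function names (`linear_budget`, `helpfulness`, `select_skills`) and skill names are hypothetical.

```python
from typing import Dict, List, Sequence


def linear_budget(stage: int, num_stages: int, max_skills: int) -> int:
    """Hypothetical skill budget: max_skills at the first stage, 0 at the last stage."""
    if num_stages < 2:
        return 0
    remaining = num_stages - 1 - stage
    # Integer arithmetic keeps the linear schedule exact (no float rounding).
    return max(0, max_skills * remaining // (num_stages - 1))


def helpfulness(with_skill: Sequence[float], without_skill: Sequence[float]) -> float:
    """Assumed on-policy helpfulness: mean reward with the skill minus mean reward without it."""
    return sum(with_skill) / len(with_skill) - sum(without_skill) / len(without_skill)


def select_skills(scores: Dict[str, float], budget: int) -> List[str]:
    """Keep only skills the current policy still benefits from (score > 0), best first."""
    helpful = [name for name, score in scores.items() if score > 0]
    # Sort by descending score; break ties by name for a deterministic order.
    helpful.sort(key=lambda name: (-scores[name], name))
    return helpful[:budget]


if __name__ == "__main__":
    # Toy rollout rewards (1 = task solved) with and without each skill; illustrative only.
    rollouts = {
        "navigate": ([1, 1, 1, 0], [0, 1, 0, 0]),
        "heat": ([1, 1, 0, 0], [1, 0, 0, 0]),
        "clean": ([0, 1, 0, 0], [1, 1, 0, 0]),
    }
    # In an actual curriculum, scores would be re-measured with the current policy
    # at every stage; fixed scores are reused here to keep the demo short.
    scores = {name: helpfulness(w, wo) for name, (w, wo) in rollouts.items()}
    for stage in range(4):
        budget = linear_budget(stage, 4, len(scores))
        print(stage, budget, select_skills(scores, budget))
```

In this toy run, `clean` is never kept because it lowers reward, and the retained set shrinks from two skills to none as the budget falls from 3 to 0, mirroring the move toward a fully zero-shot setting.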

4. How was it validated?

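Extensive experiments were run on two representative agentic task benchmarks, ALFWorld and Search-QA.
SKILL0 substantially outperforms the standard RL baseline: +9.7% on ALFWorld and +6.6% on Search-QA.
These gains come while keeping the context highly efficient, at fewer than 0.5k tokens per step.
The implementation is publicly available on GitHub (https://github.com/ZJU-REAL/SkillZero).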

5. Is there any discussion?

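The abstract does not discuss limitations directly. Plausible open questions for skill internalization include how complex and how many skills can be internalized, and how well internalized skills generalize to new environments.
The paper emphasizes that it achieves both efficiency (low token consumption) and performance gains, which marks a step toward practical use.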

6. Which papers should be read next?

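Foundational papers on in-context learning and tool use in LLM agents (e.g., ReAct, Reflexion).
Work on curriculum learning and task-difficulty scheduling in reinforcement learning.
Recent work on knowledge acquisition and generalization in LLMs.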

Abstract (original)

Agent skills, structured packages of procedural knowledge and executable resources that agents dynamically load at inference time, have become a reliable mechanism for augmenting LLM agents. Yet inference-time skill augmentation is fundamentally limited: retrieval noise introduces irrelevant guidance, injected skill content imposes substantial token overhead, and the model never truly acquires the knowledge it merely follows. We ask whether skills can instead be internalized into model parameters, enabling zero-shot autonomous behavior without any runtime skill retrieval. We introduce SKILL0, an in-context reinforcement learning framework designed for skill internalization. SKILL0 introduces a training-time curriculum that begins with full skill context and progressively withdraws it. Skills are grouped offline by category and rendered with interaction history into a compact visual context, teaching the model tool invocation and multi-turn task completion. A Dynamic Curriculum then evaluates each skill file's on-policy helpfulness, retaining only those from which the current policy still benefits within a linearly decaying budget, until the agent operates in a fully zero-shot setting. Extensive agentic experiments demonstrate that SKILL0 achieves substantial improvements over the standard RL baseline (+9.7% for ALFWorld and +6.6% for Search-QA), while maintaining a highly efficient context of fewer than 0.5k tokens per step. Our code is available at https://github.com/ZJU-REAL/SkillZero.
