Complementary Reinforcement Learning

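1. What is it?

The paper proposes Complementary RL, a new paradigm that addresses the low sample efficiency of reinforcement learning for LLM-based agents.
Inspired by complementary learning systems in neuroscience, it has an experience extractor and a policy actor co-evolve seamlessly within the RL optimization loop.
The goal is to let the agent make effective use of past experience across episodes and thereby learn more efficiently.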

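2. What makes it better than prior work?

It addresses a weakness of existing approaches to exploiting experience.
In existing methods, the experience distilled from history is either stored statically or fails to co-evolve with the improving actor. A mismatch therefore grows between the experience and the actor's capability, and the experience becomes less useful as training progresses.
In Complementary RL, the experience extractor is optimized according to how much its experiences contribute to the actor's success. Its experience management strategy thus evolves in step with the actor's growing capability, with the aim of keeping the experience useful throughout training.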

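3. What is the core of the technique?

The key idea is to bring the neuroscience concept of complementary learning systems into RL: two learning systems with different roles (in the neuroscience account, a fast-learning hippocampal system and a slow-learning neocortical system) cooperate to learn efficiently.
The central mechanism is the co-evolution of the experience extractor and the policy actor.
The policy actor is optimized with sparse outcome-based rewards, as in conventional RL.
The experience extractor is optimized according to whether the experiences it distills contribute to the actor's success, so its ability to extract and manage useful experience improves as the actor evolves.
The two components learn simultaneously inside the RL optimization loop, each influencing the other.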
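
The co-evolution loop described above can be illustrated with a toy sketch. This is not the paper's algorithm, which the abstract does not specify. It assumes tabular softmax policies, plain REINFORCE updates, a shared running baseline, and a made-up task in which the extractor sees the task, surfaces one "hint" (standing in for a distilled experience), and the actor sees only that hint. All names (`train`, `reinforce_step`, `num_hints`) are hypothetical.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample(probs, rng):
    """Draw an index from a categorical distribution."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def reinforce_step(logits, probs, chosen, advantage, lr):
    """REINFORCE for a softmax policy: grad of log pi is (one-hot - probs)."""
    for i in range(len(logits)):
        grad = (1.0 if i == chosen else 0.0) - probs[i]
        logits[i] += lr * advantage * grad


def train(num_tasks=3, num_hints=3, episodes=5000, lr=0.2, seed=0):
    rng = random.Random(seed)
    # Extractor (assumed form): per task, logits over which hint to surface.
    extractor = [[0.0] * num_hints for _ in range(num_tasks)]
    # Actor (assumed form): per hint, logits over actions; it never sees the task.
    actor = [[0.0] * num_tasks for _ in range(num_hints)]
    baseline = 0.0  # running mean of the outcome reward, used for variance reduction
    history = []
    for _ in range(episodes):
        task = rng.randrange(num_tasks)
        hint_probs = softmax(extractor[task])
        hint = sample(hint_probs, rng)
        action_probs = softmax(actor[hint])
        action = sample(action_probs, rng)
        # Sparse outcome-based reward: 1 only if the episode succeeds.
        reward = 1.0 if action == task else 0.0
        advantage = reward - baseline
        # The actor is updated from the outcome reward.
        reinforce_step(actor[hint], action_probs, action, advantage, lr)
        # The extractor is credited when the hint it surfaced preceded a success.
        reinforce_step(extractor[task], hint_probs, hint, advantage, lr)
        baseline += 0.01 * (reward - baseline)
        history.append(reward)
    return extractor, actor, history


if __name__ == "__main__":
    _, _, history = train()
    print("success rate over the last 500 episodes:", sum(history[-500:]) / 500)
```

Because both tables are updated in the same loop, a hint is only useful to the extent that the current actor knows how to act on it, which is the coupling the summary calls co-evolution. Whether this toy converges depends on the seed and hyperparameters.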

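4. How was it validated?

Empirically, Complementary RL outperforms outcome-based agentic RL baselines that do not learn from experience.
In single-task scenarios it achieves a 10% performance improvement (the abstract does not say whether this figure is absolute or relative, or which benchmark it refers to).
In multi-task settings it exhibits robust scalability.
The authors present these results as establishing Complementary RL as a paradigm for efficient experience-driven agent learning.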

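5. Is there any discussion?

No explicit discussion points or limitations can be read from the abstract. However, the details of how the extractor's "contribution" is quantified and incorporated into optimization could affect implementation complexity and performance.
The theoretical rigor of applying a neuroscience concept to RL, and the limits of where it applies, may be discussed in the full paper.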
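
To make the question about quantifying "contribution" concrete, here is one possible estimator, offered purely as an assumption: compare the actor's success rate on rollouts that were given an experience against rollouts on the same task without it. The abstract does not say the paper does this, and the function name `contribution_score` is hypothetical.

```python
def contribution_score(outcomes_with, outcomes_without):
    """Success-rate gap between rollouts with and without an experience.

    Each argument is a non-empty sequence of 0/1 episode outcomes.
    A positive score means the experience coincided with more successes.
    """
    if not outcomes_with or not outcomes_without:
        raise ValueError("both rollout groups must be non-empty")
    rate_with = sum(outcomes_with) / len(outcomes_with)
    rate_without = sum(outcomes_without) / len(outcomes_without)
    return rate_with - rate_without


if __name__ == "__main__":
    # Hypothetical outcomes: 3 of 4 successes with the experience, 1 of 4 without.
    print(contribution_score([1, 1, 0, 1], [0, 1, 0, 0]))
```

Even this simple estimator shows where the implementation cost comes from: it needs extra rollouts without the experience, and with sparse outcomes the gap is noisy unless each group is reasonably large.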

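6. Which papers should be read next?

The original and review papers on complementary learning systems in neuroscience (e.g., McClelland et al., 1995).
Papers on related approaches to LLM-based agents, such as ReAct, chain-of-thought (CoT) prompting, and Reflexion. Note that ReAct and CoT are prompting methods rather than RL algorithms, and Reflexion relies on verbal self-feedback rather than weight updates, so they are points of comparison rather than RL methods for sample efficiency.
Papers on experience management strategies in experience replay and offline RL, especially those on dynamic experience selection and prioritization.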

Abstract (original)

Reinforcement Learning (RL) has emerged as a powerful paradigm for training LLM-based agents, yet remains limited by low sample efficiency, stemming not only from sparse outcome feedback but also from the agent's inability to leverage prior experience across episodes. While augmenting agents with historical experience offers a promising remedy, existing approaches suffer from a critical weakness: the experience distilled from history is either stored statically or fail to coevolve with the improving actor, causing a progressive misalignment between the experience and the actor's evolving capability that diminishes its utility over the course of training. Inspired by complementary learning systems in neuroscience, we present Complementary RL to achieve seamless co-evolution of an experience extractor and a policy actor within the RL optimization loop. Specifically, the actor is optimized via sparse outcome-based rewards, while the experience extractor is optimized according to whether its distilled experiences demonstrably contribute to the actor's success, thereby evolving its experience management strategy in lockstep with the actor's growing capabilities. Empirically, Complementary RL outperforms outcome-based agentic RL baselines that do not learn from experience, achieving 10% performance improvement in single-task scenarios and exhibits robust scalability in multi-task settings. These results establish Complementary RL as a paradigm for efficient experience-driven agent learning.
