ClawBench: Can AI Agents Complete Everyday Online Tasks?

1. What is it?

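This paper proposes ClawBench, a new benchmark for evaluating how well AI agents can carry out everyday online tasks.
It consists of 153 simple tasks that people need to perform regularly in their daily lives and work.
The tasks span 144 live platforms across 15 categories, including completing purchases, booking appointments, and submitting job applications.
They demand advanced, realistic capabilities: retrieving information from user-provided documents, navigating multi-step workflows across diverse platforms, and filling in detailed forms.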

2. What is impressive compared with prior work?

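Existing benchmarks evaluate agents in offline sandbox environments or on static pages, whereas ClawBench runs on **production websites**.
This lets the evaluation preserve the full complexity, dynamic nature, and challenges of real-world web interaction.
A lightweight interception layer captures and blocks only the final submission request, which ensures safe evaluation without real-world side effects.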

3. What is the key to the technique or method?

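**Building the ClawBench framework:** The authors curate 153 everyday online tasks and the 144 live platforms they run on, and organize them into evaluation scenarios.
**Evaluation on production websites:** Dynamic content, JavaScript execution, API calls, and the other complex behavior of real websites are evaluated as-is.
**Safe evaluation mechanism:** Only the final submission request that the agent issues when it judges the task complete (for example, clicking a purchase button or submitting a form) is intercepted and blocked, so task success can be judged without any real transaction or data submission taking place.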
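
The interception idea can be illustrated with a small, self-contained simulation. This is only a sketch under stated assumptions: the summary does not describe how ClawBench identifies the final request, so the `Request` model, the `InterceptionLayer` class, the `submit_path` rule, and the `shop.example` URLs are all hypothetical. The point it shows is that ordinary traffic is forwarded to the live site while the single final submission is recorded and withheld.

```python
from dataclasses import dataclass, field
from urllib.parse import urlparse


@dataclass
class Request:
    """A simplified outgoing HTTP request (hypothetical model, not a browser API)."""
    method: str
    url: str
    body: dict


@dataclass
class InterceptionLayer:
    """Blocks only requests that match the task's final-submission rule.

    `submit_path` is an assumed per-task setting; the source does not say
    how the final request is actually recognized.
    """
    submit_path: str
    captured: list = field(default_factory=list)

    def is_final_submission(self, request: Request) -> bool:
        # Assumption: the final submission is a POST to one known path.
        path = urlparse(request.url).path
        return request.method.upper() == "POST" and path == self.submit_path

    def handle(self, request: Request) -> str:
        if self.is_final_submission(request):
            # Keep the payload for later scoring and do not forward it.
            self.captured.append(request)
            return "blocked"
        # Everything else (page loads, searches, cart updates) goes through.
        return "forwarded"


def demo():
    """Run three illustrative requests through the layer."""
    layer = InterceptionLayer(submit_path="/checkout/confirm")
    traffic = [
        Request("GET", "https://shop.example/search?q=kettle", {}),
        Request("POST", "https://shop.example/cart/add", {"sku": "K-1"}),
        Request("POST", "https://shop.example/checkout/confirm", {"sku": "K-1", "qty": 1}),
    ]
    outcomes = [layer.handle(r) for r in traffic]
    return outcomes, layer.captured


if __name__ == "__main__":
    outcomes, captured = demo()
    print(outcomes)
    print(captured[0].body)
```

In this sketch the captured payload is what a grader would inspect afterwards, for example to check that the submitted form fields match the task instructions. How ClawBench scores the captured request is not stated in the summary.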

4. How was it validated?

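Seven frontier AI models, both proprietary and open-source, were evaluated on ClawBench.
The evaluation showed that these models can complete only a small portion of the tasks. For example, Claude Sonnet 4.6 completed only 33.3% of them.
These results indicate that ClawBench is a useful evaluation tool: it exposes the limits of current AI agents and points to directions for future research toward AI agents that work as reliable general-purpose assistants.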
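
To put the reported percentage in terms of task counts, here is a minimal sketch. It assumes the score is a plain fraction of tasks passed, with every task weighted equally and judged pass/fail; the summary does not confirm this scoring rule. The function name `success_rate` is hypothetical, and the count of 51 is inferred from 33.3% of 153 rather than reported in the source.

```python
def success_rate(passed: int, total: int) -> float:
    """Percentage of tasks completed, assuming equal-weight pass/fail scoring."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * passed / total


TOTAL_TASKS = 153      # task count stated in the source
INFERRED_PASSED = 51   # assumption: inferred from 33.3% of 153, not reported

if __name__ == "__main__":
    print(f"{success_rate(INFERRED_PASSED, TOTAL_TASKS):.1f}%")  # prints 33.3%
```

Under that assumption, the reported figure corresponds to roughly one task in three being completed.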

5. Is there any discussion?

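The finding that even current frontier AI models complete only a small portion of everyday online tasks points to a major challenge for putting AI agents into practical use.
It highlights that agents still fall short in handling the diversity of websites, adapting to dynamic changes, carrying out complex multi-step reasoning, and entering large amounts of text accurately.
The coverage of ClawBench's tasks, its difficulty calibration, and the objectivity of its evaluation criteria may leave room for further discussion and improvement.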

6. Which papers should be read next?

WebArena: A Realistic Web Environment for Building Autonomous Agents
Reinforcement Learning on Web Interfaces Using Workflow-Guided Exploration (the paper that introduced MiniWoB++)
Recent research papers on planning, memory, and reasoning in general-purpose AI agents

Abstract (original)

AI agents may be able to automate your inbox, but can they automate other routine aspects of your life? Everyday online tasks offer a realistic yet unsolved testbed for evaluating the next generation of AI agents. To this end, we introduce ClawBench, an evaluation framework of 153 simple tasks that people need to accomplish regularly in their lives and work, spanning 144 live platforms across 15 categories, from completing purchases and booking appointments to submitting job applications. These tasks require demanding capabilities beyond existing benchmarks, such as obtaining relevant information from user-provided documents, navigating multi-step workflows across diverse platforms, and write-heavy operations like filling in many detailed forms correctly. Unlike existing benchmarks that evaluate agents in offline sandboxes with static pages, ClawBench operates on production websites, preserving the full complexity, dynamic nature, and challenges of real-world web interaction. A lightweight interception layer captures and blocks only the final submission request, ensuring safe evaluation without real-world side effects. Our evaluations of 7 frontier models show that both proprietary and open-source models can complete only a small portion of these tasks. For example, Claude Sonnet 4.6 achieves only 33.3%. Progress on ClawBench brings us closer to AI agents that can function as reliable general-purpose assistants.
