
CUA-Suite: Massive Human-annotated Video Demonstrations for Computer-Use Agents 💻 Code available

Xiangru Jian, Shravan Nayak, Kevin Qinghong Lin, Aarash Feizi, Kaixin Li et al. · computer-use agents, expert video demonstrations, continuous screen recordings · 2026-03-25 ⭐ 9/10

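💡 To speed up progress toward general-purpose computer-use agents, the authors release CUA-Suite, a dataset of large-scale continuous human demonstration videos with detailed annotations, and use it to expose the limits of existing models.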

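🤖 From Ayumu: This one is seriously interesting! Desktop automation has a lot of promise, but data was the bottleneck. 55 hours of continuous video demonstrations is a huge amount, and having cursor traces and reasoning annotations on top of that is wild. If today's models fail on about 60% of tasks, this dataset could be the key to a breakthrough. 朋義さん, this kind of practical AI progress gets you excited too, right?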

Computer-Use Agents Video Demonstrations Large-scale Dataset UI Automation Expert Annotations Foundation Models

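1. What is it?

CUA-Suite is a large-scale ecosystem of expert video demonstrations and dense annotations, aimed at general-purpose computer-use agents (CUAs).
It addresses the scarcity of continuous, high-quality human demonstration videos in existing datasets.
Main components:
VideoCUA: approximately 10,000 human-demonstrated tasks across 87 diverse applications, with continuous 30 fps screen recordings (approximately 55 hours, 6 million frames), kinematic cursor traces, and multi-layered reasoning annotations.
UI-Vision: a rigorous benchmark for evaluating the grounding and planning capabilities of CUAs.
GroundCUA: a large-scale grounding dataset with 56,000 annotated screenshots and over 3.6 million UI element annotations.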
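
The headline figures above can be sanity-checked with simple arithmetic. The sketch below uses only the approximate numbers quoted in this summary; the per-task and per-screenshot averages are derived here for illustration and are not reported by the paper.

```python
# Approximate figures quoted in the summary (the source marks them as approximate).
HOURS = 55
FPS = 30
TASKS = 10_000
SCREENSHOTS = 56_000
UI_ELEMENTS = 3_600_000


def total_frames(hours: float, fps: int) -> int:
    """Number of frames in a continuous recording of the given length."""
    return int(hours * 3600 * fps)


frames = total_frames(HOURS, FPS)
# Derived averages (illustrative only, not reported in the paper).
seconds_per_task = HOURS * 3600 / TASKS
elements_per_screenshot = UI_ELEMENTS / SCREENSHOTS

print(frames)                             # 5940000
print(round(seconds_per_task, 1))         # 19.8
print(round(elements_per_screenshot, 1))  # 64.3
```

55 hours at 30 fps gives 5.94 million frames, which matches the "about 6 million" figure.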

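2. What is impressive compared with prior work?

Continuous video: ScaleCUA, the largest existing open dataset, contains 2 million screenshots, equivalent to less than 20 hours of video, whereas CUA-Suite provides about 55 hours (6 million frames) of continuous video.
Building on the recent finding that continuous video is the critical missing ingredient for scaling CUAs, it preserves the full temporal dynamics of human interaction, unlike datasets that contain only sparse screenshots or final click coordinates.
Richer information: the continuous video streams form a superset of information that can be losslessly transformed into the formats required by existing agent frameworks.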
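
The "less than 20 hours" equivalence for ScaleCUA can be reproduced as follows. This sketch assumes the screenshots are converted at 30 fps, the same rate as VideoCUA; the source states only the resulting bound, so the conversion rate is an assumption.

```python
def screenshots_to_hours(n_screenshots: int, fps: int = 30) -> float:
    """Hours of video that n_screenshots would fill at the given frame rate.

    The 30 fps default is an assumption made for this comparison.
    """
    return n_screenshots / fps / 3600


scalecua_hours = screenshots_to_hours(2_000_000)
ratio = 55 / scalecua_hours  # CUA-Suite hours relative to ScaleCUA

print(round(scalecua_hours, 1))  # 18.5
print(round(ratio, 2))           # 2.97
```

Under that assumption, CUA-Suite offers roughly three times as much footage by duration.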

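3. What is the core of the technique or method?

Large-scale, high-quality dataset construction: expert demonstrations are collected and annotated in a rich format that combines continuous 30 fps screen recordings, kinematic cursor traces, and multi-layered reasoning annotations.
Preserving temporal dynamics: the emphasis is on fully capturing the continuity and timing of human operation, which sparse datasets lose.
Delivered as an ecosystem: beyond VideoCUA, pairing an evaluation benchmark (UI-Vision) with a grounding dataset (GroundCUA) gives CUA research a comprehensive foundation.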
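
To illustrate why a dense cursor trace contains everything a sparse click log does, here is a minimal sketch that reduces a trace to click records. The `TraceSample` schema and the `extract_clicks` helper are hypothetical and invented for this example; they are not VideoCUA's actual data format or the authors' conversion code.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class TraceSample:
    """One cursor sample (hypothetical schema, not VideoCUA's format)."""
    t: float           # seconds since the start of the recording
    x: int             # cursor x position in pixels
    y: int             # cursor y position in pixels
    button_down: bool  # whether the primary button is pressed


def extract_clicks(trace: List[TraceSample]) -> List[Tuple[float, int, int]]:
    """Keep only the samples where the button goes from up to down."""
    clicks = []
    prev_down = False
    for sample in trace:
        if sample.button_down and not prev_down:
            clicks.append((sample.t, sample.x, sample.y))
        prev_down = sample.button_down
    return clicks


trace = [
    TraceSample(0.000, 100, 100, False),
    TraceSample(0.033, 140, 120, False),
    TraceSample(0.067, 180, 140, True),   # press begins: first click
    TraceSample(0.100, 180, 140, True),   # still held, not a new click
    TraceSample(0.133, 200, 150, False),
    TraceSample(0.167, 300, 220, True),   # second click
]
clicks = extract_clicks(trace)
print(clicks)  # [(0.067, 180, 140), (0.167, 300, 220)]
```

The reduction only works in this direction: the cursor path between clicks cannot be recovered from the click list, which is the information sparse datasets lose.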

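4. How was it validated?

Evaluation on the UI-Vision benchmark: in a preliminary evaluation, current foundation action models were tested using the UI-Vision benchmark included in CUA-Suite.
Model limitations: the results show that current models struggle substantially with professional desktop applications (task failure rate of about 60%).
This high failure rate exposes the limits of existing models and signals the need for further research, which is indirect evidence of the suite's value rather than a direct demonstration that training on it improves agents.
Support for new research directions: the authors suggest that CUA-Suite's rich multimodal corpus supports emerging research directions, including generalist screen parsing, continuous spatial control, video-based reward modeling, and visual world models.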

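5. Is there any discussion?

Scalability and cost of data collection: 55 hours of expert demonstrations is large, but truly general-purpose CUAs may need more data, so the cost and effort of continued expansion could become an issue.
Annotation quality and consistency: because multi-layered reasoning annotations are complex, maintaining quality and consistency across a large dataset is important.
Details of the "lossless transformation": potential difficulties in transforming the data into the formats of existing frameworks, and how practical that process is, remain open points for discussion.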

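6. Which papers should be read next?

ScaleCUA: A Dataset for Computer-Use Agents: the paper on the largest existing CUA dataset, cited as prior work in this paper.
Recent papers on generalist screen parsing, continuous spatial control, video-based reward modeling, and visual world models: reading work on the research directions CUA-Suite is said to support gives a deeper sense of how the dataset can be used and where it may lead.
Papers on the implementation and evaluation of foundation action models: learning the technical details of the models evaluated in this paper helps explain the background of the problems CUA-Suite exposes.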

Abstract (original)

Computer-use agents (CUAs) hold great promise for automating complex desktop workflows, yet progress toward general-purpose agents is bottlenecked by the scarcity of continuous, high-quality human demonstration videos. Recent work emphasizes that continuous video, not sparse screenshots, is the critical missing ingredient for scaling these agents. However, the largest existing open dataset, ScaleCUA, contains only 2 million screenshots, equating to less than 20 hours of video. To address this bottleneck, we introduce CUA-Suite, a large-scale ecosystem of expert video demonstrations and dense annotations for professional desktop computer-use agents. At its core is VideoCUA, which provides approximately 10,000 human-demonstrated tasks across 87 diverse applications with continuous 30 fps screen recordings, kinematic cursor traces, and multi-layered reasoning annotations, totaling approximately 55 hours and 6 million frames of expert video. Unlike sparse datasets that capture only final click coordinates, these continuous video streams preserve the full temporal dynamics of human interaction, forming a superset of information that can be losslessly transformed into the formats required by existing agent frameworks. CUA-Suite further provides two complementary resources: UI-Vision, a rigorous benchmark for evaluating grounding and planning capabilities in CUAs, and GroundCUA, a large-scale grounding dataset with 56K annotated screenshots and over 3.6 million UI element annotations. Preliminary evaluation reveals that current foundation action models struggle substantially with professional desktop applications (~60% task failure rate). Beyond evaluation, CUA-Suite's rich multimodal corpus supports emerging research directions including generalist screen parsing, continuous spatial control, video-based reward modeling, and visual world models. All data and models are publicly released.
