OneVL: One-Step Latent Reasoning and Planning with Vision-Language Explanation

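1. What is it?

OneVL is a new framework for fast, accurate reasoning and planning in VLA (Vision-Language-Action) tasks such as autonomous driving.
It removes the latency caused by the autoregressive nature of conventional Chain-of-Thought (CoT) reasoning, with the goal of real-time deployment.
Beyond linguistic reasoning, it builds the causal dynamics captured by a visual world model (road geometry, agent motion, environmental change) into the latent space to obtain more generalizable representations.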

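2. What makes it better than prior work?

**Solving CoT latency:** Reasoning that explicit CoT generates autoregressively is handled in a single "one-step" pass, which brings inference latency down to that of answer-only prediction.
**Better latent CoT performance:** Earlier latent CoT methods fell short of explicit CoT. OneVL is the first latent CoT method to surpass explicit CoT in accuracy.
**Incorporating causal dynamics:** Rather than stopping at purely linguistic latent representations, OneVL uses a visual world model to force the latent space to learn real-world causal dynamics, which yields more robust and generalizable reasoning.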
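
The latency argument above can be made concrete with a rough back-of-the-envelope model. The sketch below assumes a toy cost model (one prefill pass plus one decoder pass per generated token). All constants are hypothetical values chosen for illustration, not measurements from the paper.

```python
def sequential_decode_steps(reasoning_tokens: int, answer_tokens: int) -> int:
    """Number of token-by-token decoder passes needed after the prompt prefill."""
    return reasoning_tokens + answer_tokens


def estimated_latency_ms(prefill_ms: float, decode_ms_per_token: float,
                         sequential_tokens: int) -> float:
    """Toy latency model: one prefill pass plus one pass per generated token."""
    return prefill_ms + decode_ms_per_token * sequential_tokens


# Illustrative numbers only (assumptions, not results from the paper).
PREFILL_MS = 50.0
DECODE_MS_PER_TOKEN = 10.0
COT_TOKENS = 100     # assumed length of an explicit text CoT
ANSWER_TOKENS = 20   # assumed length of the trajectory answer

# Explicit CoT: every reasoning token is generated sequentially before the answer.
explicit_cot_ms = estimated_latency_ms(
    PREFILL_MS, DECODE_MS_PER_TOKEN,
    sequential_decode_steps(COT_TOKENS, ANSWER_TOKENS))

# Answer-only prediction: no reasoning tokens are generated at all.
answer_only_ms = estimated_latency_ms(
    PREFILL_MS, DECODE_MS_PER_TOKEN,
    sequential_decode_steps(0, ANSWER_TOKENS))

# One-step latent CoT: the latent tokens are prefilled together with the prompt,
# so in this simplified model they add no sequential decode steps.
latent_cot_ms = estimated_latency_ms(
    PREFILL_MS, DECODE_MS_PER_TOKEN,
    sequential_decode_steps(0, ANSWER_TOKENS))

print(explicit_cot_ms, answer_only_ms, latent_cot_ms)
```

This model ignores the small extra prefill cost of the latent tokens themselves, so it only shows why prefilled latents can match answer-only latency. It does not predict real timings.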

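3. What is the key to the technique?

**Unified VLA and World Model framework:** OneVL integrates visual and language information and adopts a world model that captures causal relationships in the environment.
**Compact latent tokens and dual auxiliary decoders:**
Reasoning is compressed into compact latent tokens, which are supervised by two auxiliary decoders.
**Language decoder:** Reconstructs the text CoT, so that the latent space learns linguistic reasoning.
**Visual world model decoder:** Predicts future-frame tokens, forcing the latent space to learn visual causal dynamics (road geometry, agent motion, environmental change).
**Three-stage training pipeline:** Progressively aligns the latents with trajectory, language, and visual objectives, which keeps joint optimization stable.
**Faster inference:** At inference time the auxiliary decoders are discarded and all latent tokens are prefilled in a single parallel pass, giving one-step inference.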
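
As a minimal sketch of how the pieces above might fit together, the code below combines a trajectory loss with the two auxiliary decoder losses under a staged schedule, and drops the auxiliary decoders at inference. The stage order, the loss weights, and every name here (`StageWeights`, `STAGES`, `training_loss`, `active_modules`, the module labels) are hypothetical. The abstract only states that the latents are aligned progressively with trajectory, language, and visual objectives.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class StageWeights:
    """Loss weights for one training stage (hypothetical values)."""
    trajectory: float
    language: float
    visual: float


# Assumed schedule: objectives are switched on one stage at a time.
# The real stage definitions are not given in the abstract.
STAGES = {
    1: StageWeights(trajectory=1.0, language=0.0, visual=0.0),
    2: StageWeights(trajectory=1.0, language=1.0, visual=0.0),
    3: StageWeights(trajectory=1.0, language=1.0, visual=1.0),
}


def training_loss(stage: int, trajectory_loss: float,
                  cot_reconstruction_loss: float,
                  future_frame_loss: float) -> float:
    """Weighted sum of the trajectory loss and the two auxiliary decoder losses."""
    w = STAGES[stage]
    return (w.trajectory * trajectory_loss
            + w.language * cot_reconstruction_loss
            + w.visual * future_frame_loss)


def active_modules(training: bool) -> List[str]:
    """Modules that run in each mode; auxiliary decoders exist only for training."""
    modules = ["vla_backbone", "latent_tokens", "trajectory_head"]
    if training:
        modules += ["language_decoder", "visual_world_model_decoder"]
    return modules


# Example with made-up per-batch loss values.
for stage in (1, 2, 3):
    print(stage, training_loss(stage, 2.0, 3.0, 4.0))
print(active_modules(training=False))
```

The point of the design is that the auxiliary decoders only shape the latent tokens during training. Since nothing downstream depends on their outputs, removing them at inference leaves the trajectory prediction path unchanged.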

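4. How was it validated?

OneVL was evaluated on four benchmarks.
The results show that OneVL surpasses explicit CoT in accuracy while matching the low latency of answer-only prediction.
It is the first latent CoT method to outperform explicit CoT, and it achieves state-of-the-art (SOTA) accuracy.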

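5. Is there any discussion?

The abstract contains no explicit discussion. Even so, the "limited expressiveness" and "training difficulty" of latent CoT that this work is said to overcome will likely remain open questions when the approach is applied to more complex scenarios and a wider range of tasks.
Future work may also examine the complexity of the three-stage training, how well the dual-decoder design transfers to other VLA tasks, and the interpretability of the latent space.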

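6. What should be read next?

Other latent CoT methods that address CoT latency, and papers on autonomous driving and robotics that use world models.
Papers on integrating reasoning and planning in VLA, especially work that encodes visual causal dynamics into the latent space.
Examples: "Mastering Diverse Domains through World Models" (DreamerV3, a representative world model paper) and "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models" (the foundational CoT paper).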

Abstract (original)

Chain-of-Thought (CoT) reasoning has become a powerful driver of trajectory prediction in VLA-based autonomous driving, yet its autoregressive nature imposes a latency cost that is prohibitive for real-time deployment. Latent CoT methods attempt to close this gap by compressing reasoning into continuous hidden states, but consistently fall short of their explicit counterparts. We suggest that this is due to purely linguistic latent representations compressing a symbolic abstraction of the world, rather than the causal dynamics that actually govern driving. Thus, we present OneVL (One-step latent reasoning and planning with Vision-Language explanations), a unified VLA and World Model framework that routes reasoning through compact latent tokens supervised by dual auxiliary decoders. Alongside a language decoder that reconstructs text CoT, we introduce a visual world model decoder that predicts future-frame tokens, forcing the latent space to internalize the causal dynamics of road geometry, agent motion, and environmental change. A three-stage training pipeline progressively aligns these latents with trajectory, language, and visual objectives, ensuring stable joint optimization. At inference, the auxiliary decoders are discarded and all latent tokens are prefilled in a single parallel pass, matching the speed of answer-only prediction. Across four benchmarks, OneVL becomes the first latent CoT method to surpass explicit CoT, delivering state-of-the-art accuracy at answer-only latency, and providing direct evidence that tighter compression, when guided in both language and world-model supervision, produces more generalizable representations than verbose token-by-token reasoning. Project Page: https://xiaomi-embodied-intelligence.github.io/OneVL
