ShotVerse: Advancing Cinematic Camera Control for Text-Driven Multi-Shot Video Creation

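1. What is it?

Proposes ShotVerse, a framework for generating cinematic multi-shot videos from text.
It aims to resolve two problems with existing approaches: camera control through text prompts is imprecise, and specifying trajectories by hand is tedious and often fails at execution.
It adopts a two-stage "Plan-then-Control" approach.
A VLM (Vision-Language Model)-based Planner plans cinematic, globally aligned camera trajectories from text, and a Controller renders those trajectories into multi-shot video content via a camera adapter.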
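
The two-stage split described above can be sketched as a pair of interfaces. This is only an illustration of the data flow, not the paper's implementation: `ShotPlan`, `toy_planner`, `toy_controller`, and `plan_then_control` are hypothetical names, the planner is a hard-coded straight camera push rather than a VLM, and the controller emits text labels rather than video frames. Poses are assumed to be 4x4 camera-to-world matrices.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

Pose = List[List[float]]  # 4x4 camera-to-world matrix, row-major (assumed convention)


@dataclass
class ShotPlan:
    caption: str            # text describing one shot
    trajectory: List[Pose]  # one pose per frame, all in one shared global frame


def translation_pose(x: float, y: float, z: float) -> Pose:
    """Pose with identity rotation, positioned at (x, y, z)."""
    return [
        [1.0, 0.0, 0.0, x],
        [0.0, 1.0, 0.0, y],
        [0.0, 0.0, 1.0, z],
        [0.0, 0.0, 0.0, 1.0],
    ]


def toy_planner(captions: Sequence[str], frames_per_shot: int = 4,
                step: float = 0.5) -> List[ShotPlan]:
    """Hypothetical stand-in for the VLM Planner: a straight push along +z.

    The z offset carries over between shots, so every pose is expressed
    in the same global frame.
    """
    plans = []
    z = 0.0
    for caption in captions:
        trajectory = []
        for _ in range(frames_per_shot):
            trajectory.append(translation_pose(0.0, 0.0, z))
            z += step
        plans.append(ShotPlan(caption, trajectory))
    return plans


def toy_controller(plans: Sequence[ShotPlan]) -> List[str]:
    """Hypothetical stand-in for the Controller: one text label per frame."""
    frames = []
    for shot_index, plan in enumerate(plans):
        for pose in plan.trajectory:
            frames.append(f"shot {shot_index} | z={pose[2][3]:.1f} | {plan.caption}")
    return frames


def plan_then_control(
    captions: Sequence[str],
    planner: Callable[[Sequence[str]], List[ShotPlan]],
    controller: Callable[[Sequence[ShotPlan]], List[str]],
) -> List[str]:
    plans = planner(captions)  # stage 1: text -> trajectories
    return controller(plans)   # stage 2: trajectories -> frames


frames = plan_then_control(
    ["wide shot of a street", "close-up of a face"], toy_planner, toy_controller
)
for frame in frames:
    print(frame)
```

Keeping the trajectory as an explicit intermediate value is what lets the two stages be swapped or inspected independently.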

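2. What is the advantage over prior work?

It effectively bridges the gap between unreliable textual control and labor-intensive manual plotting.
In prior work, text prompts alone could not deliver precise camera control, while manual trajectory specification was laborious and prone to execution failures in existing models.
Through a data-centric paradigm shift, it introduces a new perspective: aligned (Caption, Trajectory, Video) triplets connect automated plotting and precise execution.
Its distinctive contribution is an automated multi-shot camera calibration pipeline that aligns disjoint single-shot trajectories into a unified global coordinate system, which made it possible to build the high-quality dataset ShotVerse-Bench.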

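3. What is the core of the technique?

**Data-centric paradigm shift**: The hypothesis that aligned (Caption, Trajectory, Video) triplets form an inherent joint distribution that connects automated plotting and precise execution.
**"Plan-then-Control" framework**: Decouples the generation process into two collaborative agents.
**VLM-based Planner**: Leverages spatial priors to generate cinematic, globally aligned camera trajectories from text.
**Controller**: Renders the trajectories produced by the Planner into multi-shot video content via a camera adapter.
**Automated multi-shot camera calibration pipeline**: Merges multiple single-shot trajectories into a unified global coordinate system, enabling the construction of the ShotVerse-Bench dataset.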
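
To make the last item concrete, here is a minimal geometric sketch of what "a unified global coordinate system" means. It assumes each shot's poses are 4x4 camera-to-world matrices and that a rigid shot-to-global transform is already known for every shot. The abstract does not say how the pipeline estimates those transforms, so that step is not shown, and the helper names (`matmul`, `yaw_pose`, `to_global`) and all numeric values are hypothetical.

```python
import math
from typing import List

Matrix = List[List[float]]  # 4x4 homogeneous transform, row-major


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Multiply two 4x4 matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def yaw_pose(yaw_deg: float, x: float, y: float, z: float) -> Matrix:
    """Camera-to-world pose: rotation about the y axis plus a translation."""
    c = math.cos(math.radians(yaw_deg))
    s = math.sin(math.radians(yaw_deg))
    return [
        [c, 0.0, s, x],
        [0.0, 1.0, 0.0, y],
        [-s, 0.0, c, z],
        [0.0, 0.0, 0.0, 1.0],
    ]


def to_global(shot_to_global: Matrix, local_trajectory: List[Matrix]) -> List[Matrix]:
    """Re-express every local pose of one shot in the global frame."""
    return [matmul(shot_to_global, pose) for pose in local_trajectory]


# Assumed inputs: two shots, each tracked in its own local frame.
shot_a_local = [yaw_pose(0.0, 0.0, 0.0, 0.0), yaw_pose(0.0, 0.0, 0.0, 1.0)]
shot_b_local = [yaw_pose(0.0, 0.0, 0.0, 0.0), yaw_pose(0.0, 0.0, 0.0, 1.0)]

# Assumed (not estimated here): shot A defines the global frame, and shot B's
# local frame is turned 90 degrees and offset 5 units along x relative to it.
a_to_global = yaw_pose(0.0, 0.0, 0.0, 0.0)
b_to_global = yaw_pose(90.0, 5.0, 0.0, 0.0)

global_trajectory = (to_global(a_to_global, shot_a_local)
                     + to_global(b_to_global, shot_b_local))

for pose in global_trajectory:
    print([round(pose[i][3], 3) for i in range(3)])  # camera position per frame
```

Once all shots share one frame, the concatenated list can be treated as a single multi-shot trajectory, which is the form a planner would need to learn from.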

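4. How was it validated?

Extensive experiments were run using the three-track evaluation protocol of the newly built high-fidelity cinematic dataset, ShotVerse-Bench.
The authors claim the results show that ShotVerse achieves superior cinematic aesthetics and generates multi-shot videos that are both camera-accurate and cross-shot consistent.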
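
The abstract does not state which metrics the three-track protocol uses. As a hedged illustration of how "camera-accurate" could be quantified, the sketch below compares a planned trajectory against one estimated from a generated video using two generic pose-error measures: the geodesic rotation angle and the Euclidean distance between camera positions. These are assumptions for illustration, not the paper's metrics, and all function names and sample poses are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Matrix = List[List[float]]  # 4x4 camera-to-world pose, row-major


def yaw_pose(yaw_deg: float, x: float, y: float, z: float) -> Matrix:
    """Pose with a rotation about the y axis plus a translation."""
    c = math.cos(math.radians(yaw_deg))
    s = math.sin(math.radians(yaw_deg))
    return [
        [c, 0.0, s, x],
        [0.0, 1.0, 0.0, y],
        [-s, 0.0, c, z],
        [0.0, 0.0, 0.0, 1.0],
    ]


def rotation_error_deg(pose_a: Matrix, pose_b: Matrix) -> float:
    """Geodesic angle between the two rotation blocks, in degrees."""
    # trace(R_a^T R_b) equals the element-wise product sum of the 3x3 blocks.
    trace = sum(pose_a[i][j] * pose_b[i][j] for i in range(3) for j in range(3))
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))  # clamp rounding noise
    return math.degrees(math.acos(cos_angle))


def translation_error(pose_a: Matrix, pose_b: Matrix) -> float:
    """Euclidean distance between the two camera positions."""
    return math.sqrt(sum((pose_a[i][3] - pose_b[i][3]) ** 2 for i in range(3)))


def mean_trajectory_errors(planned: Sequence[Matrix],
                           estimated: Sequence[Matrix]) -> Tuple[float, float]:
    """Mean per-frame rotation error (degrees) and translation error."""
    if not planned or len(planned) != len(estimated):
        raise ValueError("trajectories must be non-empty and of equal length")
    n = len(planned)
    rot = sum(rotation_error_deg(p, e) for p, e in zip(planned, estimated)) / n
    trans = sum(translation_error(p, e) for p, e in zip(planned, estimated)) / n
    return rot, trans


# Hypothetical sample data: the estimate drifts in both rotation and position.
planned = [yaw_pose(0.0, 0.0, 0.0, 0.0), yaw_pose(0.0, 0.0, 0.0, 1.0)]
estimated = [yaw_pose(10.0, 0.0, 0.0, 0.0), yaw_pose(30.0, 3.0, 4.0, 1.0)]

rot_err, trans_err = mean_trajectory_errors(planned, estimated)
print(f"mean rotation error: {rot_err:.2f} deg, mean translation error: {trans_err:.2f}")
```

In practice, trajectories recovered from video are defined only up to scale and an arbitrary frame, so an alignment step would normally precede a comparison like this.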

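5. Is there any discussion?

No explicit discussion or limitations can be read from the abstract, but the following points are likely topics for discussion.
How concrete the "spatial priors" acquired by the VLM-based Planner are, and how wide a range of cinematic expression they can cover.
Whether the modular "Plan-then-Control" approach, compared with end-to-end training, suffers from error propagation between modules and difficulty in optimizing the system as a whole.
The diversity, novelty, and computational cost of the generated video content.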

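6. What should be read next?

Recent papers on video generation and camera control using VLMs (Vision-Language Models).
Detailed papers on camera adapter architectures and training methods.
Papers on dataset construction and evaluation metrics for multi-shot video generation, especially those focused on trajectory alignment in a global coordinate system.
Related work on generating 3D scenes or camera paths from text.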

Abstract (original)

Text-driven video generation has democratized film creation, but camera control in cinematic multi-shot scenarios remains a significant block. Implicit textual prompts lack precision, while explicit trajectory conditioning imposes prohibitive manual overhead and often triggers execution failures in current models. To overcome this bottleneck, we propose a data-centric paradigm shift, positing that aligned (Caption, Trajectory, Video) triplets form an inherent joint distribution that can connect automated plotting and precise execution. Guided by this insight, we present ShotVerse, a "Plan-then-Control" framework that decouples generation into two collaborative agents: a VLM (Vision-Language Model)-based Planner that leverages spatial priors to obtain cinematic, globally aligned trajectories from text, and a Controller that renders these trajectories into multi-shot video content via a camera adapter. Central to our approach is the construction of a data foundation: we design an automated multi-shot camera calibration pipeline that aligns disjoint single-shot trajectories into a unified global coordinate system. This facilitates the curation of ShotVerse-Bench, a high-fidelity cinematic dataset with a three-track evaluation protocol that serves as the bedrock for our framework. Extensive experiments demonstrate that ShotVerse effectively bridges the gap between unreliable textual control and labor-intensive manual plotting, achieving superior cinematic aesthetics and generating multi-shot videos that are both camera-accurate and cross-shot consistent.

ShotVerse: Advancing Cinematic Camera Control for Text-Driven Multi-Shot Video Creation (code available)
