
Audio-Visual Intelligence in Large Foundation Models 💻 Code available

You Qin, Kai Liu, Shengqiong Wu, Kai Wang, Shijian Deng et al. · audio-visual intelligence, large foundation models, multimodal data · 2026-05-05 ⭐ 8/10

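💡 The first survey to comprehensively map the state of audio-visual intelligence (AVI) in large foundation models, presenting a unified task taxonomy, the key techniques, the datasets, and the open challenges.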

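🤖 From Ayumu: It's a survey, so you won't see a parade of new models or methods, but it makes it really clear how far this field has come and where it's heading next! What I find especially interesting is how it organizes things like audio-visual fusion and the evolution of generative models in the context of foundation models. 朋義-san, looking at this huge map of AVI might spark some inspiration for which research theme to dig into next! You should come away with a sense of where AI is going.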


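1. What is it?

The first survey to comprehensively review Audio-Visual Intelligence (AVI) from the perspective of large foundation models.
It maps the present and future of AVI, which integrates audio and vision to build machines that can perceive, generate, and interact in the multimodal real world.
It identifies the fragmentation of existing research, inconsistent task taxonomies, and heterogeneous evaluation practices as problems, and aims to systematize the field as a whole.
It organizes a broad range of AVI tasks under a unified taxonomy: understanding (e.g., speech recognition, sound localization), generation (e.g., audio-driven video synthesis, video-to-audio), and interaction (e.g., dialogue, embodied, or agentic interfaces).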

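2. What makes it better than prior work?

Existing AVI research is spread across many directions, with inconsistent task taxonomies, evaluation practices, and terminology that make it hard to integrate knowledge. Against that background, this is the "first" attempt to review AVI comprehensively in the context of large foundation models.
It systematically organizes a unified task taxonomy, the main methodological foundations, and representative datasets and evaluation metrics, giving this rapidly expanding field a coherent framework that can serve as a foundational reference for future research.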

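3. What is the core of the technique or method?

This paper is a survey and does not propose a specific new technique or method, but it organizes the main methodological foundations of AVI as follows.
Modality tokenization: methods that convert audio and visual data into a form a model can process.
Cross-modal fusion: mechanisms that integrate information from different modalities effectively.
Generation methods: multimodal content generation using autoregressive and diffusion models.
Large-scale pretraining: the process of acquiring general-purpose knowledge from massive multimodal data.
Instruction alignment and preference optimization: tuning techniques for producing outputs that follow given instructions and match human preferences.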
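
To make "cross-modal fusion" in the list above a little more concrete, here is a minimal sketch of one common fusion pattern, scaled dot-product cross-attention, in which each video token takes a weighted average of the audio tokens. This is an illustration only and is not taken from the survey: the token embeddings are random stand-ins for the output of a learned tokenizer, the names `cross_attention`, `video_tokens`, `audio_tokens`, and `DIM` are hypothetical, and the learned query/key/value projections of a real attention layer are omitted for brevity.

```python
import math
import random

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys_values):
    """Let each query token take a weighted average of the other modality's tokens."""
    dim = len(queries[0])
    fused, weights = [], []
    for q in queries:
        # Scaled dot-product similarity between this query and every key token.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
                  for k in keys_values]
        w = softmax(scores)
        # Weighted sum of the key/value tokens, one output dimension at a time.
        fused.append([sum(wi * kv[j] for wi, kv in zip(w, keys_values))
                      for j in range(dim)])
        weights.append(w)
    return fused, weights

random.seed(0)
DIM = 4  # hypothetical embedding width
# Hypothetical token embeddings; a real system would get these from learned tokenizers.
video_tokens = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(3)]
audio_tokens = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(5)]

fused, weights = cross_attention(video_tokens, audio_tokens)
print(len(fused), len(fused[0]))  # one fused vector per video token
```

Each row of `weights` sums to 1 and shows how strongly a given video token draws on each audio token. Swapping the arguments would give the reverse direction, with audio tokens attending over video tokens.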

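4. How was it validated?

Because this is a survey, it does not experimentally validate any particular technique or model.
Instead, its contribution lies in exhaustively analyzing and organizing the large body of existing literature and presenting a unified framework, which deepens understanding of the AVI field as a whole and points to directions for future research.
Concretely, it compares representative datasets, benchmarks, and evaluation metrics side by side, clarifying the progress and open problems in each task family.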

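5. Is there any discussion?

The paper itself lays out the main points of debate and the open challenges in the AVI field.
Problems with the current literature: fragmentation of existing research, inconsistent task taxonomies, and heterogeneous evaluation practices.
It names the following four challenges for future research.
Synchronization: improving the temporal alignment between audio and vision.
Spatial reasoning: understanding and generating the spatial relationship between audio and vision more accurately.
Controllability: enabling finer-grained control over the outputs of generative models.
Safety: addressing ethical concerns and the risk of generating misinformation.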
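
As a toy illustration of what the synchronization challenge above is about, the sketch below estimates how many frames an audio track lags behind the video by finding the shift that maximizes an unnormalized cross-correlation. It is not a method described in the survey: `best_lag`, the hand-made `audio_env` (audio loudness per frame), and `visual_motion` (motion energy per frame) are all hypothetical, and real systems typically work with learned features rather than raw envelopes.

```python
def best_lag(audio_env, visual_motion, max_lag):
    """Return the shift (in frames) of audio relative to video with the highest correlation."""
    def score(lag):
        # Pair each video frame i with audio frame i + lag, skipping out-of-range indices.
        pairs = [(audio_env[i + lag], visual_motion[i])
                 for i in range(len(visual_motion))
                 if 0 <= i + lag < len(audio_env)]
        return sum(a * v for a, v in pairs)
    return max(range(-max_lag, max_lag + 1), key=score)

# Hypothetical per-frame signals: two visual events, each heard two frames later.
visual_motion = [0, 0, 1, 0, 0, 0, 1, 0, 0, 0]
audio_env     = [0, 0, 0, 0, 1, 0, 0, 0, 1, 0]

print(best_lag(audio_env, visual_motion, max_lag=3))  # prints 2
```

A positive result means the audio arrives late relative to the video; shifting the audio earlier by that many frames would realign the two streams in this toy setting.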

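6. What should I read next?

The following works are mentioned in the paper's abstract as concrete results that represent the state of the art in AVI.
Meta MovieGen (specific paper title unknown; a video generation model released by Meta)
Google Veo-3 (specific paper title unknown; a video generation model released by Google)
Recent papers that focus on the open challenges identified in this paper, namely synchronization, spatial reasoning, controllability, and safety, will also be important for following the progress of AVI research.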

Abstract (original)

Audio-Visual Intelligence (AVI) has emerged as a central frontier in artificial intelligence, bridging auditory and visual modalities to enable machines that can perceive, generate, and interact in the multimodal real world. In the era of large foundation models, joint modeling of audio and vision has become increasingly crucial, i.e., not only for understanding but also for controllable generation and reasoning across dynamic, temporally grounded signals. Recent advances, such as Meta MovieGen and Google Veo-3, highlight the growing industrial and academic focus on unified audio-vision architectures that learn from massive multimodal data. However, despite rapid progress, the literature remains fragmented, spanning diverse tasks, inconsistent taxonomies, and heterogeneous evaluation practices that impede systematic comparison and knowledge integration. This survey provides the first comprehensive review of AVI through the lens of large foundation models. We establish a unified taxonomy covering the broad landscape of AVI tasks, ranging from understanding (e.g., speech recognition, sound localization) to generation (e.g., audio-driven video synthesis, video-to-audio) and interaction (e.g., dialogue, embodied, or agentic interfaces). We synthesize methodological foundations, including modality tokenization, cross-modal fusion, autoregressive and diffusion-based generation, large-scale pretraining, instruction alignment, and preference optimization. Furthermore, we curate representative datasets, benchmarks, and evaluation metrics, offering a structured comparison across task families and identifying open challenges in synchronization, spatial reasoning, controllability, and safety. By consolidating this rapidly expanding field into a coherent framework, this survey aims to serve as a foundational reference for future research on large-scale AVI.

📄 arXiv page 📑 PDF ⭐ GitHub (33 stars)