MemLens: Benchmarking Multimodal Long-Term Memory in Large V

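1. What is it?

The paper proposes MEMLENS, a comprehensive new benchmark for evaluating the multimodal long-term memory of large vision-language models (LVLMs).
It systematically compares the two main approaches to giving LVLMs memory in long multimodal conversations: long-context LVLMs and memory-augmented agents.
The benchmark consists of questions that genuinely require multimodal evidence, with the aim of closing a gap left by existing benchmarks.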

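2. What makes it better than prior work?

It focuses on questions that genuinely require multimodal evidence, which existing benchmarks lack.
An image-ablation study shows quantitatively that visual evidence is indispensable for answering the questions.
According to the authors, no existing benchmark systematically compares long-context LVLMs and memory-augmented agents, the two main approaches to giving LVLMs memory, on such questions; MEMLENS does.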

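3. What is the core of the technique or method?

**MEMLENS benchmark design:**
It comprises 789 questions covering five memory abilities (information extraction, multi-session reasoning, temporal reasoning, knowledge update, and answer refusal).
It supports four standard context lengths (32K-256K tokens), measured under a cross-modal token-counting scheme.
Scenarios are built as multimodal multi-session conversations that demand long-term memory and reasoning.
**Broad model evaluation:** 27 LVLMs and 7 memory-augmented agents are evaluated, and the performance characteristics of both approaches are analyzed in detail.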
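
To make the idea of a cross-modal token count concrete, here is a minimal sketch of how text and images might be charged against a shared context budget. This summary does not describe the paper's actual scheme, so everything here is an assumption for illustration: the fixed per-image cost `TOKENS_PER_IMAGE`, the whitespace-based text count, and the names `Turn`, `Session`, and `sessions_within_budget` are all hypothetical.

```python
from dataclasses import dataclass, field

# Hypothetical constant: the real per-image token cost is not given in this summary.
TOKENS_PER_IMAGE = 256


@dataclass
class Turn:
    text: str
    images: list[str] = field(default_factory=list)  # image identifiers or paths


@dataclass
class Session:
    turns: list[Turn]


def count_turn_tokens(turn: Turn) -> int:
    # Whitespace splitting stands in for a real tokenizer (assumption).
    return len(turn.text.split()) + TOKENS_PER_IMAGE * len(turn.images)


def count_session_tokens(session: Session) -> int:
    return sum(count_turn_tokens(t) for t in session.turns)


def sessions_within_budget(sessions: list[Session], budget: int) -> list[Session]:
    """Keep sessions in order until adding the next one would exceed the budget."""
    kept, used = [], 0
    for s in sessions:
        cost = count_session_tokens(s)
        if used + cost > budget:
            break
        kept.append(s)
        used += cost
    return kept
```

Counting both modalities in one unit is what lets a conversation be padded or truncated to a target length such as 32K tokens regardless of how many images it contains.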

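4. How was it validated?

**Image-ablation study:** Removing the evidence images drops two frontier LVLMs below 2% accuracy on the 80.4% of questions whose evidence includes images, which confirms that MEMLENS genuinely requires visual evidence.
**Broad model evaluation and analysis:** Evaluating 27 LVLMs and 7 memory-augmented agents on MEMLENS revealed the following issues.
Long-context LVLMs achieve high accuracy at short context lengths but degrade as conversations grow longer.
Memory agents are stable across lengths but lose visual fidelity because of compression at storage time.
On multi-session reasoning in particular, most systems stay below 30%, and neither approach alone solves the task, which supports the benchmark's difficulty and usefulness.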
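
The ablation and the per-ability breakdown described above can be sketched roughly as follows. This is not the authors' evaluation code. The question schema (`ability`, `answer`, `evidence_images`, `history`), the exact-match scoring, and the function names are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Callable


def ablate_images(question: dict) -> dict:
    """Return a copy of the question with its evidence images removed from the history."""
    evidence = set(question["evidence_images"])
    history = [
        {**turn, "images": [img for img in turn["images"] if img not in evidence]}
        for turn in question["history"]
    ]
    return {**question, "history": history}


def accuracy_by_ability(questions: list[dict], model: Callable[[dict], str]) -> dict:
    """Exact-match accuracy per memory ability (the scoring rule is an assumption)."""
    correct, total = defaultdict(int), defaultdict(int)
    for q in questions:
        total[q["ability"]] += 1
        correct[q["ability"]] += int(model(q) == q["answer"])
    return {ability: correct[ability] / total[ability] for ability in total}
```

Running `accuracy_by_ability` once on the original questions and once on `[ablate_images(q) for q in questions]` gives the with-image and without-image scores to compare. A large drop indicates that the questions cannot be answered from text alone.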

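5. Is there any discussion?

At present, neither long-context LVLMs nor memory-augmented agents fully solve multimodal long-term memory tasks.
Long-context LVLMs have a scalability problem: their performance degrades as the context grows.
Memory agents lose visual fidelity when they compress information for storage.
Reasoning across multiple sessions in particular remains weak, with most systems below 30%, and is a major open problem for future research.
The paper argues that these results motivate hybrid architectures that combine long-context attention with structured multimodal retrieval.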
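
As summarized here, the paper motivates hybrid architectures but does not specify one. The following is therefore only a toy illustration of what "long context plus structured retrieval" could mean, not the authors' design. It assumes each session carries a set of `tags` as its structured index, keeps the most recent sessions verbatim within a budget, and retrieves up to `top_k` older sessions by tag overlap. All names and the session schema are hypothetical.

```python
from typing import Callable


def build_hybrid_context(
    sessions: list[dict],
    query_tags: set[str],
    recent_budget: int,
    top_k: int,
    count_tokens: Callable[[dict], int],
) -> list[dict]:
    """sessions are ordered oldest first; each has a 'tags' set (assumed schema)."""
    # Long-context part: keep the newest sessions verbatim while they fit the budget.
    recent, used = [], 0
    for s in reversed(sessions):
        cost = count_tokens(s)
        if used + cost > recent_budget:
            break
        recent.append(s)
        used += cost
    recent.reverse()

    # Retrieval part: rank older sessions by tag overlap with the query.
    older = sessions[: len(sessions) - len(recent)]
    scored = [(len(s["tags"] & query_tags), i, s) for i, s in enumerate(older)]
    hits = sorted((t for t in scored if t[0] > 0), key=lambda t: (-t[0], t[1]))[:top_k]

    # Restore chronological order so the model sees a coherent timeline.
    retrieved = [s for _, _, s in sorted(hits, key=lambda t: t[1])]
    return retrieved + recent
```

The split reflects the trade-off reported above: recent sessions keep their images at full fidelity inside the context window, while retrieval keeps the cost of older history bounded as conversations grow.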

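6. Which papers should be read next?

Papers on structured multimodal retrieval for multimodal long-term memory.
Papers on new architectures or methods that reduce the degradation of long-context LVLMs as context length increases.
Papers on methods for memory-augmented agents that compress and retrieve information efficiently while preserving visual fidelity.
Papers that concretely propose or implement the hybrid architecture this paper calls for, combining long-context attention with structured multimodal retrieval.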

Abstract (original)

Memory is essential for large vision-language models (LVLMs) to handle long, multimodal interactions, with two method directions providing this capability: long-context LVLMs and memory-augmented agents. However, no existing benchmark conducts a systematic comparison of the two on questions that genuinely require multimodal evidence. To close this gap, we introduce MEMLENS, a comprehensive benchmark for memory in multimodal multi-session conversations, comprising 789 questions across five memory abilities (information extraction, multi-session reasoning, temporal reasoning, knowledge update, and answer refusal) at four standard context lengths (32K-256K tokens) under a cross-modal token-counting scheme. An image-ablation study confirms that solving MEMLENS requires visual evidence: removing evidence images drops two frontier LVLMs below 2% accuracy on the 80.4% of questions whose evidence includes images. Evaluating 27 LVLMs and 7 memory-augmented agents, we find that long-context LVLMs achieve high short-context accuracy through direct visual grounding but degrade as conversations grow, whereas memory agents are length-stable but lose visual fidelity under storage-time compression. Multi-session reasoning caps most systems below 30%, and neither approach alone solves the task. These results motivate hybrid architectures that combine long-context attention with structured multimodal retrieval. Our code is available at https://github.com/xrenaf/MEMLENS.

MemLens: Benchmarking Multimodal Long-Term Memory in Large Vision-Language Models (code available)
