
Trust Your Critic: Robust Reward Modeling and Reinforcement Learning for Faithful Image Editing and Generation 💻 Code available

Xiangyu Zhao, Peiyuan Zhang, Junming Lin, Tianhao Liang, Yuchen Duan, et al. · reinforcement learning, reward models, image editing · 2026-03-12 ⭐ 9/10

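💡 To address hallucination and noisy scoring in the reward models used for RL-based image editing and generation, the paper proposes FIRM, a framework combining high-quality data, specialized reward models, and an RL integration strategy, and reports substantially improved fidelity and instruction adherence.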

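🤖 From Ayumu: I like the idea of making the "critic" for image generation smarter! Removing hallucinations from the reward model so the generator produces images that people actually recognize as right feels like real progress. Using separate evaluation criteria for editing and generation is an especially smart touch.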

Reward Modeling Reinforcement Learning Image Editing Image Generation Human Alignment Data Curation

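1. What is it?

The paper proposes FIRM (Faithful Image Reward Modeling), a comprehensive framework for improving the quality of image editing and generation trained with reinforcement learning (RL).
It targets the hallucination and noisy-score problems of existing reward models and aims to provide more accurate and reliable evaluation (critique).
It covers high-quality dataset construction, training of specialized reward models, a new benchmark, and a strategy for integrating the critics into RL.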

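2. How does it improve on prior work?

Existing reward models suffer from hallucinations and noise that misguide RL optimization; this work builds reward models that align more closely with human judgment.
It introduces reward models and evaluation criteria specialized for image editing and generation (execution and consistency for editing, instruction following for generation), achieving higher fidelity and instruction adherence than general-purpose models.
It provides FIRM-Bench, a comprehensive benchmark dedicated to editing and generation critics, which standardizes how critics are evaluated.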

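3. What is the key to the technique?

**High-quality data curation pipelines**: The FIRM-Edit-370K and FIRM-Gen-293K datasets are built with evaluation criteria that emphasize execution and consistency for editing, and instruction following for generation.
**Training specialized reward models**: Two 8B-parameter reward models, FIRM-Edit-8B and FIRM-Gen-8B, are trained on these datasets.
**"Base-and-Bonus" reward strategy**: A strategy for integrating the reward models into the RL pipeline. It introduces Consistency-Modulated Execution (CME) for editing and Quality-Modulated Alignment (QMA) for generation to balance competing objectives.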
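
The abstract names the "Base-and-Bonus" strategy but does not give its formula, so the following is only a minimal sketch of the general idea under stated assumptions. It assumes critic scores in [0, 1], and it assumes from the names CME and QMA that the primary score (execution or alignment) acts as the base while the secondary score (consistency or quality) scales a bonus. The functional form, the `bonus_weight` parameter, and all class and function names are hypothetical and are not taken from the paper or its code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EditScores:
    """Hypothetical critic outputs for one edited image, each in [0, 1]."""
    execution: float    # how well the requested edit was carried out
    consistency: float  # how well content outside the edit was preserved


def _check_unit(name: str, value: float) -> None:
    """Reject scores outside the assumed [0, 1] range."""
    if not 0.0 <= value <= 1.0:
        raise ValueError(f"{name} must be in [0, 1], got {value}")


def base_and_bonus(base: float, modulator: float, bonus_weight: float = 0.5) -> float:
    """Illustrative reward: reward = base + bonus_weight * base * modulator.

    Assumption: this form is NOT the paper's formula. It only shows one way a
    secondary score can add reward without paying out anything when the
    primary objective fails (base == 0).
    """
    _check_unit("base", base)
    _check_unit("modulator", modulator)
    return base + bonus_weight * base * modulator


def edit_reward(scores: EditScores, bonus_weight: float = 0.5) -> float:
    """CME-style sketch: execution is the base, consistency scales the bonus."""
    return base_and_bonus(scores.execution, scores.consistency, bonus_weight)


def generation_reward(alignment: float, quality: float, bonus_weight: float = 0.5) -> float:
    """QMA-style sketch: alignment is the base, quality scales the bonus."""
    return base_and_bonus(alignment, quality, bonus_weight)
```

With this form, an edit that preserves the rest of the image perfectly but ignores the instruction earns zero reward, while two edits with equal execution are ranked by consistency. That is one way to keep a policy from collecting reward by simply leaving the image unchanged.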

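4. How was it validated?

On FIRM-Bench, the proposed models are shown to align better with human judgment than existing metrics.
Comprehensive experiments show that the models trained with the FIRM framework (FIRM-Qwen-Edit and FIRM-SD3.5) achieve substantial performance gains.
The experiments show that FIRM mitigates hallucinations and sets a new standard for fidelity and instruction adherence over existing general-purpose models.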
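
The abstract does not say which statistic FIRM-Bench uses to measure alignment with human judgment. As a generic illustration only, the sketch below computes pairwise ranking agreement between a critic's scores and human scores for the same samples. The function name and the tie-handling rules are assumptions made for this example, not the benchmark's protocol.

```python
from itertools import combinations
from typing import Sequence


def pairwise_agreement(critic: Sequence[float], human: Sequence[float]) -> float:
    """Fraction of sample pairs that the critic orders the same way as humans.

    Pairs with tied human scores are skipped. A critic tie on a pair that
    humans do rank counts as a disagreement.
    """
    if len(critic) != len(human):
        raise ValueError("critic and human must have the same length")
    agree = 0
    total = 0
    for i, j in combinations(range(len(human)), 2):
        human_diff = human[i] - human[j]
        if human_diff == 0:
            continue  # humans express no preference for this pair
        total += 1
        critic_diff = critic[i] - critic[j]
        if critic_diff * human_diff > 0:
            agree += 1  # same sign: both prefer the same sample
    if total == 0:
        raise ValueError("no pairs with distinct human scores")
    return agree / total
```

For example, `pairwise_agreement([0.1, 0.4, 0.3, 0.9], [1, 2, 3, 4])` compares six pairs, and the critic disagrees with the human ordering only on the second and third samples. A rank correlation such as Spearman's would be a reasonable alternative.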

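5. Is there any discussion?

No explicit points of discussion can be read from the abstract. Still, reward modeling always carries challenges such as data bias and subjective evaluation, so the limits of this work's "high-quality" data curation and its generalization to other domains may be open to debate.
How the weighting of each term in the "Base-and-Bonus" reward strategy affects different tasks and user preferences may need further analysis.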

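6. What should I read next?

Papers on other state-of-the-art models and frameworks for RL-based image generation and editing.
Work applying DPO (Direct Preference Optimization) or RLHF (Reinforcement Learning from Human Feedback) to image generation.
Work on the limitations of image evaluation metrics (FID, CLIP-score, etc.) and on closing the gap between them and human evaluation.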

Abstract (original)

Reinforcement learning (RL) has emerged as a promising paradigm for enhancing image editing and text-to-image (T2I) generation. However, current reward models, which act as critics during RL, often suffer from hallucinations and assign noisy scores, inherently misguiding the optimization process. In this paper, we present FIRM (Faithful Image Reward Modeling), a comprehensive framework that develops robust reward models to provide accurate and reliable guidance for faithful image generation and editing. First, we design tailored data curation pipelines to construct high-quality scoring datasets. Specifically, we evaluate editing using both execution and consistency, while generation is primarily assessed via instruction following. Using these pipelines, we collect the FIRM-Edit-370K and FIRM-Gen-293K datasets, and train specialized reward models (FIRM-Edit-8B and FIRM-Gen-8B) that accurately reflect these criteria. Second, we introduce FIRM-Bench, a comprehensive benchmark specifically designed for editing and generation critics. Evaluations demonstrate that our models achieve superior alignment with human judgment compared to existing metrics. Furthermore, to seamlessly integrate these critics into the RL pipeline, we formulate a novel "Base-and-Bonus" reward strategy that balances competing objectives: Consistency-Modulated Execution (CME) for editing and Quality-Modulated Alignment (QMA) for generation. Empowered by this framework, our resulting models FIRM-Qwen-Edit and FIRM-SD3.5 achieve substantial performance breakthroughs. Comprehensive experiments demonstrate that FIRM mitigates hallucinations, establishing a new standard for fidelity and instruction adherence over existing general models. All of our datasets, models, and code have been publicly available at https://firm-reward.github.io.
