LLM Safety From Within: Detecting Harmful Content with Internal Representations

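1. What is it?

The paper proposes SIREN, a new guard model for detecting harmful content in user prompts and LLM responses.
Whereas existing guard models rely solely on terminal-layer representations, SIREN draws on the LLM's internal-layer representations.
It is a lightweight, high-performing approach that builds a harmfulness detector from the model's internal states without modifying the underlying LLM.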

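2. How does it improve on prior work?

**Detection performance**: Substantially outperforms state-of-the-art open-source guard models across multiple benchmarks.
**Efficiency**: Uses 250 times fewer trainable parameters than those guard models, which keeps it lightweight.
**Generalization**: Generalizes better to unseen benchmarks.
**Real-time operation and inference efficiency**: Naturally enables real-time streaming detection and significantly improves inference efficiency compared with generative guard models.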

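3. What is the core of the technique?

**Using internal features**: Identifies the safety-relevant features ("safety neurons") distributed across the LLM's internal layers and puts them to use.
**Identifying safety neurons**: Uses linear probing to determine how much the neurons in each internal layer contribute to safety.
**Combining features**: Combines the identified safety neurons through an adaptive layer-weighted strategy to produce the final harmfulness score.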
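The two steps above can be pictured with a small toy example. The following is only a minimal sketch under stated assumptions, not SIREN's implementation: the abstract does not specify how neurons are ranked or how layer weights are computed, so ranking neurons by absolute probe weight and weighting layers by a softmax over validation accuracy are illustrative choices made here. The "layers" are synthetic Gaussian activations, and every name (`fit_probe`, `select_neurons`, `demo`, `temperature`, and so on) is hypothetical.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict(x, w, b):
    """Probe output: estimated probability that the input is harmful."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


def fit_probe(X, y, lr=0.5, epochs=200):
    """Fit a logistic-regression (linear) probe by batch gradient descent."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = predict(xi, w, b) - yi
            grad_b += err
            for j in range(d):
                grad_w[j] += err * xi[j]
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def select_neurons(w, k):
    """Sorted indices of the k neurons with the largest absolute probe weight."""
    ranked = sorted(range(len(w)), key=lambda j: abs(w[j]), reverse=True)
    return sorted(ranked[:k])


def take(rows, idx):
    """Restrict each activation vector to the selected neuron indices."""
    return [[r[j] for j in idx] for r in rows]


def accuracy(scores, y):
    """Fraction of examples where thresholding the score at 0.5 matches the label."""
    return sum((s > 0.5) == bool(t) for s, t in zip(scores, y)) / len(y)


def make_layer(y, d, signal, strength, rng):
    """Toy activations: unit Gaussian noise plus a label-dependent shift on `signal` neurons."""
    return [[rng.gauss(0.0, 1.0) + (strength * (2 * t - 1) if j in signal else 0.0)
             for j in range(d)] for t in y]


def demo(seed=0, k=2, temperature=0.1):
    rng = random.Random(seed)
    y = [i % 2 for i in range(200)]  # 1 = harmful, 0 = benign (synthetic labels)
    # Three toy "layers"; the signal strength per layer is an arbitrary assumption.
    layers = [make_layer(y, 8, {1, 5}, s, rng) for s in (0.2, 1.0, 2.0)]
    tr, va, te = slice(0, 120), slice(120, 160), slice(160, 200)

    probes, val_acc = [], []
    for X in layers:
        w_full, _ = fit_probe(X[tr], y[tr])        # probe over all neurons
        idx = select_neurons(w_full, k)            # keep the top-k neurons
        w, b = fit_probe(take(X[tr], idx), y[tr])  # small probe on the selection
        probes.append((idx, w, b))
        val_scores = [predict(r, w, b) for r in take(X[va], idx)]
        val_acc.append(accuracy(val_scores, y[va]))

    # Layer weights: softmax over validation accuracy (illustrative choice).
    exps = [math.exp(a / temperature) for a in val_acc]
    layer_w = [e / sum(exps) for e in exps]

    # Final score: weighted average of the per-layer probe probabilities.
    scores = []
    for i in range(te.start, te.stop):
        scores.append(sum(lw * predict([X[i][j] for j in idx], w, b)
                          for X, (idx, w, b), lw in zip(layers, probes, layer_w)))
    return {"neurons": [p[0] for p in probes], "layer_weights": layer_w,
            "test_accuracy": accuracy(scores, y[te])}


result = demo()
print(result)
```

In this toy setup only the small per-layer probes and the layer weights are fitted; the activations themselves are treated as fixed inputs, which mirrors the idea of building a detector on top of an unmodified model. With real models the activations would come from the LLM's hidden states rather than from `make_layer`.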

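4. How was it validated?

**Comprehensive evaluation**: SIREN's harmfulness detection performance was evaluated on multiple benchmark sets.
**Comparison with SOTA models**: Compared against existing state-of-the-art open-source guard models, SIREN performed substantially better.
**Efficiency**: The evaluation reports that SIREN uses 250 times fewer trainable parameters than those models, underscoring how lightweight it is.
**Generalization**: Performance on unseen benchmarks was also evaluated and showed superior generalization.
**Inference efficiency**: A comparison with generative guard models showed real-time detection capability and improved inference efficiency.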
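One way to picture the real-time streaming detection mentioned above: if a harmfulness score can be computed from the hidden state of each token as it is produced, generation can be monitored token by token and halted as soon as the score crosses a threshold. The sketch below is an assumption-laden illustration rather than the paper's mechanism. The function `stream_guard`, the exponential smoothing, the `threshold` and `smoothing` values, and the toy one-dimensional "hidden states" are all hypothetical.

```python
from typing import Callable, Iterable, Iterator, Sequence, Tuple


def stream_guard(
    hidden_states: Iterable[Sequence[float]],
    score_fn: Callable[[Sequence[float]], float],
    threshold: float = 0.8,
    smoothing: float = 0.5,
) -> Iterator[Tuple[int, float, bool]]:
    """Yield (token_index, smoothed_score, flagged) for each incoming token.

    Stops right after the first flagged token so the caller can halt generation.
    """
    ema = None
    for t, h in enumerate(hidden_states):
        p = score_fn(h)  # per-token harmfulness probability from some probe
        # Exponential moving average damps single-token spikes.
        ema = p if ema is None else smoothing * ema + (1.0 - smoothing) * p
        flagged = ema >= threshold
        yield t, ema, flagged
        if flagged:
            return


# Toy stream: each "hidden state" is a single number that is used directly as the score.
states = [[0.1], [0.2], [0.9], [0.95], [0.97], [0.1]]
events = list(stream_guard(states, score_fn=lambda h: h[0]))
for t, score, flagged in events:
    print(t, round(score, 4), flagged)
```

Because the generator returns after the first flagged token, the last toy state is never consumed. A generative guard model, by contrast, would typically need its own decoding pass to emit a verdict, which is the efficiency contrast the summary refers to.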

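5. Is there any discussion?

The abstract does not directly state discussion points or limitations. In general, though, open questions could include how robust the definition of "safety neurons" is, how much the method depends on particular LLM architectures, and how far the interpretability of the internal representations can be pushed.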

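6. What should be read next?

Research papers on the interpretability of LLM internal representations (Mechanistic Interpretability).
Recent research papers on more advanced guard model architectures and on LLM safety evaluation.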

Abstract (original)

Guard models are widely used to detect harmful content in user prompts and LLM responses. However, state-of-the-art guard models rely solely on terminal-layer representations and overlook the rich safety-relevant features distributed across internal layers. We present SIREN, a lightweight guard model that harnesses these internal features. By identifying safety neurons via linear probing and combining them through an adaptive layer-weighted strategy, SIREN builds a harmfulness detector from LLM internals without modifying the underlying model. Our comprehensive evaluation shows that SIREN substantially outperforms state-of-the-art open-source guard models across multiple benchmarks while using 250 times fewer trainable parameters. Moreover, SIREN exhibits superior generalization to unseen benchmarks, naturally enables real-time streaming detection, and significantly improves inference efficiency compared to generative guard models. Overall, our results highlight LLM internal states as a promising foundation for practical, high-performance harmfulness detection.

LLM Safety From Within: Detecting Harmful Content with Internal Representations (code available)
