Programming with Data: Test-Driven Data Engineering for Self-Improving LLMs from Raw Corpora 💻 Code available

Chenkai Pan, Xinglong Xu, Yuhang Xu, Yujun Wu, Siyuan Li et al. · fine-tuning, domain corpora, model training · 2026-04-27 ⭐ 8/10

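💡 Proposes the "Programming with Data" paradigm for transferring specialized knowledge into LLMs: grounding the pipeline in a structured knowledge representation makes data engineering as systematic as software development, so that model failures can be diagnosed and repaired at the data level.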

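🤖 From Ayumu: The idea behind "Programming with Data" is really interesting! When LLM fine-tuning isn't going well and you suspect the data, you used to tweak the data on a hunch. With this approach you can pin down the cause and fix exactly that spot, the way you would debug software. This looks like it could be a game changer for LLM development. I think 朋義-san will like this data-driven debugging approach too!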

fine-tuning, domain corpora, model training, data engineering, knowledge representation, debugging, self-improving LLMs, programming with data

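1. What is it?

Proposes "Programming with Data", a new data-engineering paradigm that addresses a core problem in transferring specialized knowledge into LLMs: the lack of feedback.
In conventional fine-tuning, when a model fails there is no way to diagnose what is wrong with the training data, and the only option is to add more data indiscriminately.
By making a structured knowledge representation the shared foundation for both training data and evaluation, the data-engineering lifecycle is mapped onto the software development lifecycle.
This makes it possible to locate the cause of a model failure at the data level and repair it efficiently.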

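2. What makes it better than prior work?

The conventional fine-tuning process has no feedback loop and no means of diagnosing defects in the training data when a model fails; this work provides a systematic diagnosis and repair process.
Model failures are decomposed into "concept-level gaps" and "reasoning-chain breaks", both treated as problems that originate in the data, which enables targeted data repair.
The data-engineering lifecycle is tied explicitly to the software development lifecycle (training data = source code, model training = compilation, benchmarking = unit testing, failure-driven data repair = debugging), establishing a principled self-improvement cycle.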
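
The four-way correspondence can be made concrete with a toy loop. This is a minimal sketch, not the authors' implementation: the "model" is just the set of concepts it has seen, and every name here (`KNOWLEDGE_BASE`, `compile_model`, `run_tests`, `debug`, `improve`) is hypothetical.

```python
# Toy illustration of the lifecycle analogy; all names are hypothetical.
# The knowledge base maps a concept id to a statement about that concept.
KNOWLEDGE_BASE = {
    "ohm": "V = I * R",
    "kirchhoff_current": "Currents into a node sum to zero",
    "power": "P = V * I",
}

def compile_model(training_data):
    """Model training = compilation. The toy 'model' is the set of concepts it saw."""
    return {concept for concept, _text in training_data}

def run_tests(model, benchmark):
    """Benchmarking = unit testing. Return the concepts whose test fails."""
    return [concept for concept in benchmark if concept not in model]

def debug(training_data, failed_concepts):
    """Failure-driven repair = debugging. Add samples only for the failed concepts,
    generated from the knowledge base rather than copied from the benchmark."""
    patch = [(c, KNOWLEDGE_BASE[c]) for c in failed_concepts]
    return training_data + patch

def improve(training_data, benchmark, max_cycles=3):
    """Run compile -> test -> debug until all tests pass or the budget is spent."""
    history = []  # number of failing tests observed in each cycle
    for _ in range(max_cycles):
        model = compile_model(training_data)
        failures = run_tests(model, benchmark)
        history.append(len(failures))
        if not failures:
            break
        training_data = debug(training_data, failures)
    return training_data, history

data, history = improve([("ohm", KNOWLEDGE_BASE["ohm"])], list(KNOWLEDGE_BASE))
print(history)  # [2, 0]
```

Because training samples and test items are both keyed by the same concept ids, a failing test points directly at the data that needs patching, which is the role the shared knowledge representation plays in the summary above.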

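3. What is the core of the technique?

**Use of a structured knowledge representation**: A structured knowledge representation extracted from the raw corpus serves as the basis for both generating the model's training data and evaluating the model.
**Mapping onto the software development lifecycle**: Each phase of data engineering is mapped to its counterpart in software development, and the concept of debugging is applied to data repair.
**Failure diagnosis and targeted repair**: Model failures are diagnosed against the structured knowledge representation as "concept-level gaps" or "reasoning-chain breaks". These are traced back to specific deficiencies in the data, and targeted data patches are applied.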
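
The diagnosis step can be sketched as follows. This is a sketch under assumptions the abstract does not spell out: the knowledge representation is modelled as concepts plus directed links between them, a benchmark item lists the ordered chain of concepts it needs, and `Coverage`, `diagnose` and `apply_patch` are hypothetical helpers.

```python
from dataclasses import dataclass, field

@dataclass
class Coverage:
    """What the current training data covers (an assumed representation)."""
    concepts: set = field(default_factory=set)
    links: set = field(default_factory=set)  # (source, target) concept pairs

def diagnose(chain, coverage):
    """Classify a failed benchmark item whose answer needs `chain`, an ordered
    list of concepts. Returns (kind, missing_pieces)."""
    missing_concepts = [c for c in chain if c not in coverage.concepts]
    if missing_concepts:
        return "concept_gap", missing_concepts
    # All concepts are covered, so check each consecutive step of the chain.
    missing_links = [pair for pair in zip(chain, chain[1:])
                     if pair not in coverage.links]
    if missing_links:
        return "reasoning_chain_break", missing_links
    return "undiagnosed", []

def apply_patch(coverage, kind, missing):
    """Targeted repair: extend coverage only with the diagnosed pieces."""
    if kind == "concept_gap":
        coverage.concepts.update(missing)
    elif kind == "reasoning_chain_break":
        coverage.links.update(missing)

coverage = Coverage(concepts={"ohm", "power"})
chain = ["ohm", "power", "heat_dissipation"]

kind, missing = diagnose(chain, coverage)
print(kind, missing)  # concept_gap ['heat_dissipation']
apply_patch(coverage, kind, missing)

kind, missing = diagnose(chain, coverage)
print(kind, missing)  # reasoning_chain_break [('ohm', 'power'), ('power', 'heat_dissipation')]
```

In a real pipeline `apply_patch` would generate new training samples for the missing concepts or links and trigger retraining; here it only updates the coverage record so the two failure types can be seen in sequence.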

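4. How was it validated?

The "Programming with Data" principle is demonstrated across sixteen academic disciplines spanning the natural sciences, engineering, biomedicine, and the social sciences.
Each repair cycle is shown to yield consistent capability improvements regardless of model scale or architecture, without degrading the model's general capabilities.
To support the demonstration, the structured knowledge base, benchmark suite, and training corpus are released as open resources, which aids reproducibility and further research.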

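5. Is there any discussion?

No explicit discussion or limitations can be read from the abstract. However, the complexity of the process that extracts the structured knowledge representation, and the effect of its quality on overall system performance, may need more detailed analysis.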

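6. Which papers should be read next?

Research papers that build on the structured knowledge base, benchmark suite, and training corpus released with this paper to develop the "Programming with Data" paradigm further.
Papers on knowledge graph construction, automated data curation, and error analysis and diagnosis in LLM fine-tuning.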

Abstract (original)

Reliably transferring specialized human knowledge from text into large language models remains a fundamental challenge in artificial intelligence. Fine-tuning on domain corpora has enabled substantial capability gains, but the process operates without feedback: when a model fails on a domain task, there is no method to diagnose what is deficient in the training data, and the only recourse is to add more data indiscriminately. Here we show that when a structured knowledge representation extracted from the source corpus serves as the shared foundation for both training data and evaluation, the complete data-engineering lifecycle maps onto the software development lifecycle in a precise and operative way: training data becomes source code specifying what the model should learn, model training becomes compilation, benchmarking becomes unit testing, and failure-driven data repair becomes debugging. Under this correspondence, model failures decompose into concept-level gaps and reasoning-chain breaks that can be traced back to specific deficiencies in the data and repaired through targeted patches, with each repair cycle producing consistent improvements across model scales and architectures without degrading general capabilities. We formalize this principle as Programming with Data and instantiate it across sixteen disciplines spanning the natural sciences, engineering, biomedicine, and the social sciences, releasing a structured knowledge base, benchmark suite, and training corpus as open resources. By demonstrating that the relationship between training data and model behaviour is structurally traceable and systematically repairable, this work establishes a principled foundation for the reliable engineering of human expertise into language models.

📄 arXiv page 📑 PDF ⭐ GitHub (53 stars)