With the rapid development of large model technology, training data has become a foundational element underpinning the performance gains and emergent capabilities of generative artificial intelligence. However, the high complexity of training data in its sources, structures, and modes of use continually gives rise to infringement risks in personal information protection, copyright protection, and the allocation of data rights during model training and content generation. Existing legal norms are largely premised on traditional models of data processing and content production, and therefore struggle to respond effectively to the novel risks posed by training data for large models. Starting from the source types and legal attributes of training data, this paper systematically examines the principal infringement risks across the full life cycle of training data for large models, analyzes their institutional causes, and on that basis proposes a rule-of-law governance path centered on classified governance, risk orientation, and liability allocation, with a view to providing normative support for the compliant development of large model technology.