With the rapid development of large-model technology, training data has become fundamental to the performance
gains and emergent capabilities of generative artificial intelligence. However, the complexity of training data in its
sources, structures, and modes of use continually gives rise to infringement risks concerning personal information
protection, copyright, and the allocation of data rights during model training and
content generation processes. Existing legal norms largely presuppose traditional models of data processing and
content production, and therefore struggle to respond effectively to the new risks posed by training data for
large models. Starting from the source types and legal attributes of training data, this paper systematically
surveys the main infringement risks across the full life cycle of large-model training data, analyzes their institutional
causes, and on this basis proposes a legal governance path centered on classified governance, risk orientation, and
liability allocation, so as to provide normative support for the compliant development of large-model technology.