April 2025
International Journal of Digital Law and Governance
Pre-trained large language models (LLMs), epitomized by ChatGPT, have leveraged a cornucopia of “big data” to attain substantial leaps in artificial intelligence (AI). As diminishing returns from pre-training and the depletion of available training data have become evident, a post-training scaling law powered by “big GPU” has surfaced as the dominant strategy. Since 2024, post-trained models exemplified by o1 and DeepSeek-R1 have been widely acclaimed for their successes in logic-intensive fields such as advanced scientific problem-solving, serving as a bellwether for artificial general intelligence (AGI). Driven by the two cardinal elements of computing power and task-specific datasets, the data training processes of post-trained models exhibit more erratic and less controllable tendencies, which may threaten core societal domains and precipitate systemic friction with the existing data governance regimes derived from pre-trained models. At this watershed moment, this article conducts a comprehensive comparison of the training data paradigms of pre-trained and post-trained models and develops cogent governance responses to mitigate the emerging risks. It concludes that data security must be established as a prerequisite for AI development and that a lifecycle-based governance framework for AI training data in blended models should be introduced in the transition toward “bigger AI models”.