From Scalable Imitation Learning to Scalable Exploration Learning

We are likely at the start of a significant paradigm shift in artificial intelligence (AI), comparable to how AlexNet marked the transformative arrival of deep learning in 2012 [1].

To frame this shift, consider that aside from supervised and unsupervised learning, AI research can be broadly divided into two additional categories, which often overlap in contemporary AI systems:

  • Imitation Learning (IL): A technique in AI where an agent learns to perform tasks by observing and mimicking the actions of an expert or human demonstrator.

  • Exploration Learning (EL): A learning strategy in AI where an agent independently discovers knowledge or skills by actively experimenting, interacting, and gathering information from its environment, rather than relying on provided demonstrations.

Both imitation learning and exploration learning existed well before 2012. However, the switch to deep learning, significantly accelerated by Transformers in 2017 [2], provided a scalable, generalizable approach for solving IL problems. This generality is demonstrated impressively in modern large language models (LLMs) such as GPT-4 and GPT-4.5 [3], which have been trained on vast portions of the Internet and fine-tuned to imitate an imaginary “assistant”.

Today, we are witnessing the emergence of a similar scalable framework for exploration learning. This new approach combines strong foundation models with extensive multi-stage reinforcement learning (RL), where feedback comes directly from the environment, or “reality” itself: examples include mathematical verifiers, code unit tests, and empirical validation in scientific problem-solving. Models such as OpenAI’s o1/o3/o4, DeepSeek-R1 [4], and Llama-Nemotron-Ultra [5] are examples of this emerging trend.
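The environment-grounded feedback described above can be made concrete with a toy sketch: a candidate program proposed by a model is graded by actually running unit tests, and the pass rate becomes the RL reward. All names here (`unit_test_reward`, `solve`) are illustrative assumptions, not any lab’s actual pipeline, and a real system would sandbox the execution.

```python
# Minimal sketch of a verifiable, environment-derived reward:
# a candidate program is graded by executing unit tests, and the
# pass rate serves as the reinforcement-learning reward signal.

def unit_test_reward(candidate_source: str, tests: list) -> float:
    """Run each (args, expected) test against the candidate's
    `solve` function; return the fraction of tests passed."""
    namespace = {}
    try:
        exec(candidate_source, namespace)  # compile and load the candidate
        solve = namespace["solve"]
    except Exception:
        return 0.0  # unrunnable code earns zero reward
    passed = 0
    for args, expected in tests:
        try:
            if solve(*args) == expected:
                passed += 1
        except Exception:
            pass  # a crashing test case simply earns no credit
    return passed / len(tests)

# A model-proposed candidate, and the tests that stand in for "reality":
candidate = "def solve(a, b):\n    return a + b"
tests = [((1, 2), 3), ((0, 0), 0), ((-1, 1), 0)]
print(unit_test_reward(candidate, tests))  # → 1.0
```

The key property is that the reward is checked against reality rather than against a human preference: no demonstration data is needed, and the signal remains correct even for solutions no human has written.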

This shift matters because, unlike IL, EL is likely to lead to (general) super-human intelligence: an imitator is bounded by the quality of its demonstrations, whereas an explorer that receives precise feedback from reality can surpass any human teacher. This potential has already been demonstrated within specific domains such as chess, Go [6], complex computer games [7], protein folding [8], and certain areas of geometry [9]. In each of these areas, success followed a remarkably consistent recipe: deep neural network heuristics combined with exploration (search or self-play) and precise, non-human-derived feedback. When applied to reasoning-capable LLMs, this recipe is becoming increasingly generalizable, opening pathways to breakthroughs across numerous domains.
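The recipe — a heuristic guiding exploration, with exact feedback from a verifier — can be illustrated on a toy search problem. In this hypothetical sketch, the hand-written `heuristic` stands in for a learned neural network value function, and `verify` supplies the precise, non-human-derived feedback; the problem (reach a target integer via the operations +3 and *2) is purely illustrative.

```python
import heapq

# Toy illustration of the recipe: a heuristic steers exploration,
# while an exact verifier provides non-human-derived feedback.

OPS = [("+3", lambda n: n + 3), ("*2", lambda n: n * 2)]

def verify(n: int, target: int) -> bool:
    """Exact feedback: did exploration reach the goal?"""
    return n == target

def heuristic(n: int, target: int) -> float:
    """Stand-in for a learned value function: distance to the target."""
    return abs(target - n)

def search(start: int, target: int, budget: int = 1000):
    """Best-first search: always expand the most promising state."""
    frontier = [(heuristic(start, target), start, [])]
    seen = {start}
    while frontier and budget > 0:
        budget -= 1
        _, n, path = heapq.heappop(frontier)
        if verify(n, target):
            return path  # a plan certified correct by the verifier
        for name, op in OPS:
            m = op(n)
            if m not in seen and m <= 10 * target:  # prune the state space
                seen.add(m)
                heapq.heappush(frontier, (heuristic(m, target), m, path + [name]))
    return None

print(search(2, 26))  # → ['+3', '*2', '*2', '+3', '+3']  (2→5→10→20→23→26)
```

A better heuristic (in real systems, a trained network) means fewer states explored for the same verified answer — which is exactly why combining learned heuristics with search and exact feedback scales.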

As with deep learning in 2012, the exact recipe will surely evolve many times, and many ideas and methods are yet to be discovered and published. However, I believe the general approach will scale rapidly from here. These models have today essentially solved math at a level above that of 99.99% of humans; coding will “fall” next (this year). After that come areas where reward acquisition is not instantaneous but still takes minutes, then areas where obtaining a reward takes longer still, and so on.

References

[1] Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. “ImageNet classification with deep convolutional neural networks.” Advances in neural information processing systems 25 (2012).

[2] Vaswani, Ashish, et al. “Attention is all you need.” Advances in neural information processing systems 30 (2017).

[3] Ouyang, Long, et al. “Training language models to follow instructions with human feedback.” Advances in neural information processing systems 35 (2022): 27730-27744.

[4] Guo, Daya, et al. “DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning.” arXiv preprint arXiv:2501.12948 (2025).

[5] NVIDIA. “Llama-3.1-Nemotron-Ultra-253B-v1.” https://huggingface.co/nvidia/Llama-3_1-Nemotron-Ultra-253B-v1

[6] Silver, David, et al. “Mastering the game of Go without human knowledge.” Nature 550.7676 (2017): 354-359.

[7] Berner, Christopher, et al. “Dota 2 with large scale deep reinforcement learning.” arXiv preprint arXiv:1912.06680 (2019).

[8] Jumper, John, et al. “Highly accurate protein structure prediction with AlphaFold.” Nature 596.7873 (2021): 583-589.

[9] Chervonyi, Yuri, et al. “Gold-medalist Performance in Solving Olympiad Geometry with AlphaGeometry2.” arXiv preprint arXiv:2502.03544 (2025).