Lifan Yuan$^*$, Weize Chen$^*$, Yuchen Zhang, Ganqu Cui, Hanbin Wang, Ziming You, Ning Ding, Zhiyuan Liu, Maosong Sun, Hao Peng
$^*$: Project co-lead. The order is determined randomly.
<aside> 💡
Training with RL on simple compositions (e.g., $f(g(x))$) enabled the model to spontaneously solve problems where functions were nested up to six levels deep, far beyond its training. In contrast, training solely on non-nested problems leads to poor generalization to more complex compositional problems. On tasks where the base model already performs well, it can match the RL-trained model's pass@1000 performance, which is often misinterpreted as a conclusive sign that RL does not teach new abilities. However, when we evaluated the RL-trained model on problems of higher compositional levels, its pass@1000 is significantly and increasingly better than the base model's.
</aside>

There has been an active debate on whether LLMs learn new skills through RL. In traditional RL, models are typically trained from scratch and ultimately learn strategies that can tackle similar downstream scenarios. The most famous example is probably AlphaZero, which was trained from scratch yet ended up beating human experts after learning moves and strategies in Go. However, revisiting this question in the context of LLMs yields surprising and concerning observations: "Aha Moments" are not emergent in RL but trace back to cognitive behaviors already present in base models [1], and the gap in pass@k between models before and after RL shrinks as k increases [2]. Consequently, a growing proportion of the LLM community is settling into the pessimistic view that LLMs do not learn new skills during RL training. This ambiguity raises several important research questions: (1) Does an LLM learn any new skill in RL? (2) If so, what is learned, and (3) how can we incentivize such learning?
Figure 1: A simplified illustration of the string transformation prediction task that serves as the testbed of this blog.
To answer these questions, in this blog we provide concrete evidence that LLMs do learn new skills in RL, at least by composing and generalizing existing skills to solve more complex problems, and we identify the conditions that must be met for such learning. We design a synthetic task, string transformation prediction, which requires models to predict an output string given input strings and string transformation functions. With this decontaminated task and on-policy rollouts only, our setup enables us to investigate the problem cleanly, without external noise or confounders. Our results show that an LLM does learn compositional skills if the RL training task requires composing atomic skills. The learned compositional skills generalize to more complex problems and even transfer to completely different tasks, provided the corresponding atomic skills have already been trained into the base model via next-token prediction (NTP).
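To make the task concrete, the following minimal sketch shows how such instances could be generated. The atomic functions, prompt format, and reward below are illustrative placeholders rather than our exact implementation; the key point is that a single depth parameter controls how many atomic skills must be composed.

```python
import random

# Illustrative atomic string transformations (placeholders, not the exact
# function set used in the experiments).
ATOMIC_FUNCS = {
    "reverse":    lambda s: s[::-1],
    "uppercase":  lambda s: s.upper(),
    "drop_first": lambda s: s[1:],
    "duplicate":  lambda s: s + s,
}

def make_instance(depth: int, rng: random.Random):
    """Sample a composition of `depth` atomic transformations and return a
    (prompt, answer) pair; the model must predict the output string."""
    names = [rng.choice(list(ATOMIC_FUNCS)) for _ in range(depth)]
    x = "".join(rng.choice("abcdef") for _ in range(6))
    answer, expr = x, x
    for name in names:                       # names[0] is the innermost call
        answer = ATOMIC_FUNCS[name](answer)
        expr = f"{name}({expr})"
    # Real prompts would also describe what each transformation does.
    prompt = f"Let x = '{x}'. What is {expr}?"
    return prompt, answer

def reward(model_output: str, answer: str) -> float:
    """Binary exact-match reward used for RL."""
    return float(model_output.strip() == answer)

rng = random.Random(0)
print(make_instance(depth=1, rng=rng))  # atomic (non-nested) problem
print(make_instance(depth=2, rng=rng))  # simple composition, e.g. f(g(x))
print(make_instance(depth=6, rng=rng))  # deeper nesting, used to test generalization
```

In this framing, RL on atomic skills corresponds to sampling depth-1 instances, RL on simple compositional tasks samples shallow depths such as 2, and generalization is evaluated on much deeper nesting.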
Figure 2: Illustration of the RL Compositionality Hypothesis. The base LLM first undergoes NTP training to acquire atomic skills. When trained with RL only on these atomic skills, it excels at them but fails to compose them. However, when trained with RL on even simple compositional tasks, it learns generalizable compositional skills.
Moreover, even when we follow the practice of existing work and measure pass@k accuracy, our results still challenge their conclusions: those conclusions hold only on tasks where the base model already achieves decent pass@k accuracy. We propose that the negative results in the existing literature may primarily stem from an inappropriate evaluation setup:
Figure 3: The claim that RL merely reranks existing abilities, drawn from pass@k comparisons, may be flawed.
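As a note on measurement, pass@k is usually computed with the standard unbiased combinatorial estimator popularized by HumanEval-style evaluations: pass@k = 1 - C(n-c, k) / C(n, k), averaged over problems, where n is the number of rollouts per problem and c the number of correct ones. The sketch below uses purely hypothetical numbers (not our experimental results) to show why, on problems the base model already solves occasionally, the gap at very large k is bound to shrink, whereas it cannot vanish on problems the base model never solves.

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k samples,
    drawn without replacement from n rollouts of which c are correct, succeeds."""
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Hypothetical numbers, purely for illustration. With k as large as 1000, a base
# model that solves a problem only 5 times in 2000 rollouts already scores ~0.97,
# so its gap to an RL-trained model nearly vanishes; the gap only persists on
# problems the base model never solves at all.
print(pass_at_k(n=2000, c=5,    k=1000))  # ~0.97
print(pass_at_k(n=2000, c=1200, k=1000))  # 1.0
print(pass_at_k(n=2000, c=0,    k=1000))  # 0.0
```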