Lifan Yuan$^*$, Weize Chen$^*$, Yuchen Zhang, Ganqu Cui, Hanbin Wang, Ziming You, Ning Ding, Zhiyuan Liu, Maosong Sun, Hao Peng
$^*$: Project co-lead. The order is determined randomly.
<aside> 💡
Training with RL on simple compositions (e.g., $f(g(x))$) enabled the model to spontaneously solve problems where functions were nested up to six levels deep, far beyond its training. In contrast, training solely on non-nested problems leads to poor generalization to more complex compositional problems. On tasks where the base model already performs well, it can match the RL-trained model's pass@1000 performance, which is often misinterpreted as a conclusive sign that RL does not teach new abilities. However, when we evaluated the RL-trained model on problems of higher compositional levels, its pass@1000 is significantly and increasingly better than the base model's.
</aside>

There has been an active debate on whether LLMs learn new skills through RL. In traditional RL, models are typically trained from scratch and ultimately learn strategies that can tackle similar downstream scenarios. The most famous example is probably AlphaZero, which was trained from scratch yet ended up beating human experts after learning moves and strategies in Go. However, revisiting this question in the context of LLMs yields surprising and concerning observations: "Aha Moments" are not emergent in RL but trace back to cognitive behaviors already present in base models [1], and the gap in pass@k between models before and after RL shrinks as k increases [2]. Consequently, a growing proportion of the LLM community is settling into the pessimistic view that LLMs do not learn new skills during RL training. This ambiguity raises several important research questions: (1) Does an LLM learn any new skill in RL? (2) If so, what is learned, and (3) how can we incentivize such learning?
Figure 1: A simplified illustration of the string transformation prediction task that serves as the testbed of this blog.
To answer these questions, in this blog we provide concrete evidence that LLMs do learn new skills in RL, at least by composing and generalizing existing skills to solve more complex problems, and we identify the conditions that must be met for such learning. We design a synthetic task, string transformation prediction, which requires models to predict an output string given input strings and string transformation functions. With this decontaminated task and on-policy rollouts only, our setup enables us to investigate the problem cleanly, without external noise or confounders. Our results show that an LLM does learn compositional skills if the RL training task requires composing atomic skills. The learned compositional skills generalize to more complex problems and even transfer to completely different tasks, provided the corresponding atomic skills have already been trained into the base model via next-token prediction (NTP).
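To make the task concrete, the following minimal sketch shows how such instances could be generated. The atomic functions, prompt format, and reward below are illustrative placeholders rather than our exact implementation; the key point is that a single depth parameter controls how many atomic skills must be composed.

```python
import random

# Illustrative atomic string transformations (placeholders, not the exact
# function set used in the experiments).
ATOMIC_FUNCS = {
    "reverse":    lambda s: s[::-1],
    "uppercase":  lambda s: s.upper(),
    "drop_first": lambda s: s[1:],
    "duplicate":  lambda s: s + s,
}

def make_instance(depth: int, rng: random.Random):
    """Sample a composition of `depth` atomic transformations and return a
    (prompt, answer) pair; the model must predict the output string."""
    names = [rng.choice(list(ATOMIC_FUNCS)) for _ in range(depth)]
    x = "".join(rng.choice("abcdef") for _ in range(6))
    answer, expr = x, x
    for name in names:                       # names[0] is the innermost call
        answer = ATOMIC_FUNCS[name](answer)
        expr = f"{name}({expr})"
    # Real prompts would also describe what each transformation does.
    prompt = f"Let x = '{x}'. What is {expr}?"
    return prompt, answer

def reward(model_output: str, answer: str) -> float:
    """Binary exact-match reward used for RL."""
    return float(model_output.strip() == answer)

rng = random.Random(0)
print(make_instance(depth=1, rng=rng))  # atomic (non-nested) problem
print(make_instance(depth=2, rng=rng))  # simple composition, e.g. f(g(x))
print(make_instance(depth=6, rng=rng))  # deeper nesting, used to test generalization
```

In this framing, RL on atomic skills corresponds to sampling depth-1 instances, RL on simple compositional tasks samples shallow depths such as 2, and generalization is evaluated on much deeper nesting.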
Figure 2: Illustration of the RL Compositionality Hypothesis. The base LLM first undergoes NTP training to acquire atomic skills. When trained with RL only on these atomic skills, it excels at them but fails to compose them. However, when trained with RL on even simple compositional tasks, it learns generalizable compositional skills.
Moreover, even when we follow the practice of existing work and measure pass@k accuracy, our results still challenge their conclusions: those conclusions hold only on tasks where the base model already achieves decent pass@k accuracy. We propose that the negative results in the existing literature may primarily stem from an inappropriate evaluation setup:
Figure 3: The claim that RL merely reranks existing abilities, drawn from pass@k comparisons, may be flawed.
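As a note on measurement, pass@k is usually computed with the standard unbiased combinatorial estimator popularized by HumanEval-style evaluations: pass@k = 1 - C(n-c, k) / C(n, k), averaged over problems, where n is the number of rollouts per problem and c the number of correct ones. The sketch below uses purely hypothetical numbers (not our experimental results) to show why, on problems the base model already solves occasionally, the gap at very large k is bound to shrink, whereas it cannot vanish on problems the base model never solves.

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k samples,
    drawn without replacement from n rollouts of which c are correct, succeeds."""
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Hypothetical numbers, purely for illustration. With k as large as 1000, a base
# model that solves a problem only 5 times in 2000 rollouts already scores ~0.97,
# so its gap to an RL-trained model nearly vanishes; the gap only persists on
# problems the base model never solves at all.
print(pass_at_k(n=2000, c=5,    k=1000))  # ~0.97
print(pass_at_k(n=2000, c=1200, k=1000))  # 1.0
print(pass_at_k(n=2000, c=0,    k=1000))  # 0.0
```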