Author: Chan Kha Vu

<aside> ✨

TL;DR

I analyzed the difference between the weights of DeepScaleR-1.5B-Preview and its base model, DeepSeek-R1-Distill-Qwen-1.5B. It turns out that DeepScaleR's GRPO training introduced only low-rank changes to the base model's weights. Furthermore, by converting the weight difference into a LoRA-like adapter with $r=64$ and applying it to the DeepSeek base model, I observed a +2.3% improvement on AIME 2024, consistent across runs.

</aside>


| Model | Avg Maj@8 | Avg Maj@16 | Avg Maj@32 |
| --- | --- | --- | --- |
| DScaler-LoraLike(r=64)-CopyOther | 0.5875 | 0.6 | 0.6125 |
| DeepScaleR-1.5B-Preview | 0.527083 | 0.56875 | 0.589583 |
| DeepSeek-R1-Distill-Qwen-1.5B | 0.433333 | 0.526667 | 0.546667 |
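
For concreteness, here is a minimal sketch of how such a merged model could be constructed with PyTorch and `transformers`. This is my reading of the setup, not the exact pipeline: the delta of every 2-D weight matrix is truncated to rank 64 via SVD, while all other parameters are copied verbatim from the tuned model (my assumption for what the “-CopyOther” suffix in the table means).

```python
import torch
from transformers import AutoModelForCausalLM

BASE = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"
TUNED = "agentica-org/DeepScaleR-1.5B-Preview"
RANK = 64

base = AutoModelForCausalLM.from_pretrained(BASE, torch_dtype=torch.float32)
tuned = AutoModelForCausalLM.from_pretrained(TUNED, torch_dtype=torch.float32)
tuned_sd = tuned.state_dict()

seen = set()
with torch.no_grad():  # in-place edits on model parameters
    for name, w in base.state_dict().items():
        if w.data_ptr() in seen:
            continue  # skip tied weights (e.g. lm_head tied to embeddings)
        seen.add(w.data_ptr())
        if w.ndim != 2:
            # Norms, biases, etc.: copy from the tuned model ("CopyOther")
            w.copy_(tuned_sd[name])
            continue
        delta = tuned_sd[name] - w
        # Truncated SVD: keep only the top-RANK singular directions of the delta
        U, S, Vh = torch.linalg.svd(delta, full_matrices=False)
        w.add_((U[:, :RANK] * S[:RANK]) @ Vh[:RANK])

base.save_pretrained("dscaler-loralike-r64-copyother")
```

Note that the loop also runs an SVD over the large embedding matrix, so it takes a while; how exactly the embeddings and `lm_head` were handled is an assumption on my part.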

Introduction

The recent release of the DeepSeek-R1 paper and the accompanying collection of R1-distilled models have inspired an avalanche of open-source attempts to replicate the success of RLVR (Reinforcement Learning from Verifiable Rewards) on reasoning tasks. Many Twitter users have reported their models reaching “aha moments” on GSM8K and other toy datasets. You can also experience the “aha moment” yourself using Unsloth or with Open-R1!

Observations from open-source RL attempts

Among all the RLVR replication attempts on Twitter, a few things caught my eye:

All of these third-party observations hint that we don’t need to change the base model much to enhance its reasoning ability (or, rather, its ability to chain bits of knowledge together and reason for longer).

Main Hypothesis

From the observations above, I formulated the following (totally non-rigorous) hypothesis:

<aside> 💡

Hypothesis: The ability to connect pre-existing “chunks of knowledge” within the model to generate longer chains of reasoning lives in a low-rank subspace.

</aside>

I want to note that the DoRA/LoRA experiments by @winglian mentioned above do not prove this hypothesis. First, GSM8K is a very easy dataset that does not require long CoT generation. Second, to fully examine the hypothesis, we need to show that full fine-tuning (FFT) with RLVR itself enhances a low-rank direction.
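
As a first sanity check on that claim, one can measure how much of each weight delta’s energy lies in its top singular directions. Below is a small helper I’d use for that; `lowrank_energy` is a hypothetical name of my own, not part of any existing analysis:

```python
import torch

def lowrank_energy(w_base: torch.Tensor, w_tuned: torch.Tensor, r: int = 64) -> float:
    """Fraction of the delta's squared Frobenius norm captured by its
    top-r singular directions; close to 1.0 means the full fine-tuning
    update is effectively low-rank."""
    delta = (w_tuned - w_base).float()
    s = torch.linalg.svdvals(delta)  # singular values, sorted descending
    return (s[:r].square().sum() / s.square().sum()).item()
```

If FFT with RLVR indeed concentrates its updates in a low-rank direction, this ratio should be high across layers even at small `r`.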

To provide more evidence for this hypothesis, I will verify the following: