Author: Chan Kha Vu
<aside> ✨
I analyzed the difference between the weights of DeepScaleR-1.5B-Preview and its base model, DeepSeek-R1-Distill-Qwen-1.5B. It turns out that DeepScaleR's GRPO training introduced only low-rank changes to the base model. Furthermore, by converting the weight difference into a LoRA-like adapter with $r=64$ and adding it to the DeepSeek base model, I observed a +2.3% improvement on AIME 2024, consistent across runs.
</aside>
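For reference, here is a minimal sketch of how such an adapter can be constructed: diff the weights of the two checkpoints, truncate each layer's difference to rank $r=64$ with an SVD, and add the truncated difference back to the base weights. The model names are the actual Hugging Face repos; the handling of non-matrix parameters is my assumption (the "CopyOther" variant in the table below presumably copies them directly from the fine-tuned model).

```python
# Sketch: build a rank-r "LoRA-like" update from the weight difference
# between DeepScaleR and its base model. Assumes both checkpoints share
# the same architecture, so named_parameters() yields matching tensors.
import torch
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained(
    "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B", torch_dtype=torch.float32
)
tuned = AutoModelForCausalLM.from_pretrained(
    "agentica-org/DeepScaleR-1.5B-Preview", torch_dtype=torch.float32
)

r = 64
for (name, w_base), (_, w_tuned) in zip(
    base.named_parameters(), tuned.named_parameters()
):
    if w_base.ndim != 2:
        # 1-D params (norms, biases) have no low-rank structure; the
        # "CopyOther" variant presumably takes these from the tuned model.
        w_base.data.copy_(w_tuned.data)
        continue
    delta = w_tuned.data - w_base.data
    # Truncated SVD: keep only the top-r singular directions of the update.
    U, S, Vh = torch.linalg.svd(delta, full_matrices=False)
    w_base.data += U[:, :r] @ torch.diag(S[:r]) @ Vh[:r]

base.save_pretrained("DScaler-LoraLike-r64-CopyOther")
```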
| Model | Avg Maj@8 | Avg Maj@16 | Avg Maj@32 |
|---|---|---|---|
| DScaler-LoraLike(r=64)-CopyOther | 0.5875 | 0.6000 | 0.6125 |
| DeepScaleR-1.5B-Preview | 0.5271 | 0.5688 | 0.5896 |
| DeepSeek-R1-Distill-Qwen-1.5B | 0.4333 | 0.5267 | 0.5467 |
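A note on the metric: Maj@k (majority voting at k) counts a problem as solved when the most frequent final answer among k independent samples matches the ground truth. A minimal sketch of the scoring rule, assuming the final answers have already been parsed and normalized:

```python
from collections import Counter

def maj_at_k(answers: list[str], ground_truth: str) -> bool:
    """answers: final answers parsed from k independent samples."""
    majority, _count = Counter(answers).most_common(1)[0]
    return majority == ground_truth

# Three of five samples agree on the correct answer -> counted as solved.
assert maj_at_k(["42", "42", "7", "42", "13"], "42")
```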
The recent release of the DeepSeek-R1 paper and the accompanying collection of R1-distilled models have inspired an avalanche of open-source attempts to replicate the success of RLVR (Reinforcement Learning with Verifiable Rewards) on reasoning tasks. Many Twitter users have reported their models reaching “aha moments” on GSM8K and other toy datasets. You can experience the “aha moment” yourself using Unsloth or Open-R1!
Among all the RLVR replication attempts on Twitter, a few things caught my eye:
All these third-party observations hint that we don't need to change the base model much to enhance its reasoning ability (or rather, its ability to chain bits of knowledge together and reason for longer).
From the observations above, I formulated the following (totally non-rigorous) hypothesis:
<aside> 💡
Hypothesis: The ability to connect pre-existing “chunks of knowledge” within the model into longer chains of reasoning can be unlocked by weight changes that live in a low-rank subspace.
</aside>
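One direct way to probe this is to look at the singular value spectrum of the weight difference $\Delta W = W_{\text{tuned}} - W_{\text{base}}$ for each layer: if most of the spectral energy sits in the first few dozen directions, the update is effectively low-rank. A small sketch (the rank grid is arbitrary):

```python
import torch

def spectral_energy(delta: torch.Tensor, ranks=(8, 16, 32, 64, 128)) -> None:
    """Report what fraction of ||delta||_F^2 the top-r singular values carry."""
    S = torch.linalg.svdvals(delta.float())
    total = (S ** 2).sum()
    for r in ranks:
        print(f"rank {r:4d}: {(S[:r] ** 2).sum() / total:.1%} of energy")

# e.g. for one projection matrix, with base/tuned from the earlier sketch:
# spectral_energy(w_tuned.data - w_base.data)
```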
I want to note that the DoRA/LoRA experiments by @winglian above do not prove this hypothesis. For one, GSM8K is a very easy dataset and does not require long CoT generation. Secondly, to fully examine the hypothesis, we need to show that full fine-tuning (FFT) with RLVR also concentrates its weight updates along low-rank directions.
To provide more evidence for this hypothesis, I will verify the following: