Author: Chan Kha Vu

<aside> ✨

TL;DR

I analyzed the difference between the weights of DeepScaleR-1.5B-Preview and its base model, DeepSeek-R1-Distill-Qwen-1.5B. It turns out that DeepScaleR's GRPO training introduced only low-rank changes to the base model's weights. Furthermore, by converting the weight difference into a LoRA-like adapter with $r=64$ and applying it to the DeepSeek base model, I observed a +2.3% improvement on AIME 2024, consistent across runs.

</aside>


| Model | Avg Maj@8 | Avg Maj@16 | Avg Maj@32 |
| --- | --- | --- | --- |
| DScaler-LoraLike(r=64)-CopyOther | 0.5875 | 0.6 | 0.6125 |
| DeepScaleR-1.5B-Preview | 0.527083 | 0.56875 | 0.589583 |
| DeepSeek-R1-Distill-Qwen-1.5B | 0.433333 | 0.526667 | 0.546667 |
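
For concreteness, here is a minimal sketch of how such a merged model could be constructed with PyTorch and `transformers`. This is my reading of the setup, not the exact pipeline: the delta of every 2-D weight matrix is truncated to rank 64 via SVD, while all other parameters are copied verbatim from the tuned model (my assumption for what the “-CopyOther” suffix in the table means).

```python
import torch
from transformers import AutoModelForCausalLM

BASE = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"
TUNED = "agentica-org/DeepScaleR-1.5B-Preview"
RANK = 64

base = AutoModelForCausalLM.from_pretrained(BASE, torch_dtype=torch.float32)
tuned = AutoModelForCausalLM.from_pretrained(TUNED, torch_dtype=torch.float32)
tuned_sd = tuned.state_dict()

seen = set()
with torch.no_grad():  # in-place edits on model parameters
    for name, w in base.state_dict().items():
        if w.data_ptr() in seen:
            continue  # skip tied weights (e.g. lm_head tied to embeddings)
        seen.add(w.data_ptr())
        if w.ndim != 2:
            # Norms, biases, etc.: copy from the tuned model ("CopyOther")
            w.copy_(tuned_sd[name])
            continue
        delta = tuned_sd[name] - w
        # Truncated SVD: keep only the top-RANK singular directions of the delta
        U, S, Vh = torch.linalg.svd(delta, full_matrices=False)
        w.add_((U[:, :RANK] * S[:RANK]) @ Vh[:RANK])

base.save_pretrained("dscaler-loralike-r64-copyother")
```

Note that the loop also runs an SVD over the large embedding matrix, so it takes a while; how exactly the embeddings and `lm_head` were handled is an assumption on my part.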

Introduction

The recent release of the DeepSeek-R1 paper and the accompanying collection of R1-distilled models have inspired an avalanche of open-source attempts to replicate the success of RLVR (Reinforcement Learning from Verifiable Rewards) on reasoning tasks. Many Twitter users have reported their models reaching “aha moments” on GSM8K and other toy datasets. You can also experience the “aha moment” yourself using Unsloth or with Open-R1!

Observations from open-source RL attempts

Among all the RLVR replication attempts on Twitter, a few things caught my eye:

All of these third-party observations hint that we don’t need to change the base model much to enhance its reasoning ability (or, rather, its ability to chain bits of knowledge together and reason for longer).

Main Hypothesis

From the observations above, I formulated the following (totally non-rigorous) hypothesis:

<aside> 💡

Hypothesis: The ability to connect pre-existing “chunks of knowledge” within the model to generate longer chains of reasoning lives in a low-rank subspace.

</aside>

I want to note that the DoRA/LoRA experiments by @winglian mentioned above do not prove this hypothesis. First, GSM8K is a very easy dataset that does not require long CoT generation. Second, to fully examine the hypothesis, we need to show that full fine-tuning (FFT) with RLVR itself enhances a low-rank direction.
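
As a first sanity check on that claim, one can measure how much of each weight delta’s energy lies in its top singular directions. Below is a small helper I’d use for that; `lowrank_energy` is a hypothetical name of my own, not part of any existing analysis:

```python
import torch

def lowrank_energy(w_base: torch.Tensor, w_tuned: torch.Tensor, r: int = 64) -> float:
    """Fraction of the delta's squared Frobenius norm captured by its
    top-r singular directions; close to 1.0 means the full fine-tuning
    update is effectively low-rank."""
    delta = (w_tuned - w_base).float()
    s = torch.linalg.svdvals(delta)  # singular values, sorted descending
    return (s[:r].square().sum() / s.square().sum()).item()
```

If FFT with RLVR indeed concentrates its updates in a low-rank direction, this ratio should be high across layers even at small `r`.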

To provide more evidence for this hypothesis, I will verify the following: