DeepSeek-R1

The DeepSeek-R1 model was introduced by DeepSeek in January 2025. It is derived from an earlier checkpoint of DeepSeek-V3. In particular, starting from DeepSeek-V3-Base, four stages of fine-tuning were performed to arrive at the checkpoint known as DeepSeek-R1: (i) Reasoning Cold Start (using SFT), (ii) RL for Reasoning (using GRPO), (iii) SFT for Enhanced Reasoning & General Capabilities (using RL-generated reasoning data sampled with rejection sampling), and (iv) RL for Alignment (to human preferences).
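
Although the full GRPO objective involves a clipped policy-ratio loss and a KL penalty, its defining ingredient is a group-relative advantage that replaces a learned critic. Below is a minimal Python sketch of that computation, assuming one scalar reward per sampled response; the function name and the example rewards are illustrative, not DeepSeek's code.

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Group-relative advantages in the spirit of GRPO.

    `rewards` holds one scalar reward per response in a group of G samples
    drawn for the *same* prompt. GRPO forgoes a learned value (critic) model
    and instead baselines each reward against the group's own statistics.
    """
    mean = rewards.mean()
    std = rewards.std()
    return (rewards - mean) / (std + eps)

# Example: a group of 4 sampled answers to one prompt, scored by a rule-based reward.
rewards = torch.tensor([1.0, 0.0, 0.0, 1.0])
print(grpo_advantages(rewards))  # correct answers receive positive advantage
```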

Lineage

Figure: Illustrating DeepSeek-R1 model evolution.

As illustrated in the Figure above, the model lineage of DeepSeek-R1 includes a full-scale RL for Reasoning stage that leverages cold-start data. In contrast, DeepSeek-R1-Zero does not use any cold-start SFT data whatsoever and acquires its reasoning capabilities purely through RL. The reward signal guiding the RL process of DeepSeek-R1-Zero is rule-based, computed from the response's correctness as well as its adherence to the desired format. While DeepSeek-R1-Zero demonstrated remarkable reasoning capabilities, it suffered greatly from poor readability and language mixing.
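
As a rough illustration of such a rule-based reward, the sketch below scores a response on (a) adherence to a `<think>...</think><answer>...</answer>` output format and (b) whether the extracted answer matches a reference. The relative weighting and the string-equality correctness check are assumptions for illustration; the paper relies on task-specific verifiers (e.g., for math and code).

```python
import re

THINK_ANSWER = re.compile(r"<think>.*?</think>\s*<answer>(.*?)</answer>", re.DOTALL)

def rule_based_reward(response: str, reference: str) -> float:
    """Score a response on format adherence and answer correctness."""
    match = THINK_ANSWER.search(response)
    format_reward = 1.0 if match else 0.0

    accuracy_reward = 0.0
    if match:
        predicted = match.group(1).strip()
        # The paper checks the final answer against the ground truth with a
        # verifier; plain string equality stands in for a real checker here.
        accuracy_reward = 1.0 if predicted == reference.strip() else 0.0

    # The relative weighting of the two terms is an assumption, not from the paper.
    return accuracy_reward + 0.1 * format_reward
```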

This motivated the use of cold-start data in the RL for Reasoning stage of DeepSeek-R1's training. In addition, a reward signal to reduce language mixing, as well as a model-based reward (using DeepSeek-V3 as a judge), was incorporated.
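
A rough sketch of how such a language-consistency term can be computed and folded into the overall reward is given below. The paper describes this reward as the proportion of target-language words in the CoT, summed with the task reward; the per-token classifier `is_target_language` and the exact implementation here are assumptions.

```python
from typing import Callable

def language_consistency_reward(cot_tokens: list[str],
                                is_target_language: Callable[[str], bool]) -> float:
    """Fraction of chain-of-thought tokens written in the target language.

    `is_target_language` is a hypothetical per-token classifier
    (e.g., a script or dictionary check).
    """
    if not cot_tokens:
        return 0.0
    return sum(is_target_language(t) for t in cot_tokens) / len(cot_tokens)

def combined_reward(task_reward: float, consistency: float) -> float:
    # The paper reports summing the task reward and the language-consistency
    # reward directly; any additional weighting would be an assumption.
    return task_reward + consistency
```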

Historical Significance

At the time of its release, LLM reasoning models such as OpenAI's o-series had demonstrated remarkable performance on complex tasks, including those requiring multiple steps (e.g., OpenAI o3's breakthrough score on ARC-AGI). However, OpenAI, operating under a closed-source model, provided few details about how these models were developed, merely mentioning that reinforcement learning techniques were used to train the LLMs to produce long (internal) chain-of-thought style reasoning before providing a final answer.

In contrast, DeepSeek open-sourced DeepSeek-R1 and provided a detailed technical report, shedding much light on its training pipeline, including the RL approach through which the model acquired its reasoning capabilities. It was also reported that DeepSeek-R1 was trained on NVIDIA H800s, a less capable GPU than the NVIDIA H100 or A100.

DeepSeek-V3 is trained on a cluster equipped with 2048 NVIDIA H800 GPUs.

(quoted from the DeepSeek-V3 Technical Report)

The fact that DeepSeek-R1's performance rivaled that of its closed-source counterpart, OpenAI's o1, on multiple benchmarks (using reportedly less compute) led to a frenzy in the LLM and broader AI community. As an example, many teams (including at least one from HuggingFace) worked tirelessly to produce their own versions of DeepSeek-R1 in the days after its release.

Architectural Highlights

See DeepSeek-V3.

Training Data

The training data used for the four stages are described below:

Reasoning Cold Start: thousands of samples of long CoT passages from multiple domains, verified by human annotators, were used.

RL for Reasoning: self-exploration by the model, with RL run until convergence on reasoning tasks; the resulting model is referred to as the RL checkpoint.

SFT for Enhanced Reasoning & General Capabilities: the RL checkpoint was then used to generate 600K reasoning-related samples via rejection sampling (see the sketch after this list). DeepSeek-V3 was used to create 200K non-reasoning samples, with the CoT portion omitted for simple queries.

RL for Alignment: a combination of reward signals and diverse data distributions, including preference pairs and analyses of generated summaries and responses.
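
Below is a minimal sketch of the rejection-sampling step referenced in the third stage above: sample several candidate responses per prompt from the RL checkpoint and keep only those that pass a filter (e.g., a correctness check or a DeepSeek-V3-based quality judgement). The helper names `generate` and `is_acceptable` and the candidate counts are hypothetical, not DeepSeek's pipeline code.

```python
from typing import Callable

def rejection_sample(prompts: list[str],
                     generate: Callable[[str], str],
                     is_acceptable: Callable[[str, str], bool],
                     n_candidates: int = 16,
                     keep_per_prompt: int = 1) -> list[tuple[str, str]]:
    """Build an SFT dataset by sampling from the RL checkpoint and keeping
    only responses that pass the filters (rejection sampling)."""
    dataset = []
    for prompt in prompts:
        candidates = [generate(prompt) for _ in range(n_candidates)]
        accepted = [c for c in candidates if is_acceptable(prompt, c)]
        # Keep at most `keep_per_prompt` accepted responses per prompt.
        dataset.extend((prompt, c) for c in accepted[:keep_per_prompt])
    return dataset
```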

Key Results

Below are three key results of DeepSeek-R1 and its development:

| Benchmark (Metric) | Claude-3.5-Sonnet-1022 | GPT-4o-0513 | DeepSeek-V3 | OpenAI o1-mini | OpenAI o1-1217 | DeepSeek-R1 |
| --- | --- | --- | --- | --- | --- | --- |
| MMLU (Pass@1) | 88.3 | 87.2 | 88.5 | 85.2 | 91.8 | 90.8 |
| MMLU-Redux (EM) | 88.9 | 88.0 | 89.1 | 86.7 | - | 92.9 |
| MMLU-Pro (EM) | 78.0 | 72.6 | 75.9 | 80.3 | - | 84.0 |
| DROP (3-shot F1) | 88.3 | 83.7 | 91.6 | 83.9 | 90.2 | 92.2 |
| IF-Eval (Prompt Strict) | 86.5 | 84.3 | 86.1 | 84.8 | - | 83.3 |
| GPQA Diamond (Pass@1) | 65.0 | 49.9 | 59.1 | 60.0 | 75.7 | 71.5 |
| SimpleQA (Correct) | 28.4 | 38.2 | 24.9 | 7.0 | 47.0 | 30.1 |
| FRAMES (Acc.) | 72.5 | 80.5 | 73.3 | 76.9 | - | 82.5 |
| AlpacaEval2.0 (LC-winrate) | 52.0 | 51.1 | 70.0 | 57.8 | - | 87.6 |
| ArenaHard (GPT-4-1106) | 85.2 | 80.4 | 85.5 | 92.0 | - | 92.3 |
| LiveCodeBench (Pass@1-CoT) | 38.9 | 32.9 | 36.2 | 53.8 | 63.4 | 65.9 |
| Codeforces (Percentile) | 20.3 | 23.6 | 58.7 | 93.4 | 96.6 | 96.3 |
| Codeforces (Rating) | 717 | 759 | 1134 | 1820 | 2061 | 2029 |
| SWE Verified (Resolved) | 50.8 | 38.8 | 42.0 | 41.6 | 48.9 | 49.2 |
| Aider-Polyglot (Acc.) | 45.3 | 16.0 | 49.6 | 32.9 | 61.7 | 53.3 |
| AIME 2024 (Pass@1) | 16.0 | 9.3 | 39.2 | 63.6 | 79.2 | 79.8 |
| MATH-500 (Pass@1) | 78.3 | 74.6 | 90.2 | 90.0 | 96.4 | 97.3 |
| CNMO 2024 (Pass@1) | 13.1 | 10.8 | 43.2 | 67.6 | - | 78.8 |
| CLUEWSC (EM) | 85.4 | 87.9 | 90.9 | 89.9 | - | 92.8 |
| C-Eval (EM) | 76.7 | 76.0 | 86.5 | 68.9 | - | 91.8 |
| C-SimpleQA (Correct) | 55.4 | 58.7 | 68.0 | 40.3 | - | 63.7 |

Table: Comparison between DeepSeek-R1 and other representative models. (Copied from Table 4 of Guo, Daya, et al. (2025).)

  1. Performance on Benchmarks: The table above, copied from the DeepSeek-R1 paper, compares the performance of DeepSeek-R1 and DeepSeek-V3 with representative models from Anthropic and OpenAI. The reported values clearly demonstrate the impressive performance of DeepSeek-R1 across various benchmarks and tasks. Most notably, DeepSeek-R1 was able to surpass OpenAI's reasoning model o1-1217 on several benchmarks.

  2. Distilling Reasoning Capabilities: The 800K samples, comprising examples generated by both DeepSeek-R1 (reasoning) and DeepSeek-V3 (non-reasoning), were used to distill other open-source models like Qwen and Llama. With only the application of SFT (i.e., no RL), some of these distilled models were able to outperform not only OpenAI's non-reasoning model GPT-4o-0513 across all benchmarks tested, but also OpenAI's o1-mini model on most benchmarks (a minimal sketch of such SFT-only distillation appears after this list).

  3. RL's Potential: Pure RL empowered DeepSeek-R1-Zero to autonomously acquire robust reasoning capabilities without any SFT data. What's more, as test-time computation increased, desirable behaviours such as reflection and re-evaluation of past trajectories emerged, making it possible for the model to have "aha moments" when solving complex tasks. This development should serve as a reminder of the great potential of RL and its place in the broader endeavour of AI to reach new heights.
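
To make the distillation result above concrete, the sketch below shows what SFT-only distillation onto a smaller open model can look like: standard causal-LM fine-tuning on (prompt, teacher-generated response) pairs. The student model name, the toy data pair, and the hyperparameters are illustrative assumptions, not DeepSeek's recipe.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical distillation corpus: (prompt, response) pairs where the
# responses were generated by DeepSeek-R1 / DeepSeek-V3.
pairs = [("Solve: 12 * 7 = ?", "<think>12 * 7 = 84</think><answer>84</answer>")]

model_name = "Qwen/Qwen2.5-7B"  # assumed student model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

model.train()
for prompt, response in pairs:
    batch = tokenizer(prompt + response, return_tensors="pt")
    # Standard causal-LM SFT: the student is trained to reproduce the
    # teacher-generated response token by token (labels == inputs).
    outputs = model(**batch, labels=batch["input_ids"])
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```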

Limitations

DeepSeek reported various limitations for DeepSeek-R1. Most notably, DeepSeek-R1 is inferior to DeepSeek-V3 in general capabilities such as function calling, producing structured outputs (e.g., JSON), role-playing, and multi-turn conversations. Additionally, because it is optimized for English and Chinese, the model sometimes suffers from language mixing. Lastly, DeepSeek-R1 reportedly demonstrated a high sensitivity to prompts, and its long inference times make it less suitable for low-latency applications and limited the large-scale application of RL to software-engineering tasks.

References

  1. Guo, Daya, et al. "DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning." arXiv preprint arXiv:2501.12948 (2025).
  2. Liu, Aixin, et al. "DeepSeek-V3 Technical Report." arXiv preprint arXiv:2412.19437 (2024).
  3. "China's DeepSeek sets off Nvidia investor panic over US export controls." Fortune (fortune.com).
  4. "Open-R1: a fully open reproduction of DeepSeek-R1." HuggingFace.
  5. DeepSeek-R1 model weights, available on HuggingFace.
