DeepSeek-V3

The DeepSeek-V3 model was introduced by DeepSeek in December 2024. It is an LLM that leverages a Mixture-of-Experts (MoE) architecture in its design.

Figure: DeepSeek-V3 model lineage, illustrating the model's training evolution.

The training pipeline for DeepSeek-V3 consists of the two typical stages: pre-training and post-training. As depicted in the figure above, the pre-training stage involves training on 14.8T tokens followed by long-context extension using the YaRN methodology. Post-training of DeepSeek-V3 utilizes SFT as well as Reinforcement Learning methods.
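
For intuition about how YaRN-style long-context extension works, below is a minimal PyTorch sketch of its rotary-frequency rescaling: high-frequency RoPE dimensions are kept as-is, while low-frequency ones are interpolated to cover the longer window. The function name, toy dimensions, and ramp constants (`beta_fast`, `beta_slow`) are illustrative assumptions and do not reproduce DeepSeek-V3's actual long-context recipe.

```python
import math
import torch

def yarn_scaled_frequencies(d_head=64, base=10000.0, orig_ctx=4096, scale=32.0,
                            beta_fast=32.0, beta_slow=1.0):
    """Simplified YaRN-style ('NTK-by-parts') rescaling of RoPE inverse frequencies."""
    dims = torch.arange(0, d_head, 2).float()
    freqs = base ** (-dims / d_head)                 # standard RoPE inverse frequencies
    # How many full rotations each dimension completes over the original context window.
    rotations = orig_ctx * freqs / (2 * math.pi)
    # Ramp: 1 for high-frequency dims (keep as-is), 0 for low-frequency dims (interpolate fully).
    ramp = ((rotations - beta_slow) / (beta_fast - beta_slow)).clamp(0.0, 1.0)
    return ramp * freqs + (1.0 - ramp) * (freqs / scale)

freqs = yarn_scaled_frequencies()
print(freqs[:2])   # high-frequency dims: essentially unchanged
print(freqs[-2:])  # low-frequency dims: divided by `scale`, stretching them over the longer context
```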

Historical Significance

At the time of its release, open-source models had already been narrowing the performance gap with their closed-source counterparts. DeepSeek-V3 continued this trend, beating other open-source alternatives as well as some closed-source models on various benchmarks. What made DeepSeek-V3's achievement even more intriguing was that it was reportedly trained using less compute than its closest counterparts.

Architectural Highlights

DeepSeek-V3 is a transformer-based model that replaces nearly all dense feed-forward layers with MoE layers. The model has a total of 671B parameters, but through its specialized MoE variant (referred to as DeepSeekMoE), only 37B parameters are activated per token during both training and inference. Through a series of long-context extension fine-tuning steps, the maximum context length of the model was extended to 128K tokens.

DeepSeekMoE: Used to carry out training more efficiently, this MoE design consists of two sets of experts, namely shared and routed. Shared experts are applied to every token in the input sequence, whereas routed experts are selected per token according to their affinity with that token.
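
As a rough illustration of the shared/routed split, here is a minimal PyTorch sketch of a DeepSeekMoE-style layer. The class name `ToyDeepSeekMoELayer`, the toy dimensions, and the sigmoid-of-dot-product affinity score are simplifying assumptions rather than DeepSeek-V3's actual implementation.

```python
import torch
import torch.nn as nn

class ToyDeepSeekMoELayer(nn.Module):
    """Toy MoE layer with shared experts (always active) and routed experts (top-k per token)."""

    def __init__(self, d_model=64, d_expert=128, n_shared=1, n_routed=8, top_k=2):
        super().__init__()
        def expert():
            return nn.Sequential(nn.Linear(d_model, d_expert), nn.SiLU(),
                                 nn.Linear(d_expert, d_model))
        self.shared = nn.ModuleList(expert() for _ in range(n_shared))
        self.routed = nn.ModuleList(expert() for _ in range(n_routed))
        # One learnable centroid per routed expert; token-expert affinity = sigmoid(dot product).
        self.centroids = nn.Parameter(torch.randn(n_routed, d_model))
        self.top_k = top_k

    def forward(self, x):                                  # x: (n_tokens, d_model)
        shared_out = sum(e(x) for e in self.shared)        # shared experts see every token
        affinity = torch.sigmoid(x @ self.centroids.T)     # (n_tokens, n_routed)
        weights, idx = affinity.topk(self.top_k, dim=-1)   # pick top-k routed experts per token
        weights = weights / weights.sum(-1, keepdim=True)  # normalize the gating weights
        routed_out = torch.stack([
            sum(w * self.routed[int(e)](x[t]) for w, e in zip(weights[t], idx[t]))
            for t in range(x.shape[0])                     # per-token loop for clarity, not speed
        ])
        return x + shared_out + routed_out                 # residual + shared + routed paths

layer = ToyDeepSeekMoELayer()
print(layer(torch.randn(4, 64)).shape)                     # torch.Size([4, 64])
```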

Auxiliary-Loss-Free Load Balancing: When using an MoE architecture, one must balance the load across experts to prevent routing collapse. This has typically been addressed by introducing an auxiliary loss; however, if this loss has too great an influence, it can degrade model performance. DeepSeek-V3 instead uses a technique that requires no auxiliary loss, relying on a per-expert bias term that is dynamically adjusted according to each expert's current workload.
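
The sketch below illustrates this idea: a per-expert bias is added to the affinity scores only when selecting the top-k experts (the gating weights still use the raw affinities), and after each step the bias is nudged down for overloaded experts and up for underloaded ones. The function names, the sign-based update, and the step size `gamma` are illustrative assumptions rather than DeepSeek-V3's exact rule.

```python
import torch

def route_with_bias(scores, bias, top_k=2):
    """Select experts with bias-adjusted scores, but weight outputs with the raw affinities."""
    _, idx = (scores + bias).topk(top_k, dim=-1)        # selection uses biased scores
    weights = torch.gather(scores, -1, idx)             # gating uses the unbiased affinities
    weights = weights / weights.sum(-1, keepdim=True)
    return idx, weights

def update_bias(bias, idx, n_experts, gamma=0.001):
    """Nudge the bias down for overloaded experts and up for underloaded ones."""
    load = torch.bincount(idx.flatten(), minlength=n_experts).float()
    return bias - gamma * torch.sign(load - load.mean())

# toy demo: 16 tokens routed over 8 experts
scores = torch.sigmoid(torch.randn(16, 8))
bias = torch.zeros(8)
for step in range(3):
    idx, weights = route_with_bias(scores, bias)
    bias = update_bias(bias, idx, n_experts=8)
print(bias)   # experts that were picked too often now carry a negative bias
```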

Multi-Head Latent Attention (MLA): Used to make inference more efficient by jointly compressing the attention keys and values into a lower-dimensional latent vector. The compression involves a linear down-projection matrix that compresses keys and values and linear up-projection matrices that expand them back to full size. Only the compressed joint representation of the keys and values needs to be cached during inference. For more details see MLA.
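
The toy PyTorch sketch below shows just the key/value path of this idea: a single down-projection produces a small latent per token, which is all that gets cached, and up-projections rebuild full-size keys and values when attention is computed. The class and parameter names are assumptions, and the query path, RoPE handling, and DeepSeek-V3's real dimensions are omitted or reduced to toy sizes.

```python
import torch
import torch.nn as nn

class ToyLatentKVCache(nn.Module):
    """Cache only a jointly compressed KV latent per token; expand to full K/V on demand."""

    def __init__(self, d_model, n_heads, d_head, d_latent):
        super().__init__()
        self.w_down_kv = nn.Linear(d_model, d_latent, bias=False)        # compress: d_model -> d_latent
        self.w_up_k = nn.Linear(d_latent, n_heads * d_head, bias=False)  # expand latent -> keys
        self.w_up_v = nn.Linear(d_latent, n_heads * d_head, bias=False)  # expand latent -> values
        self.cache = []                                                   # stores only the latents

    def append(self, h):                          # h: (batch, d_model) hidden state of the new token
        self.cache.append(self.w_down_kv(h))

    def expanded_kv(self):
        c = torch.stack(self.cache, dim=1)        # (batch, seq, d_latent) -- the only cached tensor
        return self.w_up_k(c), self.w_up_v(c)     # full-size K and V, reconstructed when needed

cache = ToyLatentKVCache(d_model=64, n_heads=4, d_head=16, d_latent=8)
for _ in range(5):                                # simulate decoding five tokens
    cache.append(torch.randn(1, 64))
k, v = cache.expanded_kv()
print(k.shape, v.shape)                           # torch.Size([1, 5, 64]) each
```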

Multi-Token Prediction: In an effort to improve the training signal, DeepSeek-V3 expands the prediction scope to additional future tokens at every position in the sequence. In other words, instead of only predicting the next immediate token and training the model on this signal, $D$ additional future tokens are predicted. These tokens are predicted by $D$ sequential multi-token prediction (MTP) modules in order to maintain the causal chain. For more details see MTP.
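
Below is a hedged PyTorch sketch of the training objective with one extra prediction depth ($D = 1$): the main head predicts token $t+1$, while an MTP module, fed the embedding of token $t+1$, predicts token $t+2$. The module structure, the shared output head, and the loss weighting here are simplified assumptions rather than the paper's exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyMTPModule(nn.Module):
    """One MTP depth: merge previous hidden states with the embeddings of the tokens one
    step further ahead, then run a small causal transformer block."""

    def __init__(self, d_model):
        super().__init__()
        self.merge = nn.Linear(2 * d_model, d_model)
        self.block = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)

    def forward(self, prev_hidden, next_token_emb):
        h = self.merge(torch.cat([prev_hidden, next_token_emb], dim=-1))
        mask = nn.Transformer.generate_square_subsequent_mask(h.shape[1])  # keep it causal
        return self.block(h, src_mask=mask)

d_model, vocab = 64, 1000
embed = nn.Embedding(vocab, d_model)
head = nn.Linear(d_model, vocab)            # output head, shared between main model and MTP module
mtp = ToyMTPModule(d_model)                 # depth D = 1, as in DeepSeek-V3

tokens = torch.randint(0, vocab, (2, 16))   # (batch, seq)
main_hidden = torch.randn(2, 16, d_model)   # stand-in for the main model's final hidden states

# Main loss: position t predicts token t+1.
loss_main = F.cross_entropy(head(main_hidden[:, :-1]).transpose(1, 2), tokens[:, 1:])

# MTP loss at depth 1: position t also predicts token t+2, with token t+1 fed in through
# its embedding so the causal chain is preserved.
h1 = mtp(main_hidden[:, :-2], embed(tokens[:, 1:-1]))
loss_mtp = F.cross_entropy(head(h1).transpose(1, 2), tokens[:, 2:])
loss = loss_main + 0.3 * loss_mtp           # the MTP weighting factor here is illustrative
print(loss.item())
```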

| Parameter | Value |
|---|---|
| Total parameters | 671B |
| Activated parameters | 37B |
| Maximum context length | 128K tokens |
| Number of Transformer layers | 61 |
| Hidden dimension size | 7168 |
| Number of attention heads | 128 |
| Number of experts (MoE) | 1 (shared) & 256 (routed) |
| Hidden dimension of experts | 2048 |
| KV compression dimension size (MLA) | 512 |
| Multi-token depth (MTP) | 1 |
Table 1: Summary of DeepSeek-V3 architecture and hyperparameters.

Training Data

The pre-training corpus is a revised version of the one used to train an earlier model, DeepSeek-V2. In this revision, more samples pertaining to mathematics and programming were included. Ultimately, the dataset comprised 14.8T tokens.

Compute Details

DeepSeek-V3 was trained on a cluster of 2048 NVIDIA H800 GPUs, with each node consisting of 8 H800 GPUs interconnected via NVLink and NVSwitch. In total, it was reported that only 2.664M H800 GPU hours were used for pre-training, while subsequent training stages required only 0.1M GPU hours. One of the main reasons for this training efficiency was the use of an FP8 mixed-precision training framework.
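
As a rough illustration of the kind of quantization FP8 training relies on, here is a toy PyTorch sketch of block-wise FP8 quantization with per-block scaling factors, conceptually similar to (but much simpler than) the fine-grained scaling DeepSeek-V3 describes. The function name, block size, and use of `torch.float8_e4m3fn` (which requires a recent PyTorch build) are assumptions, not the actual training kernels.

```python
import torch

def quantize_fp8_blockwise(x, block=128):
    """Quantize to FP8 with one scaling factor per block, so an outlier in one block
    does not destroy precision everywhere else."""
    fp8_max = torch.finfo(torch.float8_e4m3fn).max
    rows = x.reshape(-1, block)                              # assumes the last dim is divisible by `block`
    scales = rows.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12) / fp8_max
    q = (rows / scales).to(torch.float8_e4m3fn)              # low-precision payload
    return q, scales                                          # dequantize with q.float() * scales

x = torch.randn(4, 256)
q, s = quantize_fp8_blockwise(x)
x_hat = (q.float() * s).reshape(x.shape)
print((x - x_hat).abs().max())                                # small per-block quantization error
```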

Key Results

| Benchmark (Metric) | # Shots | DeepSeek-V2 Base | Qwen2.5 72B Base | LLaMA-3.1 405B Base | DeepSeek-V3 Base |
|---|---|---|---|---|---|
| Pile-test (BPB) | - | 0.606 | 0.638 | 0.542 | 0.548 |
| BBH (EM) | 3-shot | 78.8 | 79.8 | 82.9 | 87.5 |
| MMLU (EM) | 5-shot | 78.4 | 85.0 | 84.4 | 87.1 |
| MMLU-Redux (EM) | 5-shot | 75.6 | 83.2 | 81.3 | 86.2 |
| MMLU-Pro (EM) | 5-shot | 51.4 | 58.3 | 52.8 | 64.4 |
| DROP (F1) | 3-shot | 80.4 | 80.6 | 86.0 | 89.0 |
| ARC-Easy (EM) | 25-shot | 97.6 | 98.4 | 98.4 | 98.9 |
| ARC-Challenge (EM) | 25-shot | 92.2 | 94.5 | 95.3 | 95.3 |
| HellaSwag (EM) | 10-shot | 87.1 | 84.8 | 89.2 | 88.9 |
| PIQA (EM) | 0-shot | 83.9 | 82.1 | 85.9 | 84.7 |
| WinoGrande (EM) | 5-shot | 86.3 | 82.3 | 85.2 | 84.9 |
| RACE-Middle (EM) | 3-shot | 73.1 | 68.1 | 74.2 | 74.9 |
| RACE-High (EM) | 5-shot | 52.6 | 50.3 | 56.8 | 51.3 |
| TriviaQA (EM) | 5-shot | 80.0 | 71.9 | 82.7 | 82.9 |
| NaturalQuestions (EM) | 5-shot | 38.6 | 33.2 | 41.5 | 40.0 |
| AGIEval (EM) | 0-shot | 57.5 | 75.8 | 60.6 | 79.6 |
| HumanEval (Pass@1) | 0-shot | 43.3 | 53.0 | 54.9 | 65.2 |
| MBPP (Pass@1) | 3-shot | 65.0 | 72.6 | 68.4 | 75.4 |
| LiveCodeBench-Base (Pass@1) | 3-shot | 11.6 | 12.9 | 15.1 | 19.4 |
| CRUXEval-I (EM) | 2-shot | 52.5 | 59.1 | 58.5 | 67.3 |
| CRUXEval-O (EM) | 2-shot | 49.8 | 59.9 | 59.9 | 69.8 |
| GSM8K (EM) | 8-shot | 81.6 | 88.3 | 89.3 | 89.3 |
| MATH (EM) | 4-shot | 43.4 | 54.4 | 49.0 | 61.6 |
| MGSM (EM) | 8-shot | 63.6 | 76.2 | 69.9 | 79.8 |
| CMath (EM) | 3-shot | 78.7 | 84.5 | 77.3 | 90.7 |
| CLUEWSC (EM) | 5-shot | 82.0 | 82.5 | 83.0 | 82.7 |
| C-Eval (EM) | 0-shot | 81.4 | 72.5 | 72.5 | 90.1 |
| CMMLU (EM) | 5-shot | 84.0 | 89.5 | 73.7 | 88.8 |
| CMRC (EM) | 1-shot | 77.4 | 75.8 | 76.0 | 76.3 |
| C3 (EM) | 0-shot | 77.4 | 76.7 | 79.7 | 78.6 |
| CCPM (EM) | 0-shot | 93.0 | 88.5 | 78.6 | 92.0 |
| MMLU-non-English (EM) | 5-shot | 64.0 | 74.8 | 73.8 | 79.4 |

Table 2: Comparison between DeepSeek-V3 and other representative models. (Copied from Table 3 of Liu, Aixin, et al. (2024).)
  1. Superior Open-Source Model: DeepSeek-V3 outperformed all other open-source models on educational benchmarks (MMLU, MMLU-Pro, GPQA), achieving performance that rivals closed-source models such as GPT-4o and Claude-3.5-Sonnet. DeepSeek-V3 also achieved SOTA on math-related benchmarks (GSM8K, MATH, MGSM, CMath).

  2. Efficient Training: DeepSeek-V3 was trained using only 2.664M H800 GPU hours, leveraging an FP8 mixed-precision training framework. As reported by the authors, this marked the first successful use of an FP8 scheme to train a model at such a large scale.

  3. Reasoning Distillation: As part of the post-training stage, the DeepSeek-V3 creators distilled reasoning capabilities from long chain-of-thought (CoT) outputs generated by DeepSeek-R1. The authors noted that this pipeline improved reasoning performance while still maintaining the ability to produce desired output formats and efficient response lengths.

Limitations

DeepSeek-V3 requires a significant amount of compute infrastructure to ensure efficient inference.

References

  1. Liu, Aixin, et al. "DeepSeek-V3 Technical Report." arXiv preprint arXiv:2412.19437 (2024).
  2. "DeepSeek sparks AI stock selloff; Nvidia posts record market-cap loss." Reuters.
