Introduction

Welcome to AI Pocket References: NLP Collection. This compilation covers a broad range of Natural Language Processing topics including foundational LLM concepts, architectures, prompting techniques, fine-tuning approaches, and evaluation metrics. These concise references are designed for quick understanding and practical application.

Be sure to check out our other collections of AI Pocket References!

Chain of Thought

Reading time: 3 min

The Chain of Thought (CoT) prompting technique, introduced by Wei, Jason et al (2022), encourages an LLM to articulate its reasoning steps before arriving at a final answer to a given task.

Before its introduction, scaling LLMs had demonstrated the ability to generate coherent text and solve various tasks. However, these LLMs still underperformed on complex reasoning tasks like arithmetic and symbolic reasoning. While some prompting techniques and in-context learning had already been discovered, none had successfully enabled LLMs to handle complex reasoning tasks.

Figure: An LLM producing a chain of thought.

Original Implementation Details

CoT was originally introduced as a few-shot prompting technique in which each included exemplar is augmented with a chain of thought that explains how the final answer was determined. An example of such an exemplar, taken from the original paper, is provided below:

# An exemplar
exemplar:
  question: >
    Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each
    can has 3 tennis balls. How many tennis balls does he have now?
  chain of thought: >
    Roger started with 5 balls. 2 cans of 3 tennis balls each
    is 6 tennis balls. 5 + 6 = 11.
  answer: The answer is 11.

The authors used the same set of 8 exemplars across all tested benchmarks, with the exception of AQuA, for which 4 exemplars derived from the training set were used instead.
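
To make the few-shot setup concrete, the sketch below assembles such a prompt from a list of exemplars. The exemplar shown mirrors the one above; the Q:/A: template and the helper itself are illustrative assumptions, not the paper's exact prompt format.

# Building a few-shot CoT prompt from exemplars (illustrative sketch)
exemplars = [
    {
        "question": (
            "Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
            "Each can has 3 tennis balls. How many tennis balls does he have now?"
        ),
        "chain_of_thought": (
            "Roger started with 5 balls. 2 cans of 3 tennis balls each "
            "is 6 tennis balls. 5 + 6 = 11."
        ),
        "answer": "The answer is 11.",
    },
    # ... the remaining exemplars follow the same structure.
]


def build_cot_prompt(task_question: str) -> str:
    """Concatenate CoT exemplars, then pose the new question."""
    blocks = [
        f"Q: {ex['question']}\nA: {ex['chain_of_thought']} {ex['answer']}"
        for ex in exemplars
    ]
    blocks.append(f"Q: {task_question}\nA:")
    return "\n\n".join(blocks)


print(build_cot_prompt(
    "If there are 3 cars and each car has 4 wheels, how many wheels are there in total?"
))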

Performance

With larger models, CoT outperformed standard prompting across all tested reasoning benchmarks (mathematical, commonsense, and symbolic). For some of these, it even achieved state of the art results, beating out previous methods that relied on fine-tuning. However, CoT added little benefit for smaller models, leading the authors to posit it as an emergent ability of model scale.

Limitations

One of the noted limitations of CoT is the lack of guarantees on correct reasoning paths taken by the LLM. In other words, the reasoning steps that the LLM performs can be flawed, leading to inefficient token generation and potentially amplifying the issue of LLM hallucinations.

Modern Implementations

Since its introduction, the CoT prompting technique has become more flexible. Broadly speaking, it is now recognized as any prompting approach that elicits a chain of thought in the model's output. In practice, many implementations include general instructions in the prompt specifying the desired output format and other requirements. With such system instructions and output formats, CoT can also be applied in a zero-shot fashion.

# Example CoT prompt instructions
prompt:
  system: >
    You are a helpful assistant that is able to handle complex reasoning
    tasks. To arrive at the final answer, perform chain of thought steps
    and include these in your output.

    Structure your output using the following format
      <thought>
        chain of thought here
      </thought>
      <answer>
        answer here
      </answer>
  question: >
    Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each
    can has 3 tennis balls. How many tennis balls does he have now?
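
Given a response that follows this format, the reasoning and the final answer can be separated mechanically. The helper below is a hypothetical parsing sketch using regular expressions; it assumes the <thought>/<answer> tags specified in the prompt above.

import re


def parse_cot_response(text: str) -> dict:
    """Split a CoT-formatted response into its reasoning and answer parts."""
    thought = re.search(r"<thought>(.*?)</thought>", text, re.DOTALL)
    answer = re.search(r"<answer>(.*?)</answer>", text, re.DOTALL)
    return {
        "thought": thought.group(1).strip() if thought else None,
        "answer": answer.group(1).strip() if answer else None,
    }


# Example: a response following the format requested above.
parsed = parse_cot_response(
    "<thought>2 cans of 3 balls is 6. 5 + 6 = 11.</thought><answer>The answer is 11.</answer>"
)
print(parsed["answer"])  # The answer is 11.
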
  1. Wei, Jason, et al. "Chain-of-thought prompting elicits reasoning in large language models." Advances in neural information processing systems 35 (2022): 24824-24837.

LoRA

Open In Colab

Reading time: 3 min

Low-rank adaptation (LoRA) is a parameter-efficient fine-tuning (PEFT) technique introduced by Hu, Edward J. et al. (2021). The creators of LoRA posited that since trained deep learning models reside on a low intrinsic dimension, perhaps their weight-update matrices do as well.

Specifically, with LoRA, we learn a low-rank representation of the weight-update matrices for the dense, linear layers of a pre-trained LLM. The original weights of the LLM are frozen during fine-tuning, and only the low-rank weight-update matrices are learned. This reduction in dimensionality helps to amplify the most important or influential features of the model.

Figure: Illustrating a forward pass with LoRA.

Some Math

Let \(W\) represent the \(d\times d\) weight matrix for a dense, linear layer. We can then loosely represent an updated version (i.e. after fine-tuning) of this matrix as follows:

$$W_{\text{updated}} = W + \Delta W,$$

where \(\Delta W\) is the update matrix. With LoRA, it is \(\Delta W\) which we project into a low-rank space:

$$\Delta W \approx AB,$$

where \(A\) and \(B^T\) are both matrices of dimension \(d \times r\) with \(r \ll d\). During fine-tuning, \(W\) is frozen and only \(A\) and \(B\) are updated.

For inference (i.e., forward phase), let \(x\) be an input embedding, then by the distributive property

$$xW_{\text{updated}} = xW + x\Delta W \approx xW + xAB.$$
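
To get a feel for the savings, the snippet below counts trainable parameters for a single layer; the choices \(d = 4096\) and \(r = 8\) are hypothetical and only for illustration.

# Parameter savings for one hypothetical d x d layer with LoRA
d, r = 4096, 8
full_update_params = d * d           # 16,777,216 entries in a full update matrix
lora_params = d * r + r * d          # 65,536 entries across A and B
print(full_update_params // lora_params)  # 256x fewer trainable parameters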

Implementation Details

One modular implementation of LoRA involves the introduction of a LoRALayer that comprises only the \(A\) and \(B\) dense weights. In this way, a LoRALayer can adapt a pre-trained Linear layer.

import torch


class LoRALayer(torch.nn.Module):
    """A basic LoRALayer implementation."""

    def __init__(self, d_in: int, d_out: int, rank: int):
        super().__init__()  # required before registering parameters
        # A is initialized randomly and B with zeros, so that A @ B starts as a
        # zero update and fine-tuning begins from the pre-trained weights.
        self.A = torch.nn.Parameter(torch.randn(d_in, rank) / rank**0.5)
        self.B = torch.nn.Parameter(torch.zeros(rank, d_out))

    def forward(self, x):
        return x @ self.A @ self.B

With the LoRALayer defined in this way, one can then combine this with a Linear layer to implement the LoRA technique. See the supplementary Colab notebook linked at the top of this pocket reference for more details.
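
For illustration, here is one possible wrapper that pairs the LoRALayer above with a frozen Linear layer; the class name and structure are a sketch and may differ from the notebook's implementation.

class LinearWithLoRA(torch.nn.Module):
    """Adapt a frozen pre-trained Linear layer with a trainable LoRALayer."""

    def __init__(self, linear: torch.nn.Linear, rank: int):
        super().__init__()
        self.linear = linear
        self.lora = LoRALayer(linear.in_features, linear.out_features, rank)

    def forward(self, x):
        # Frozen pre-trained path plus the low-rank update path.
        return self.linear(x) + self.lora(x)


# Usage sketch: freeze the pre-trained weights, train only A and B.
base = torch.nn.Linear(768, 768)
for p in base.parameters():
    p.requires_grad = False
adapted = LinearWithLoRA(base, rank=8)
out = adapted(torch.randn(1, 768))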

Performance

In the original paper, the authors reported similar levels of performance when using LoRA compared to full fine-tuning on various natural language generation and understanding tasks.

Additional Benefits

Since LoRA matrices can be stored efficiently and separately from the pre-trained LLM weights, customization of these large models is highly scalable. Organizations can build libraries of specialized LoRA matrices for different datasets and domains, switching between them as needed for specific applications.

Limitations

  1. Hu, Edward J., et al. "LoRA: Low-rank adaptation of large language models." arXiv preprint arXiv:2106.09685 (2021).
  2. Raschka, Sebastian. Build a Large Language Model (From Scratch). Simon and Schuster, 2024.
  3. Mangrulkar, Sourab, et al. "PEFT: State-of-the-art Parameter-Efficient Fine-Tuning methods (LoRA methods)." 2022.
  4. Huang, Chengsong, et al. "LoraHub: Efficient cross-task generalization via dynamic LoRA composition." arXiv preprint arXiv:2307.13269 (2023).
  5. Fajardo, V.A. "LoRA PaperCard." 2023.

KV Cache

Reading time: 5 min

With autoregressive models like decoder-only LLMs (i.e., GPTs), inference is performed by predicting one token at a time, using past token generations as inputs for future ones. To predict future tokens, certain computed representations of these past tokens are required at every prediction step, and recalculating these representations from scratch at each token generation step is computationally inefficient.

kv-cache
Figure: KV Cache for autoregressive inference

To formalize this, let \(x_1, x_2, \ldots, x_{t-1}\) represent the input sequence of \(h\)-dimensional embeddings, i.e., \(x_i \in \mathbb{R}^{1\times h}\). For simplicity, let's consider a single Attention head and a single Transformer block. In order to get the logits for the next token, the LLM must compute the contextualized vector \(c_{t-1}\) given by:

$$ \begin{aligned} c_{t-1} &= f_{attn}(x_1, x_2, \ldots, x_{t-1}; W_k, W_v, W_q), \end{aligned} $$

where \(f_{attn}(\cdot)\) is the attention operator that produces a contextualized vector using all of the input embedding vectors, and \(W_k\), \(W_v\) and \(W_q\) are the \(h\times h\) projection matrices for keys, values, and queries, respectively. (Note that the Attention module computes the contextualized vectors of all input embeddings simultaneously, employing causal masking to ensure that each token only attends to itself and previous tokens in the sequence.)

Recall that with the attention operator, we first need to compute the various keys and values representations of the input embeddings as well as the query representation of \(x_{t-1}\):

$$ K_{t-1} = \begin{bmatrix} x_1 W_k \\ x_2 W_k \\ \vdots \\ x_{t-1} W_k \end{bmatrix}, \quad V_{t-1} = \begin{bmatrix} x_1 W_v \\ x_2 W_v \\ \vdots \\ x_{t-1} W_v \end{bmatrix}, \quad \text{and} \quad q_{t-1} = x_{t-1}W_q. $$

Using scaled-dot attention, we combine the keys with the query to derive an attention weights vector via:

$$ a_{t-1} = \text{Softmax}(q_{t-1} K_{t-1}^T / \sqrt{h}). $$

Finally, the contextualized vector of the \((t-1)\)-th token is the attention-weighted combination of the values vectors:

$$ c_{t-1} = a_{t-1} V_{t-1}. $$

The LLM ultimately makes use of this contextualized vector to build the logits for the \(t\)-th token prediction. Let's suppose that \(x_t\) is generated from such logits.

With \(x_t\) generated, we aim to predict the next token. To do so, we now need to build the contextualized vector, \(c_t\):

$$ \begin{aligned} c_t &= f_{attn}(x_1, x_2, \ldots, x_{t-1}, x_t; W_k, W_v, W_q). \end{aligned} $$

As before, we understand that in order to apply this operator, the following keys, values and query are required:

$$ K_{t} = \begin{bmatrix} x_1 W_k \\ x_2 W_k \\ \vdots \\ x_{t-1} W_k \\ x_t W_k \end{bmatrix}, \quad V_{t} = \begin{bmatrix} x_1 W_v \\ x_2 W_v \\ \vdots \\ x_{t-1} W_v \\ x_t W_v \end{bmatrix}, \quad \text{and} \quad q_t = x_{t}W_q. $$

It immediately follows though that

$$ K_{t} = \begin{bmatrix} K_{t-1} \\ x_t W_k \end{bmatrix} \quad \text{and} \quad V_{t} = \begin{bmatrix} V_{t-1} \\ x_t W_v \end{bmatrix}. $$

In other words, the keys and values required to build \(c_t\) consist of all the previous keys and values needed for \(c_{t-1}\) plus only the new key and value derived from the latest input embedding token \(x_t\).

This insight presents an opportunity to significantly reduce computational overhead during generation by caching and reusing past keys and values rather than recomputing them.

This is exactly the purpose of a KV Cache. At each iteration of inference, we compute only the newest key and value emanating from the latest input embedding token and append them to their respective caches: one cache for keys and one for values.

Algorithm: KV Cache for Autoregressive Inference

Pre-fill Stage
Given input sequence \(x_1, x_2, \ldots, x_n\)
Initialize key cache \(K_n = [x_1W_k; x_2W_k; \ldots; x_nW_k]\)
Initialize value cache \(V_n = [x_1W_v; x_2W_v; \ldots; x_nW_v]\)

Decode Stage
Loop for each token generation step t > n:
\(\quad\)Compute new key and value: \(k_t = x_t W_k\), \(v_t = x_t W_v\)
\(\quad\)Update caches by appending new key and value:
\(\qquad\)\(K_t = [K_{t-1}; k_t]\)
\(\qquad\)\(V_t = [V_{t-1}; v_t]\)
\(\quad\)Compute attention using cached keys and values:
\(\qquad\)\(q_t = x_t W_q\)
\(\qquad\)\(c_t = \text{Softmax}(q_t K_t^T / \sqrt{h}) V_t\)
\(\quad\)Compute next token logits using \(c_t\)
\(\quad\)Generate \(x_{t+1}\) // (which becomes part of the next step's input)
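
The following is a minimal single-head PyTorch sketch of the algorithm above; the class structure and shapes are illustrative and omit batching, multiple heads, and multiple layers.

import torch


class SingleHeadKVCache(torch.nn.Module):
    """Minimal single-head attention with a KV cache (illustrative only)."""

    def __init__(self, h: int):
        super().__init__()
        self.h = h
        self.W_k = torch.nn.Linear(h, h, bias=False)
        self.W_v = torch.nn.Linear(h, h, bias=False)
        self.W_q = torch.nn.Linear(h, h, bias=False)
        self.K = None  # cached keys,   shape (t, h)
        self.V = None  # cached values, shape (t, h)

    def step(self, x_t: torch.Tensor) -> torch.Tensor:
        """Process one new token embedding x_t of shape (1, h)."""
        k_t, v_t, q_t = self.W_k(x_t), self.W_v(x_t), self.W_q(x_t)
        # Append the new key/value to the caches instead of recomputing them all.
        self.K = k_t if self.K is None else torch.cat([self.K, k_t], dim=0)
        self.V = v_t if self.V is None else torch.cat([self.V, v_t], dim=0)
        a_t = torch.softmax(q_t @ self.K.T / self.h**0.5, dim=-1)
        return a_t @ self.V  # contextualized vector c_t, shape (1, h)


# Usage: pre-fill with the prompt embeddings, then decode one token at a time.
attn = SingleHeadKVCache(h=64)
for x in torch.randn(5, 1, 64):  # five prompt/decoded embeddings
    c = attn.step(x)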

Limitations

Note that LLMs use Multi-Head Attention modules across several Transformer layers. Each attention head in every layer maintains its own KV Cache, so the memory requirements can be substantial. As one example, Liu et al. (2024) note that for the 540B PaLM model, a batch size of 512 and a context length of 2048 would require a KV Cache taking upwards of 3TB of memory, more than the amount of memory required to hold the model's weight parameters.

This memory bottleneck becomes especially pronounced when serving multiple requests simultaneously or working with very long context windows.
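
As a rough back-of-the-envelope aid, the helper below estimates KV Cache size from a model's dimensions. The example configuration is hypothetical and is not intended to reproduce the PaLM figure above.

def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   batch: int, seq_len: int, bytes_per_value: int = 2) -> int:
    """Rough KV Cache size: keys + values, across all layers and heads."""
    return 2 * n_layers * n_kv_heads * head_dim * batch * seq_len * bytes_per_value


# Hypothetical dense-attention model served in fp16 with a long context.
size = kv_cache_bytes(n_layers=80, n_kv_heads=64, head_dim=128, batch=32, seq_len=4096)
print(f"{size / 1e9:.0f} GB")  # ≈ 344 GB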

  1. Liu, Zirui, et al. "Kivi: A tuning-free asymmetric 2bit quantization for kv cache." arXiv preprint arXiv:2402.02750 (2024).
  2. Raschka, Sebastian. Build a Large Language Model (From Scratch). Simon and Schuster, 2024.
  3. Rajan, R. "KV Cache - Understanding the Mechanism behind it." R4J4N Blogs, r4j4n.github.io/blogs/posts/kv/. Accessed 27 Feb. 2025.

DeepSeek-R1

Reading time: 7 min

The DeepSeek-R1 model was introduced by DeepSeek in January of 2025. It is derived from an earlier checkpoint of DeepSeek-V3. In particular, starting with DeepSeek-V3-base, four stages of fine-tuning were performed in order to arrive at the checkpoint known as DeepSeek-R1: (i) Reasoning Cold-Start (using SFT), (ii) RL for Reasoning (using GRPO), (iii) SFT for Enhanced Reasoning & General Capabilities (using RL-generated reasoning data sampled with Rejection Sampling), and (iv) RL for Alignment (to human preferences).

Figure: Illustrating DeepSeek-R1 model evolution.

As illustrated in the Figure above, the model lineage of DeepSeek-R1 implements a full-scale RL for Reasoning stage that leverages cold-start data. In contrast, DeepSeek-R1-Zero does not use any cold-start SFT data whatsoever and relies purely on RL to acquire its reasoning capabilities. The reward signal used to guide the RL process of DeepSeek-R1-Zero is rule-based, computed from the response's correctness as well as its adherence to the desired format. While DeepSeek-R1-Zero demonstrated remarkable reasoning capabilities, it suffered greatly from poor readability and language mixing.

This motivated the use of cold-start data in the RL for Reasoning stage of DeepSeek-R1's training. Additionally, a reward signal to reduce language mixing as well as a model-based reward (using DeepSeek-V3 as a judge) were also incorporated.
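
DeepSeek has not released the exact reward implementation; the sketch below only illustrates the flavour of a rule-based reward that combines answer correctness with format adherence. The tag names roughly follow the <think>/<answer> template described in the paper, while the weights and matching logic are assumptions.

import re


def rule_based_reward(response: str, reference_answer: str) -> float:
    """Illustrative rule-based reward: format adherence plus answer correctness."""
    # Format reward: reasoning and answer should appear inside the expected tags.
    format_ok = bool(
        re.search(r"<think>.*?</think>\s*<answer>.*?</answer>", response, re.DOTALL)
    )
    # Accuracy reward: the extracted final answer must match the reference.
    match = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)
    correct = match is not None and match.group(1).strip() == reference_answer.strip()
    return 0.5 * float(format_ok) + 1.0 * float(correct)  # weights are illustrative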

Historical Significance

At the time of its release, LLM reasoning models such as OpenAI's o-series had demonstrated remarkable performance on complex tasks, including those requiring multiple steps (e.g., OpenAI o3's breakthrough score on ARC-AGI). However, OpenAI, operating under a closed-source model, provided few details about how these models were developed, mentioning only that reinforcement learning techniques were used to train the LLMs to produce long (internal) chain-of-thought style reasoning prior to providing a final answer.

In contrast, DeepSeek open-sourced DeepSeek-R1 and provided a very detailed technical report, shedding much light on its training pipeline, which included an RL approach for the model to acquire its reasoning capabilities. It was also reported that DeepSeek-R1 was trained on NVIDIA H800's, a less capable GPU than the NVIDIA H100 or A100.

DeepSeek-V3 is trained on a cluster equipped with 2048 NVIDIA H800 GPUs.

(quoted from the DeepSeek-V3 Technical Report)

The fact that DeepSeek-R1's performance rivaled that of its closed-source counterpart, OpenAI o1, on multiple benchmarks (while reportedly using less compute) led to a frenzy in the LLM and broader AI community. As an example, many teams (including at least one from HuggingFace) worked tirelessly to produce their own open reproductions of DeepSeek-R1 in the days after its release.

Architectural Highlights

See DeepSeek-V3.

Training Data

The training data used for the four stages are described below:

Reasoning Cold Start: Thousands of samples of long CoT passages from multiple domains, verified by human annotators, were used.

RL for Reasoning: self-exploration via RL (no additional curated data), trained until convergence; the resulting model is referred to as the RL checkpoint.

SFT for Enhanced Reasoning & General Capabilities: the RL checkpoint was used to generate 600K reasoning-related samples (using rejection sampling). DeepSeek-V3 was additionally used to create 200K non-reasoning samples, omitting the CoT portion for simple queries.

RL for Alignment: a combination of reward signals and diverse data distributions, including preference pairs and analyses of generated summaries and responses.

Key Results

Below are three key results of DeepSeek-R1 and its development:

Benchmark (Metric) | Claude-3.5-Sonnet-1022 | GPT-4o-0513 | DeepSeek-V3 | OpenAI o1-mini | OpenAI o1-1217 | DeepSeek-R1
--- | --- | --- | --- | --- | --- | ---
MMLU (Pass@1) | 88.3 | 87.2 | 88.5 | 85.2 | 91.8 | 90.8
MMLU-Redux (EM) | 88.9 | 88.0 | 89.1 | 86.7 | - | 92.9
MMLU-Pro (EM) | 78.0 | 72.6 | 75.9 | 80.3 | - | 84.0
DROP (3-shot F1) | 88.3 | 83.7 | 91.6 | 83.9 | 90.2 | 92.2
IF-Eval (Prompt Strict) | 86.5 | 84.3 | 86.1 | 84.8 | - | 83.3
GPQA Diamond (Pass@1) | 65.0 | 49.9 | 59.1 | 60.0 | 75.7 | 71.5
SimpleQA (Correct) | 28.4 | 38.2 | 24.9 | 7.0 | 47.0 | 30.1
FRAMES (Acc.) | 72.5 | 80.5 | 73.3 | 76.9 | - | 82.5
AlpacaEval2.0 (LC-winrate) | 52.0 | 51.1 | 70.0 | 57.8 | - | 87.6
ArenaHard (GPT-4-1106) | 85.2 | 80.4 | 85.5 | 92.0 | - | 92.3
LiveCodeBench (Pass@1-COT) | 38.9 | 32.9 | 36.2 | 53.8 | 63.4 | 65.9
Codeforces (Percentile) | 20.3 | 23.6 | 58.7 | 93.4 | 96.6 | 96.3
Codeforces (Rating) | 717 | 759 | 1134 | 1820 | 2061 | 2029
SWE Verified (Resolved) | 50.8 | 38.8 | 42.0 | 41.6 | 48.9 | 49.2
Aider-Polyglot (Acc.) | 45.3 | 16.0 | 49.6 | 32.9 | 61.7 | 53.3
AIME 2024 (Pass@1) | 16.0 | 9.3 | 39.2 | 63.6 | 79.2 | 79.8
MATH-500 (Pass@1) | 78.3 | 74.6 | 90.2 | 90.0 | 96.4 | 97.3
CNMO 2024 (Pass@1) | 13.1 | 10.8 | 43.2 | 67.6 | - | 78.8
CLUEWSC (EM) | 85.4 | 87.9 | 90.9 | 89.9 | - | 92.8
C-Eval (EM) | 76.7 | 76.0 | 86.5 | 68.9 | - | 91.8
C-SimpleQA (Correct) | 55.4 | 58.7 | 68.0 | 40.3 | - | 63.7

Table: Comparison between DeepSeek-R1 and other representative models. (Copied from Table 4 of Guo, Daya, et al (2025).)

  1. Performance on Benchmarks: The table above, copied from the DeepSeek-R1 paper, compares the performance of DeepSeek-R1 and DeepSeek-V3 with representative models from Anthropic and OpenAI. The values reported clearly demonstrate the impressive performance of DeepSeek-R1 across various benchmarks and tasks. Most notably, DeepSeek-R1 was able to surpass OpenAI's reasoning model o1-1217 on several benchmarks.

  2. Distilling Reasoning Capabilities: The 800K samples that included examples generated by both DeepSeek-R1 (reasoning) and DeepSeek-V3 (non-reasoning) were used to distill other open-source models like Qwen and Llama. With only the application of SFT (i.e., no RL), some of these distilled models were not only able to outperform OpenAI's non-reasoning model GPT-4o-0513 across all benchmarks tested, but also OpenAI's o1-mini model on most benchmarks.

  3. RL's Potential: Pure RL empowered DeepSeek-R1-Zero to autonomously acquire robust reasoning capabilities without any SFT data. What's more, as test-time computation was increased, desirable behaviours such as reflection and re-evaluation of past trajectories emerged, making it possible for the model to have "aha moments" when solving complex tasks. This development serves as a reminder of the great potential of RL in the broader endeavour to push AI to new heights.

Limitations

DeepSeek reported various limitations for DeepSeek-R1. Most notably, DeepSeek-R1 is inferior to DeepSeek-V3 in general capabilities such as function calling, producing structured outputs (JSON), role-playing, and multi-turn conversations. Additionally, due to its optimization for English and Chinese, the model sometimes suffers from language mixing. Lastly, DeepSeek-R1 reportedly demonstrated a high sensitivity to prompts and long inference times, making it unsuitable for low-latency applications such as software-engineering tasks.

  1. Guo, Daya, et al. "Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning." arXiv preprint arXiv:2501.12948 (2025).
  2. Liu, Aixin, et al. "Deepseek-v3 technical report." arXiv preprint arXiv:2412.19437 (2024).
  3. China's DeepSeek sets off Nvidia investor panic over US export controls (appearing in fortune.com)
  4. Open-R1: a fully open reproduction of DeepSeek-R1 (by HuggingFace)
  5. DeepSeek-R1 is available on HuggingFace

DeepSeek-V3

Reading time: 7 min

The DeepSeek-V3 model was introduced by DeepSeek in December of 2024. It is an LLM that leverages a Mixture of Experts (MoE) architecture in its design.

Figure: Illustrating DeepSeek-V3 training evolution.

The training pipeline for DeepSeek-V3 consists of the two typical stages: pre-training and post-training. As depicted in the Figure above, the pre-training stage involves pre-training on 14.8T tokens followed by long-context extension using the YaRN methodology. Post-training of DeepSeek-V3 utilizes SFT as well as Reinforcement Learning methods.

Historical Significance

At the time of its release, open-source models had already been narrowing the performance gap with their closed-source counterparts. DeepSeek-V3 was yet another open-source model that achieved high levels of performance, beating other open-source alternatives as well as some closed-source models on various benchmarks. What made DeepSeek-V3's achievement even more intriguing was that it was reportedly trained using less compute than its closest counterparts.

Architectural Highlights

DeepSeek-V3 is a transformer-based model that swaps out nearly all dense feedforward layers for MoE layers. The model has a total of 671B parameters, but through its specialized variant of MoE (referred to as DeepSeekMoE), only 37B parameters are activated during both training and inference. Through a series of long-context extension fine-tuning steps, the maximum context length of the model was extended to 128K tokens.

DeepSeekMoE: Used to carry out training more efficiently, this MoE design consists of two sets of experts: shared and routed. Shared experts are applied to every token in the input sequence, whereas routed experts are selected according to their affinity with the input token (a toy sketch of this routing is given after these highlights).

Auxiliary-Loss-Free Load Balancing: When using an MoE architecture, one must consider load balancing across the experts to prevent routing collapse. This has typically been addressed by introducing an auxiliary loss, but if that loss has too great an influence, it can degrade model performance. DeepSeek-V3 instead uses a technique that requires no auxiliary loss, relying on a bias term that dynamically changes its value according to each expert's current workload.

Multi-Head Latent Attention (MLA): Used to make inference more efficient by jointly compressing attention keys and values to a lower dimension. The compression involves one linear projection matrix that compresses keys and values down and another that projects them back up. Only the compressed joint representation of keys and values needs to be cached during inference. For more details see MLA.

Multi-Token Prediction: In an effort to improve the training signal, DeepSeek-V3 expands the prediction scope to additional future tokens at every token position of the sequence. In other words, instead of only predicting the next immediate token and training the model on this signal, \(D\) future tokens are predicted. These tokens are predicted sequentially by \(D\) sequential multi-token prediction modules in order to maintain the causal chain. For more details see MTP.
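
To build intuition for the shared-plus-routed expert design and the bias-based balancing mentioned above, here is a highly simplified, hypothetical sketch; the dimensions, the sigmoid affinity, and the bias handling are schematic and do not reproduce the paper's exact formulation.

import torch


def moe_forward(x, shared_experts, routed_experts, gate, bias, top_k=2):
    """Shared experts always fire; routed experts are chosen by biased affinity."""
    scores = torch.sigmoid(gate(x))                     # token-to-expert affinities
    topk = torch.topk(scores + bias, k=top_k).indices   # bias affects selection only
    weights = scores[topk] / scores[topk].sum()         # gating uses the raw affinities
    out = sum(e(x) for e in shared_experts)
    out = out + sum(w * routed_experts[i](x) for w, i in zip(weights, topk.tolist()))
    return out


# Hypothetical tiny configuration: 1 shared expert and 8 routed experts.
d = 16
shared = [torch.nn.Linear(d, d)]
routed = torch.nn.ModuleList(torch.nn.Linear(d, d) for _ in range(8))
gate = torch.nn.Linear(d, 8, bias=False)
bias = torch.zeros(8)  # nudged up/down during training based on each expert's load
y = moe_forward(torch.randn(d), shared, routed, gate, bias)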

Parameter | Value
--- | ---
Total parameters | 671B
Activated parameters | 37B
Maximum context length | 128K tokens
Number of Transformer layers | 61
Hidden dimension size | 7168
Number of attention heads | 128
Number of experts (MoE) | 1 (shared) & 256 (routed)
Hidden dimension of experts | 2048
KV compression dimension size (MLA) | 512
Multi-token depth (MTP) | 1

Table 1: Summary of DeepSeek-V3 architecture and hyperparameters.

Training Data

The pre-training corpus is a revised version of the one used to train an earlier version of the model, DeepSeek-V2. In this revision, more samples pertaining to mathematics and programming were included. Ultimately, the dataset comprised 14.8T tokens.

Compute Details

DeepSeek-V3 was trained on a cluster with 2048 NVIDIA H800 GPUs. Each node within the cluster consists of 8 H800 GPUs inter-connected via NVLink and NVSwitch. In total, it was reported that only 2.664M H800 GPU hours were used for pre-training while subsequent training stages required only 0.1M GPU hours. One of the main reasons for this training efficiency was their application of an FP8 mixed precision training framework.
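
For a rough sense of scale, the technical report prices H800 time at $2 per GPU hour when estimating training cost; the arithmetic below simply combines that assumption with the hours quoted above.

# Back-of-the-envelope training cost using the report's $2/GPU-hour assumption
pretrain_hours = 2.664e6     # H800 GPU hours for pre-training
other_hours = 0.1e6          # approx. hours for the remaining training stages
cost = (pretrain_hours + other_hours) * 2.0
print(f"${cost / 1e6:.2f}M")  # ≈ $5.53M in rented GPU time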

Key Results

Benchmark (Metric) | # Shots | DeepSeek-V2 Base | Qwen2.5 72B Base | LLaMA-3.1 405B Base | DeepSeek-V3 Base
--- | --- | --- | --- | --- | ---
Pile-test (BPB) | - | 0.606 | 0.638 | 0.542 | 0.548
BBH (EM) | 3-shot | 78.8 | 79.8 | 82.9 | 87.5
MMLU (EM) | 5-shot | 78.4 | 85.0 | 84.4 | 87.1
MMLU-Redux (EM) | 5-shot | 75.6 | 83.2 | 81.3 | 86.2
MMLU-Pro (EM) | 5-shot | 51.4 | 58.3 | 52.8 | 64.4
DROP (F1) | 3-shot | 80.4 | 80.6 | 86.0 | 89.0
ARC-Easy (EM) | 25-shot | 97.6 | 98.4 | 98.4 | 98.9
ARC-Challenge (EM) | 25-shot | 92.2 | 94.5 | 95.3 | 95.3
HellaSwag (EM) | 10-shot | 87.1 | 84.8 | 89.2 | 88.9
PIQA (EM) | 0-shot | 83.9 | 82.1 | 85.9 | 84.7
WinoGrande (EM) | 5-shot | 86.3 | 82.3 | 85.2 | 84.9
RACE-Middle (EM) | 3-shot | 73.1 | 68.1 | 74.2 | 74.9
RACE-High (EM) | 5-shot | 52.6 | 50.3 | 56.8 | 51.3
TriviaQA (EM) | 5-shot | 80.0 | 71.9 | 82.7 | 82.9
NaturalQuestions (EM) | 5-shot | 38.6 | 33.2 | 41.5 | 40.0
AGIEval (EM) | 0-shot | 57.5 | 75.8 | 60.6 | 79.6
HumanEval (Pass@1) | 0-shot | 43.3 | 53.0 | 54.9 | 65.2
MBPP (Pass@1) | 3-shot | 65.0 | 72.6 | 68.4 | 75.4
LiveCodeBench-Base (Pass@1) | 3-shot | 11.6 | 12.9 | 15.1 | 19.4
CRUXEval-I (EM) | 2-shot | 52.5 | 59.1 | 58.5 | 67.3
CRUXEval-O (EM) | 2-shot | 49.8 | 59.9 | 59.9 | 69.8
GSM8K (EM) | 8-shot | 81.6 | 88.3 | 89.3 | 89.3
MATH (EM) | 4-shot | 43.4 | 54.4 | 49.0 | 61.6
MGSM (EM) | 8-shot | 63.6 | 76.2 | 69.9 | 79.8
CMath (EM) | 3-shot | 78.7 | 84.5 | 77.3 | 90.7
CLUEWSC (EM) | 5-shot | 82.0 | 82.5 | 83.0 | 82.7
C-Eval (EM) | 0-shot | 81.4 | 72.5 | 72.5 | 90.1
CMMLU (EM) | 5-shot | 84.0 | 89.5 | 73.7 | 88.8
CMRC (EM) | 1-shot | 77.4 | 75.8 | 76.0 | 76.3
C3 (EM) | 0-shot | 77.4 | 76.7 | 79.7 | 78.6
CCPM (EM) | 0-shot | 93.0 | 88.5 | 78.6 | 92.0
MMLU-non-English (EM) | 5-shot | 64.0 | 74.8 | 73.8 | 79.4

Table 2: Comparison between DeepSeek-V3 and other representative models. (Copied from Table 3 of Liu, Aixin, et al (2024).)

  1. Superior Open-Source Model: DeepSeek-V3 outperformed all other open-source models on educational benchmarks (MMLU, MMLU-Pro, GPQA), achieving performance levels that rival those of closed-source models such as GPT-4o and Claude-Sonnet-3.5. DeepSeek-V3 also achieved SOTA on math-related benchmarks (GSM8K, MATH, MGSM, CMath).

  2. Efficient Training: DeepSeek-V3 was trained using only 2.664M H800 GPU hours, leveraging an FP8 mixed precision training framework. This marked, as reported by the authors, the first successful use of an FP8 scheme to train a large-scale model.

  3. Reasoning Distillation: As part of the post-training step, DeepSeek-V3's creators were able to distill reasoning capabilities via long CoT passages generated by DeepSeek-R1. The authors noted that this pipeline improved reasoning performance while still maintaining the ability to produce desired output formats and efficient response lengths.

Limitations

DeepSeek-V3 requires a significant amount of compute infrastructure to ensure efficient inference, which can be prohibitive for smaller teams.

  1. Liu, Aixin, et al. "Deepseek-v3 technical report." arXiv preprint arXiv:2412.19437 (2024).
  2. DeepSeek sparks AI stock selloff; Nvidia posts record market-cap loss (appearing in reuters.com)
