Introduction
Welcome to AI Pocket References: NLP Collection. This compilation covers a broad range of Natural Language Processing topics including foundational LLM concepts, architectures, prompting techniques, fine-tuning approaches, and evaluation metrics. These concise references are designed for quick understanding and practical application.
Be sure to check out our other collections of AI Pocket References!
Chain of Thought
The Chain of Thought (CoT) prompting technique, introduced by Wei, Jason et al (2022), encourages an LLM to articulate its reasoning steps before arriving at a final answer to a given task.
Before its introduction, scaling LLMs had demonstrated the ability to generate coherent text and solve various tasks. However, these LLMs still underperformed on complex reasoning tasks like arithmetic and symbolic reasoning. While some prompting techniques and in-context learning had already been discovered, none had successfully enabled LLMs to handle complex reasoning tasks.
Original Implementation Details
CoT was originally introduced as a few-shot prompting technique where each included exemplar is augmented with a chain of thought that explains how the final answer was determined. An example of such an exemplar taken from the original paper is provided below:
```yaml
# An exemplar
exemplar:
  question: >
    Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each
    can has 3 tennis balls. How many tennis balls does he have now?
  chain of thought: >
    Roger started with 5 balls. 2 cans of 3 tennis balls each
    is 6 tennis balls. 5 + 6 = 11.
  answer: The answer is 11.
```
The authors used the same set of 8 exemplars across all tested benchmarks, with the exception of AQuA, for which 4 exemplars derived from the training set were used instead.
Performance
With larger models, CoT outperformed standard prompting across all tested reasoning benchmarks (mathematical, commonsense, and symbolic). For some of these, it even achieved state of the art results, beating out previous methods that relied on fine-tuning. However, CoT added little benefit for smaller models, leading the authors to posit it as an emergent ability of model scale.
Limitations
One of the noted limitations of CoT is the lack of guarantees on correct reasoning paths taken by the LLM. In other words, the reasoning steps that the LLM performs can be flawed, leading to inefficient token generation and potentially amplifying the issue of LLM hallucinations.
Modern Implementations
Since its introduction, the CoT prompting technique has become more flexible. Broadly speaking, it now refers to any prompting technique that elicits chain-of-thought output in the model's generation. To do so, many implementations include general instructions in the prompt, specifying the desired output format and other requirements. With such system instructions and output formats, CoT can also be implemented in a zero-shot fashion.
```yaml
# Example CoT prompt instructions
prompt:
  system: >
    You are a helpful assistant that is able to handle complex reasoning
    tasks. To arrive at the final answer, perform chain of thought steps
    and include these in your output.

    Structure your output using the following format:

    <thought>
    chain of thought here
    </thought>

    <answer>
    answer here
    </answer>
  question: >
    Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each
    can has 3 tennis balls. How many tennis balls does he have now?
```
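As a minimal illustration (not part of the original paper), the tagged output elicited by such a prompt can be parsed with a few lines of Python; the `<thought>` and `<answer>` tag names simply mirror the format specified above.

```python
import re


def parse_cot_output(text: str) -> dict:
    """Split a CoT-formatted completion into its reasoning and answer parts."""
    thought = re.search(r"<thought>(.*?)</thought>", text, re.DOTALL)
    answer = re.search(r"<answer>(.*?)</answer>", text, re.DOTALL)
    return {
        "thought": thought.group(1).strip() if thought else None,
        "answer": answer.group(1).strip() if answer else None,
    }


completion = (
    "<thought>Roger started with 5 balls. 2 cans of 3 tennis balls each "
    "is 6 tennis balls. 5 + 6 = 11.</thought>\n"
    "<answer>The answer is 11.</answer>"
)
print(parse_cot_output(completion))
```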
References & Useful Links
LoRA
Low-rank adaptation (LoRA) is a parameter-efficient fine-tuning (PEFT) technique introduced by Hu, Edward J. et al. (2021). The creators of LoRA posited that since trained deep learning models reside in low intrinsic dimensions, perhaps their weight-update matrices do as well.
Specifically, with LoRA, we learn a low-rank representation of the weight-update matrices of the dense, linear layers of a pre-trained LLM. The original weights of the LLM are frozen during fine-tuning, and only the low-rank weight-update matrices are learned at each step of fine-tuning. This reduction in dimensionality helps to amplify the most important or influential features of the model.
Some Math
Let \(W\) represent the \(d\times d\) weight matrix for a dense, linear layer. We can then loosely represent an updated version (i.e. after fine-tuning) of this matrix as follows:
$$W_{\text{updated}} = W + \Delta W,$$
where \(\Delta W\) is the update matrix. With LoRA, it is \(\Delta W\) which we project into a low-rank space:
$$\Delta W \approx AB,$$
where \(A\) and \(B^T\) are both matrices of dimension \(d \times r\) and \(r \ll d\). During fine-tuning, \(W\) is frozen and only \(A\) and \(B\) are updated.
For inference (i.e., forward phase), let \(x\) be an input embedding, then by the distributive property
$$xW_{\text{updated}} = xW + x\Delta W \approx xW + xAB.$$
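As a quick worked example (with illustrative numbers, not taken from the paper), consider a layer with \(d = 4096\) and a LoRA rank of \(r = 8\):

$$\underbrace{d \times d}_{\text{full } \Delta W} = 16{,}777{,}216 \quad \text{vs.} \quad \underbrace{d \times r + r \times d}_{A \text{ and } B} = 65{,}536,$$

a roughly 256-fold reduction in the number of trainable parameters for that layer.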
Implementation Details
One modular implementation of LoRA involves the introduction of a `LoRALayer` that comprises only the \(A\) and \(B\) dense weights. In this way, a `LoRALayer` can adapt a pre-trained `Linear` layer.
```python
import torch


class LoRALayer(torch.nn.Module):
    """A basic LoRALayer implementation."""

    def __init__(self, d_in: int, d_out: int, rank: int):
        super().__init__()
        # B starts at zero so the update A @ B is zero before fine-tuning begins.
        self.A = torch.nn.Parameter(torch.randn(d_in, rank) / rank**0.5)
        self.B = torch.nn.Parameter(torch.zeros(rank, d_out))

    def forward(self, x):
        return x @ self.A @ self.B
```
With the `LoRALayer` defined in this way, one can then combine it with a `Linear` layer to implement the LoRA technique, as sketched below. See the supplementary Colab notebook linked at the top of this pocket reference for more details.
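Below is one possible sketch of such a combination, assuming the `LoRALayer` from above; the class name `LinearWithLoRA` and the scaling factor `alpha` are illustrative choices rather than details prescribed by the original paper.

```python
class LinearWithLoRA(torch.nn.Module):
    """Wraps a frozen, pre-trained Linear layer and adds a trainable LoRA update."""

    def __init__(self, linear: torch.nn.Linear, rank: int, alpha: float = 1.0):
        super().__init__()
        self.linear = linear
        for param in self.linear.parameters():
            param.requires_grad_(False)  # freeze the original weights W
        self.lora = LoRALayer(linear.in_features, linear.out_features, rank)
        self.alpha = alpha  # scales the contribution of the LoRA update

    def forward(self, x):
        # x W (frozen) + alpha * x A B (trainable)
        return self.linear(x) + self.alpha * self.lora(x)


# Example: adapt a 768x768 projection with rank-8 LoRA matrices.
adapted = LinearWithLoRA(torch.nn.Linear(768, 768), rank=8)
```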
Performance
In the original paper, the authors reported similar levels of performance when using LoRA compared to full fine-tuning on various natural language generation and understanding tasks.
Additional Benefits
Since LoRA matrices can be stored efficiently and separately from the pre-trained LLM weights, customization of these large models is highly scalable. Organizations can build libraries of specialized LoRA matrices for different datasets and domains, switching between them as needed for specific applications.
Limitations
References & Useful Links
- Hu, Edward J., et al. "Lora: Low-rank adaptation of large language models." arXiv preprint arXiv:2106.09685, 2021.
- Raschka, Sebastian. Build a Large Language Model (From Scratch). Simon and Schuster, 2024.
- Sourab Mangrulkar et al. PEFT: State-of-the-art Parameter-Efficient Fine-Tuning methods (LoRA methods), 2022.
- Huang, Chengsong, et al. "Lorahub: Efficient cross-task generalization via dynamic lora composition." arXiv preprint arXiv:2307.13269 (2023).
- Fajardo V.A. LoRA PaperCard, 2023.
Agents
Agents are LLMs equipped with tools and memory to interact with the environment and complete specific, user-defined objectives. They go about this by following workflows which direct them in (i) planning what steps and tools are needed, (ii) executing an action, and (iii) reflecting on feedback from that action, looping through these steps when the initial plan requires multiple actions or when reflection suggests additional actions are needed to achieve the objective. Compared to an LLM on its own, the "plan-action-reflect" workflow of agents gives them a higher degree of agency and capacity for complex or long-term tasks, while tools offer the ability to learn up-to-date information from the environment and offload certain computations.
Components
Tools
Tools can include external data sets (including unstructured data such as PDF documents), web searches, APIs, custom functions, and even other agents. Each tool should fulfill a clear objective that is clearly communicated to the LLM through a short, formatted description, so that the LLM is aware of the tool's existence and can invoke it when necessary. However, since an LLM's input and output are text-based, invoking a tool means generating a structured output in a specific format, typically JSON or direct code. Standards for communication between tools and LLMs are also being established and propagated, with Anthropic's Model Context Protocol (2024) being a notable example.
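As a hedged illustration (the exact schema varies across frameworks and model providers), a tool description and the structured invocation an LLM might emit could look like the following, with all field names chosen purely for illustration:

```python
# A short, formatted tool description made available to the LLM.
weather_tool = {
    "name": "get_weather",
    "description": "Return the current temperature (Celsius) for a given city.",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}

# A structured tool invocation the LLM might generate in its output.
tool_call = {"tool": "get_weather", "arguments": {"city": "Toronto"}}
```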
Memory
The information retrieved from a tool updates the agent's memory, which also contains the context of its objective: the original user-defined objective, any other user input, as well as the results of prior planning, actions, and reflection. This memory is vital for the agent to complete its objective coherently and not be stuck in endless loops, conducting unnecessary actions, or offering irrelevant results. Memory does not have to be one continuous block. It can be separated into multiple sections with different persistence and frequencies of use. This is the case with LlamaIndex's composable memory.
LLM
The underlying LLM powering the agent's every move can be any language model, including the models covered in the Notable Models section of these pocket references (e.g., Llama-3, DeepSeek-R1, etc.).
Framework
The final component of an agent that cannot be overlooked is the "connective tissue" that enables the LLM to work together with its memory and its tools. This is the code that structures the user-defined objective and the tool descriptions into a prompt for the LLM, parses the LLM's output and routes the generated JSON/code to the corresponding tool, and incorporates tool results into memory and into the text passed back to the LLM for reflection. Some popular frameworks include CrewAI, MetaGPT, smolagents, LangGraph, and LlamaIndex.
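The sketch below illustrates this connective tissue at its simplest: a plan-act-reflect loop that prompts an LLM, dispatches any requested tool call, and folds the result back into memory. The `call_llm` callable and the JSON conventions are assumptions made for illustration; production frameworks handle prompting, parsing, and error recovery far more robustly.

```python
import json


def run_agent(objective: str, tools: dict, call_llm, max_steps: int = 5) -> str:
    """Minimal plan-act-reflect loop. `tools` maps tool names to callables;
    `call_llm` is any function that maps a prompt string to a completion string."""
    memory = [f"Objective: {objective}"]
    tool_specs = ", ".join(f"{name}: {fn.__doc__}" for name, fn in tools.items())
    for _ in range(max_steps):
        prompt = (
            f"Tools available: {tool_specs}\n"
            + "\n".join(memory)
            + '\nRespond with JSON: {"tool": ..., "arguments": ...} '
            'or {"answer": ...} if the objective is complete.'
        )
        decision = json.loads(call_llm(prompt))  # plan
        if "answer" in decision:
            return decision["answer"]
        result = tools[decision["tool"]](**decision["arguments"])  # act
        memory.append(f"Tool {decision['tool']} returned: {result}")  # reflect
    return "Objective not completed within the step budget."
```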
Applications
The potential applications for agents are quite broad. Some examples include:
- Personal scheduling assistant
- Customer service queue specialist
- Internet-of-things (IoT) hub manager
- Discussion forum moderator
- Lab research assistant
- etc.
While the above are theoretical applications, tangible agents have also begun making their way to the market. At the time of writing this reference, Deep Research (2025), a research assistant for synthesizing literature, and Manus (2025), for broader analysis and development, are two agents that have made headlines. Furthermore, Casper et al. (2025) have compiled an index to track existing agents and discover patterns. They have found agents "being deployed at a steadily increasing rate".
Limitations
Built on LLMs, agents can be computationally expensive and thus may be unnecessary for overly simplistic tasks. Due to their semi-autonomous nature, the possibility of an agent making inefficient LLM calls or, worse, getting stuck in feedback loops should be kept in mind. Extra care should be taken when implementing multi-agent collaboration, as a hallucination in one agent can affect the whole system. Additionally, if multiple agents are all built on the same LLM, any reasoning deficiency will be shared across all of them. Lastly, more work and time are needed to foster community trust in the viability of agents for everyday life.
Advances have been made to address these perceived shortcomings, with the foremost technique being fine-tuning. Supervised fine-tuning with instructions (Zhang et al., 2024), alignment and safety fine-tuning (Raschka, 2023), and reasoning fine-tuning with reinforcement learning (Luong et al., 2024) are among the promising fine-tuning techniques proposed.
Further Reading
While there is general agreement in the industry on what agents do, the specifics vary between sources and application settings. Therefore, it is useful to browse different definitions and see more perspectives. The links below include introductions to agents from Hugging Face, LlamaIndex, IBM, MIT, and Anthropic.
References & Useful Links
- Anthropic. "Introducing the Model Context Protocol." Anthropic News, anthropic.com/news/model-context-protocol. Accessed 14 Mar. 2025.
- LlamaIndex. "Single composable memory." LlamaIndex Documentation, docs.llamaindex.ai/en/stable/examples/agent/memory/composable_memory/. Accessed 14 Mar. 2025.
- OpenAI. "Introducing deep research." OpenAI Documentation, openai.com/index/introducing-deep-research/. Accessed 8 Mar. 2025.
- Manus. "Introducing Manus." Manus Home Page, manus.im/. Accessed 8 Mar. 2025.
- Casper, S., Bailey, L., Hunter, R., Ezell, C., Cabalé, E., Gerovitch, M., Slocum, S., Wei, K., Jurkovic, N., Khan, A., Christoffersen, P.J.K., Ozisik, A.P., Trivedi, R., Hadfield-Menell, D., and Kolt, N. "The AI Agent Index." (2025), DOI: 10.48550/arXiv.2502.01635.
- Zhang, S., Dong, L., Li, X., Zhang, S., Sun, X., Wang, S., Li, J., Hu, R., Zhang, T., Wu, F., and Wang, G. "Instruction tuning for large language models: a survey." (2024), DOI: 10.48550/arXiv.2308.10792.
- Raschka, S. "LLM training: RLHF and its alternatives." (2023), Ahead of AI, magazine.sebastianraschka.com/p/llm-training-rlhf-and-its-alternatives. Accessed 14, Mar. 2025.
- Luong, T.Q., Zhang, X., Jie, Z., Sun, P., Jin, X., and Li, H. "ReFT: reasoning with reinforced fine-tuning." (2024), DOI: doi.org/10.48550/arXiv.2401.08967.
- Hugging Face. "What is an agent?" Hugging Face Agents Course, huggingface.co/learn/agents-course/unit1/what-are-agents. Accessed 8 Mar. 2025.
- Fajardo, V.A. LlamaIndex GenAI Philippines talk, github.com/nerdai/talks/blob/main/2024/genai-philippines/genai-philippines.ipynb. Accessed 8 Mar. 2025.
- IBM. "What are AI agents?" IBM Think, ibm.com/think/topics/ai-agents. Accessed 8 Mar. 2025.
- Anthropic. "Building effective agents." Anthropic Engineering, anthropic.com/engineering/building-effective-agents. Accessed 14 Mar. 2025.
GitHub Repositories for Agent Frameworks
- crewAI. github.com/crewAIInc/crewAI
- MetaGPT. github.com/geekan/MetaGPT
- smolagents. github.com/huggingface/smolagents
- LangGraph. github.com/langchain-ai/langgraph
- LlamaIndex. github.com/run-llama/llama_index
RAG
Intro and Motivation for RAG
After an LLM has been pre-trained, its learning is captured as parametric knowledge. This is jargon simply meaning that the knowledge is captured in the LLM's weight parameters. If the LLM is further fine-tuned for improved instruction following or alignment, these knowledge specializations are also parametric in nature (i.e., since they involve weight parameter updates).
However, researchers have observed that relying only on the LLM's parametric knowledge can be suboptimal, especially when performing knowledge-intensive tasks. Some have argued that long-tail knowledge is not easily captured in parametric form.
To remedy this drawback of an LLM's parametric knowledge, we can consider providing an LLM with non-parametric knowledge. Retrieval-Augmented Generation (RAG) is one such technique that aims to provide knowledge in the form of additional context to an LLM at inference time. As its name suggests, this method involves retrieving facts (i.e., knowledge) from a data store and augmenting (e.g., by string concatenation) the original prompt or query to the LLM with these facts.
Components of a RAG System
A RAG system comprises three main components, namely:
- Knowledge Store — contains non-parametric knowledge facts that the system can use at inference time in order to produce more accurate responses to queries.
- Retriever — a model that takes in a user query and retrieves the most relevant knowledge facts from the knowledge store. (NOTE: the retriever is also used to populate or index the knowledge store during setup.)
- Generator — a model that takes in the user's query and additional context and provides a response to that query.
Canonical RAG Pipeline
The canonical pipeline for RAG is as follows (a minimal code sketch follows the list):
- User submits a query to the RAG system
- [Retrieval Step] The RAG system matches the query with the relevant facts from the knowledge store. The top k matched facts are retrieved.
- [Generation Step] The content of the retrieved facts is used to augment the query, which is subsequently passed to the generator.
- Response is returned back to the user. (Post-processing steps may be applied to the raw result from the generator prior to returning it to the user.)
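A minimal sketch of the pipeline above is given below; `embed`, `knowledge_store`, and `generate` stand in for an embedding model, a vector store, and an LLM generator, and are assumptions made for illustration.

```python
def rag_answer(query: str, knowledge_store, embed, generate, top_k: int = 3) -> str:
    """Canonical retrieve-then-generate pipeline."""
    # Retrieval step: match the query against the knowledge store.
    query_vector = embed(query)
    facts = knowledge_store.search(query_vector, top_k=top_k)

    # Generation step: augment the query with the retrieved facts.
    context = "\n".join(fact.text for fact in facts)
    prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    return generate(prompt)
```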
Evaluation of RAG Systems
Evaluating RAG systems is often not a trivial task. A common approach is to evaluate the respective components separately, namely the retriever and the generator.
Evaluation of Retriever
Retrievers are evaluated based on the correctness of the retrieved facts. Given a "labelled" example containing the query as well as its associated facts, we can compute metrics such as hit rate and normalized discounted cumulative gain (NDCG). The former computes the fraction of retrievals that returned the correct knowledge artifact over the number of queries (or retrieval tasks). While hit rate does not take into account the order in which knowledge facts are retrieved, NDCG incorporates this ordering in its calculation, rewarding retrievals where the correct knowledge artifacts appear in the highest-ranked positions.
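As a small illustration of these metrics (assuming a single relevant artifact per query, which keeps the ideal DCG at 1), hit rate and a simplified NDCG could be computed as follows:

```python
import math


def hit_rate(retrieved_ids: list[list[str]], relevant_ids: list[str]) -> float:
    """Fraction of queries whose relevant artifact appears in the retrieved list."""
    hits = sum(rel in retrieved for retrieved, rel in zip(retrieved_ids, relevant_ids))
    return hits / len(relevant_ids)


def ndcg(retrieved_ids: list[list[str]], relevant_ids: list[str]) -> float:
    """Simplified NDCG with one relevant artifact per query (ideal DCG = 1)."""
    scores = []
    for retrieved, rel in zip(retrieved_ids, relevant_ids):
        if rel in retrieved:
            rank = retrieved.index(rel) + 1  # 1-indexed position of the hit
            scores.append(1.0 / math.log2(rank + 1))
        else:
            scores.append(0.0)
    return sum(scores) / len(scores)


# Example: two queries, top-3 retrievals each.
print(hit_rate([["a", "b", "c"], ["d", "e", "f"]], ["b", "x"]))  # 0.5
print(ndcg([["a", "b", "c"], ["d", "e", "f"]], ["b", "x"]))      # ~0.32
```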
Evaluation of Generator
Generator responses can be evaluated via human scoring, where a human assesses the response to the query given the retrieved context. The human can provide a numerical score to indicate how well the generator answers the query with the provided context. Metrics such as faithfulness and accuracy are often computed. However, human marking is expensive, and thus another strategy makes use of LLMs (i.e., LLM-as-a-judge) to perform the grading.
Limitations
While RAG has demonstrated success in providing LLMs with sufficient context in order to perform well across various knowledge-intensive benchmarks, building a RAG system involves many system-level parameters, and tuning these to achieve sufficient performance is non-trivial.
Examples of these system-level parameters include:
- On representing knowledge (i.e., knowledge store setup)
- chunk size — when populating the knowledge store, texts are chunked in order to ensure that queries along with context are within the context windows of the LLM generator
- hierarchical representations — knowledge facts may depend on one another or may contain levels of hierarchy that should be captured in the knowledge store. Advanced knowledge representations via knowledge graphs are also an option, but they come with their own challenges in reaching satisfactory performance (i.e., how to set up the knowledge graph optimally).
- On retrieval
- matching query to knowledge facts — the raw user query may need some processing in order to increase the chances of finding relevant facts from the knowledge store. (e.g., query re-write or agentic planning)
- On generation
- hallucinations — in the event that no relevant facts are retrieved, there is still a risk of LLM hallucinations.
Advanced Techniques
In this section, we present a few advanced techniques for building RAG systems. Generally speaking, advanced methods aim to address the two main requirements for success of a RAG system, namely:
- Retrieval must be able to find the most relevant knowledge facts for the user query.
- Generation must be able to make good use of the retrieved knowledge facts.
Advanced techniques can be viewed as addressing one of these requirements or both simultaneously. Examples include individually fine-tuning the embedding model or the LLM to improve retrieval or generation alone, whereas dual fine-tuning of both can be used to address the two requirements simultaneously. See the cheat sheet linked in the references below for more advanced RAG designs.
Frameworks
RAG Inference Frameworks
There are a handful of popular open-source frameworks that help build RAG systems on top of the user's own data sources. These frameworks are useful for quick prototyping of RAG systems, supporting both basic and advanced designs. Another major advantage of these frameworks is their vast set of integrations with other tools in the tech stack (e.g., vector stores, LLM providers, both closed and open options, embedding models, observability tools, etc.).
Three popular frameworks for RAG include:
- LlamaIndex - https://github.com/run-llama/llama_index
- LangChain - https://github.com/langchain-ai/langchain
- Haystack by Deepset - https://github.com/deepset-ai/haystack
RAG Finetuning Frameworks
The previous section mentioned frameworks that are effective for building RAG inference systems. For fine-tuning RAG systems, under both centralized and federated settings, the Vector Institute has developed fedRAG: https://github.com/VectorInstitute/fed-rag.
The fedRAG framework features a lightweight interface for turning a centralized model training pipeline into a federated task. Additionally, it boasts integrations with popular deep-learning tools and frameworks, including PyTorch and HuggingFace.
References & Useful Links
- Liu, Jerry. "LlamaIndex." GitHub, Nov. 2022. DOI: 10.5281/zenodo.1234. github.com/run-llama/llama_index
- Fajardo, Andrei. "A Cheat Sheet and Some Recipes For Building Advanced RAG." LlamaIndex Blog, 5 Jan. 2024, medium.com/llamaindex-blog/a-cheat-sheet-and-some-recipes-for-building-advanced-rag.
- Rag Bootcamp. GitHub, Vector Institute, github.com/VectorInstitute/rag_bootcamp.
- Lewis, Patrick, et al. "Retrieval-augmented generation for knowledge-intensive nlp tasks." Advances in neural information processing systems 33 (2020): 9459-947
Quantization
1. Introduction to Quantization
Quantization reduces the numerical precision of the weights and activations of a neural network from a high-precision datatype to a lower-precision datatype. The standard datatype for neural network weights and activations is fp32 (float32), but quantization methods enable use of lower-precision representations, most commonly 8-bit integers (int8) and 16-bit floats (float16 or bfloat16). Even 4-bit integers see use. Think of it like compressing a high-definition image into a lower-resolution format – you lose some detail, but gain efficiency.
Why Quantization?
- Lower Memory Requirements: Reduced bit-depth directly translates to smaller model memory footprint, saving memory capacity and bandwidth during model storage and inference. This also makes it feasible to deploy larger, more performant models on resource-constrained devices.
- Faster Throughput: Lower precision is significantly faster and more energy-efficient on modern hardware. Modern GPUs with tensor cores (like NVIDIA's Ampere and Hopper architectures) can perform low and mixed-precision matrix multiplications with significantly higher throughput than FP32.
2. How Quantization Works
Quantization maps values from the range of a high-precision datatype onto a smaller set of lower-precision values. This mapping is typically done using affine quantization or symmetric quantization. For example, an affine quantization from fp32 to int8 involves two parameters for each tensor being quantized:
- Scale (S): A positive floating-point number that determines the step size between quantized integer values. It essentially scales the integer range back to the floating-point range.
- Zero-point (Z): An integer that represents the floating-point value 0 in the quantized integer space. This is important for accurately representing zero values, which are common in neural networks (e.g., after ReLU activations).
The relationship between a floating-point value (\(x\)) and its quantized integer representation (\(x_q\)) is defined by:
$$x = S (x_q - Z).$$
To quantize a float value \(x\) to its integer representation \(x_q\), we solve:
$$x_q = \text{round}\left(\frac{x}{S} + Z\right).$$
Values outside the representable range of the target type (e.g., [-128, 127] for int8) are typically clipped to the nearest representable value. Symmetric quantization is a simplified version where the zero-point (Z) is forced to be 0. This is achieved by choosing a symmetric range around zero for quantization (e.g., [-max_abs_value, +max_abs_value]).
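The short sketch below applies the formulas above to quantize a single tensor to int8 and back (a minimal, per-tensor illustration; production implementations also handle per-channel scales, rounding modes, and calibration).

```python
import torch


def affine_quantize(x: torch.Tensor, qmin: int = -128, qmax: int = 127):
    """Quantize a float32 tensor to int8 with a per-tensor scale and zero-point."""
    x_min, x_max = x.min().item(), x.max().item()
    scale = (x_max - x_min) / (qmax - qmin)            # S
    zero_point = int(round(qmin - x_min / scale))      # Z, maps 0.0 into the int range
    x_q = torch.clamp(torch.round(x / scale + zero_point), qmin, qmax).to(torch.int8)
    return x_q, scale, zero_point


def dequantize(x_q: torch.Tensor, scale: float, zero_point: int) -> torch.Tensor:
    return scale * (x_q.to(torch.float32) - zero_point)


x = torch.randn(4, 4)
x_q, S, Z = affine_quantize(x)
print((x - dequantize(x_q, S, Z)).abs().max())  # small quantization error
```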
3. Types of Quantization & Calibration
To quantize a model, we need to quantize not only the tensors holding the weights but also the tensors holding the activations; otherwise the computation would mix data types. There are different quantization approaches, each with its own trade-offs in terms of performance and implementation complexity. In all cases, we need to determine the range of values taken by the weights and activations; this is known as calibration. Calibrating weights is straightforward because they are static, but calibrating activations is more challenging because they are data dependent (a short PyTorch example of dynamic quantization follows the list):
- Dynamic Quantization
  - How it works: Weights are quantized when loading the model. Activations are quantized dynamically just before compute operations and de-quantized back to high precision afterwards.
  - Pros: Easiest to implement and reduces the model's weight memory footprint.
  - Cons: Activations are still stored and passed between layers in high-precision format, limiting memory bandwidth and throughput benefits.
- Static Quantization
  - How it works: Both weights and activations are quantized. Static quantization requires a calibration step to determine the ranges for activation tensors. This is done by running a representative dataset through the high-precision model to collect activation statistics and compute the activation ranges. These statistics are then used to calculate the (S, Z) parameters ahead of run-time.
  - Pros: Better throughput than dynamic quantization, as both weights and activations are quantized.
  - Cons: Requires collecting a calibration dataset and performing an offline calibration step.
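For reference, here is a minimal sketch of dynamic quantization using PyTorch's built-in utilities. The exact API surface depends on the PyTorch version and backend, so treat this as an illustrative starting point and consult the quantization documentation listed below.

```python
import torch

model = torch.nn.Sequential(
    torch.nn.Linear(128, 256),
    torch.nn.ReLU(),
    torch.nn.Linear(256, 10),
)

# Weights of Linear layers are converted to int8; activations are quantized
# on the fly at inference time and de-quantized afterwards.
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)
print(quantized_model)
```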
4. Limitations and Considerations
- Performance Degradation: Quantization, especially to very low-precision datatypes, can lead to some loss of performance. Careful evaluation is needed for performance-sensitive applications.
- Hardware and Operator Support: Quantization support is not universal. It depends on the target hardware and the deep learning framework. Not all operators might have efficient quantized implementations on all platforms. Make sure to verify your framework and hardware documentation for compatibility.
References & Useful Links
- Jacob, Benoit, et al. "Quantization and training of neural networks for efficient integer-arithmetic-only inference", CVPR 2018
- PyTorch Quantization Documentation
KV Cache
With autoregressive models like decoder-only LLMs (i.e., GPTs), inference is performed by predicting one token at a time, using past token generations as inputs for future ones. To predict future tokens, certain computed representations of these past tokens are required at every future token prediction step, and recomputing these representations from scratch at each generation step is computationally inefficient.
To formalize this, let \(x_1, x_2, \ldots, x_{t-1}\) represent the input sequence of \(h\)-dimensional embeddings, i.e., \(x_i \in \mathbb{R}^{1\times h}\). For simplicity, let's consider a single Attention head and a single Transformer block. In order to get the logits for the next token, the LLM must compute the contextualized vector \(c_{t-1}\) given by:
$$ \begin{aligned} c_{t-1} &= f_{attn}(x_1, x_2, \ldots, x_{t-1}; W_k, W_v, W_q), \end{aligned} $$
where \(f_{attn}(\cdot)\) is the attention operator that produces a contextualized vector using all of the input embedding vectors, and \(W_k\), \(W_v\) and \(W_q\) are the \(h\times h\) projection matrices for keys, values, and queries, respectively. (Note that the Attention module computes the contextualized vectors of all input embeddings simultaneously, employing causal masking to ensure that each token only attends to itself and previous tokens in the sequence.)
Recall that with the attention operator, we first need to compute the various keys and values representations of the input embeddings as well as the query representation of \(x_{t-1}\):
$$ K_{t-1} = \begin{bmatrix} x_1 W_k \\ x_2 W_k \\ \vdots \\ x_{t-1} W_k \end{bmatrix}, \quad V_{t-1} = \begin{bmatrix} x_1 W_v \\ x_2 W_v \\ \vdots \\ x_{t-1} W_v \end{bmatrix}, \quad \text{and} \quad q_{t-1} = x_{t-1}W_q. $$
Using scaled-dot attention, we combine the keys with the query to derive an attention weights vector via:
$$ a_{t-1} = \text{Softmax}(q_{t-1} K_{t-1}^T / \sqrt{h}). $$
Finally, the contextualized vector of the \((t-1)\)-th token is the attention-weighted combination of the values vectors:
$$ c_{t-1} = a_{t-1} V_{t-1}. $$
The LLM ultimately makes use of this contextualized vector to build the logits for the \(t\)-th token prediction. Let's suppose that \(x_t\) is generated from these logits.
With \(x_t\) generated, we aim to predict the next token. To do so, we now need to build the contextualized vector, \(c_t\):
$$ \begin{aligned} c_t &= f_{attn}(x_1, x_2, \ldots, x_{t-1}, x_t; W_k, W_v, W_q), \end{aligned} $$
As before, we understand that in order to apply this operator, the following keys, values and query are required:
$$ K_{t} = \begin{bmatrix} x_1 W_k \\ x_2 W_k \\ \vdots \\ x_{t-1} W_k \\ x_t W_k \end{bmatrix}, \quad V_{t} = \begin{bmatrix} x_1 W_v \\ x_2 W_v \\ \vdots \\ x_{t-1} W_v \\ x_t W_v \end{bmatrix}, \quad \text{and} \quad q_t = x_{t}W_q. $$
It immediately follows though that
$$ K_{t} = \begin{bmatrix} K_{t-1} \\ x_t W_k \end{bmatrix} \quad \text{and} \quad V_{t} = \begin{bmatrix} V_{t-1} \\ x_t W_v \end{bmatrix}. $$
In other words, the keys and values required to build \(c_t\) consist of all the previous keys and values needed for \(c_{t-1}\) plus only the new key and value derived from the latest input embedding token \(x_t\).
This insight presents an opportunity to significantly reduce computational overhead during generation by caching and reusing past keys and values rather than recomputing them.
This is exactly the purpose of the KV Cache. At each iteration of inference, we compute the newest key and value emanating from the latest input embedding token and append them to the respective caches, one for keys and one for values. The algorithm below summarizes this procedure, and a short code sketch follows it.
Algorithm: KV Cache for Autoregressive Inference
Pre-fill Stage
Given input sequence \(x_1, x_2, \ldots, x_n\)
Initialize key cache \(K_n = [x_1W_k; x_2W_k; \ldots; x_nW_k]\)
Initialize value cache \(V_n = [x_1W_v; x_2W_v; \ldots; x_nW_v]\)
Decode Stage
Loop for each token generation step t > n:
\(\quad\)Compute new key and value: \(k_t = x_t W_k\), \(v_t = x_t W_v\)
\(\quad\)Update caches by appending new key and value:
\(\qquad\)\(K_t = [K_{t-1}; k_t]\)
\(\qquad\)\(V_t = [V_{t-1}; v_t]\)
\(\quad\)Compute attention using cached keys and values:
\(\qquad\)\(q_t = x_t W_q\)
\(\qquad\)\(c_t = \text{Softmax}(q_t K_t^T / \sqrt{h}) V_t\)
\(\quad\)Compute next token logits using \(c_t\)
\(\quad\)Generate \(x_{t+1}\) // (which becomes part of the next step's input)
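Below is a minimal single-head sketch of the decode stage in PyTorch. The function name and tensor shapes are illustrative; real implementations batch across sequences, heads, and layers, and typically pre-allocate cache memory.

```python
import torch


def decode_step(x_t, K_cache, V_cache, W_k, W_v, W_q):
    """One KV-cached decode step for a single attention head.
    x_t: (1, h) embedding of the latest token; caches have shape (t-1, h)."""
    h = x_t.shape[-1]
    # Compute only the newest key/value and append them to the caches.
    K_cache = torch.cat([K_cache, x_t @ W_k], dim=0)   # (t, h)
    V_cache = torch.cat([V_cache, x_t @ W_v], dim=0)   # (t, h)
    # Attention for the newest token uses all cached keys and values.
    q_t = x_t @ W_q                                    # (1, h)
    a_t = torch.softmax(q_t @ K_cache.T / h**0.5, dim=-1)
    c_t = a_t @ V_cache                                # (1, h) contextualized vector
    return c_t, K_cache, V_cache


# Example with illustrative sizes: h = 8, a prefix of 4 tokens already cached.
h = 8
W_k, W_v, W_q = (torch.randn(h, h) for _ in range(3))
prefix = torch.randn(4, h)
K_cache, V_cache = prefix @ W_k, prefix @ W_v          # pre-fill stage
c_t, K_cache, V_cache = decode_step(torch.randn(1, h), K_cache, V_cache, W_k, W_v, W_q)
```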
Limitations
Note that LLMs use Multi-Head Attention modules across several Transformer layers. Each attention head must maintain its own KV Cache, so the memory requirements can be quite substantial. As one example, Liu et al. (2024) note that for the 540B PaLM model, using a batch size of 512 and a context length of 2048, the KV Cache can take upwards of 3TB of memory, more than the memory required to hold the model's weight parameters.
This memory bottleneck becomes especially pronounced when serving multiple requests simultaneously or working with very long context windows.
References & Useful Links
- Liu, Zirui, et al. "Kivi: A tuning-free asymmetric 2bit quantization for kv cache." arXiv preprint arXiv:2402.02750 (2024).
- Raschka, Sebastian. Build a Large Language Model (From Scratch). Simon and Schuster, 2024.
- Rajan, R. "KV Cache - Understanding the Mechanism behind it." R4J4N Blogs, r4j4n.github.io/blogs/posts/kv/. Accessed 27 Feb. 2025.
Notable Models
In this section, we provide Pocket References for a few of the most notable and important LLMs to date. Examples include DeepSeek's R1 and V3 models as well as models from Meta, Google, Alibaba, etc.
If you'd like to see another model covered in this section, please submit a New Pocket Reference Issue to our GitHub project.
DeepSeek-R1
The DeepSeek-R1 model was introduced by DeepSeek in January of 2025. It is derived from an earlier checkpoint of DeepSeek-V3. In particular, starting with DeepSeek-V3-base, four stages of fine-tuning were performed in order to arrive at the checkpoint known as DeepSeek-R1: (i) Reasoning Cold-Start (using SFT), (ii) RL for Reasoning (using GRPO), (iii) SFT for Enhanced Reasoning & General Capabilities (using RL-generated reasoning data sampled with Rejection Sampling), and (iv) RL for Alignment (to human preferences).
Figure: Illustrating DeepSeek-R1 model evolution.
As illustrated in the Figure above, the model lineage of DeepSeek-R1 includes a full-scale RL-for-reasoning stage that leverages cold-start data. In contrast, DeepSeek-R1-Zero does not use any cold-start SFT data whatsoever and relies purely on RL to acquire its reasoning capabilities. The reward signal guiding the RL process of DeepSeek-R1-Zero is rules-based, computed from the response's correctness as well as its adherence to the desired format. While DeepSeek-R1-Zero demonstrated remarkable reasoning capabilities, it suffered greatly from poor readability and language mixing.
This motivated the use of cold-start data in the RL for Reasoning stage of DeepSeek-R1's training. Additionally, a reward signal to reduce language mixing as well as a model-based reward (using DeepSeek-V3 as a judge) were incorporated.
Historical Significance
At the time of its release, LLM reasoning models such as OpenAI's o-series had demonstrated remarkable performance on complex tasks, including those requiring multiple steps (e.g., OpenAI o3's breakthrough score on ARC-AGI). However, OpenAI, operating under a closed-source model, provided few details on how these models were developed, merely mentioning that Reinforcement Learning techniques were used to train the LLMs to produce long (internal) chain-of-thought style reasoning prior to providing a final answer.
In contrast, DeepSeek open-sourced DeepSeek-R1 and provided a very detailed technical report, shedding much light on its training pipeline, which included an RL approach for the model to acquire its reasoning capabilities. It was also reported that DeepSeek-R1 was trained on NVIDIA H800s, a less capable GPU than the NVIDIA H100 or A100.
DeepSeek-V3 is trained on a cluster equipped with 2048 NVIDIA H800 GPUs.
(quoted from the DeepSeek-V3 Technical Report)
The fact that DeepSeek-R1's performance rivaled that of its closed-source counterpart, OpenAI o1, on multiple benchmarks (using reportedly less compute) led to a frenzy in the LLM and broader AI community. As an example, many teams (including at least one from HuggingFace) worked tirelessly to produce their own versions of DeepSeek-R1 in the days after its release.
Architectural Highlights
See DeepSeek-V3.
Training Data
The training data used for the four stages are described below:
- Reasoning Cold Start: thousands of samples of long CoT passages from multiple domains, verified by human annotators, were used.
- RL for Reasoning: self-exploration, using increased test-time compute for RL discovery until convergence (the result is referred to as the RL checkpoint).
- SFT for Enhanced Reasoning & General Capabilities: the RL checkpoint was then used to generate 600K reasoning-related samples (using rejection sampling). DeepSeek-V3 was used to create 200K non-reasoning samples, omitting the CoT portion for simple queries.
- RL for Alignment: a combination of reward signals and diverse data distributions, including preference pairs and analyses of generated summaries and responses.
Key Results
Below are three key results of DeepSeek-R1 and its development:
Benchmark (Metric) | Claude-3.5-Sonnet-1022 | GPT-4 0513 | DeepSeek-V3 | OpenAI o1-mini | OpenAI o1-1217 | DeepSeek-R1 |
---|---|---|---|---|---|---|
MMLU (Pass@1) | 88.3 | 87.2 | 88.5 | 85.2 | 91.8 | 90.8 |
MMLU-Redux (EM) | 88.9 | 88.0 | 89.1 | 86.7 | - | 92.9 |
MMLU-Pro (EM) | 78.0 | 72.6 | 75.9 | 80.3 | - | 84.0 |
DROP (3-shot F1) | 88.3 | 83.7 | 91.6 | 83.9 | 90.2 | 92.2 |
IF-Eval (Prompt Strict) | 86.5 | 84.3 | 86.1 | 84.8 | - | 83.3 |
GPQA Diamond (Pass@1) | 65.0 | 49.9 | 59.1 | 60.0 | 75.7 | 71.5 |
SimpleQA (Correct) | 28.4 | 38.2 | 24.9 | 7.0 | 47.0 | 30.1 |
FRAMES (Acc.) | 72.5 | 80.5 | 73.3 | 76.9 | - | 82.5 |
AlpacaEval2.0 (LC-winrate) | 52.0 | 51.1 | 70.0 | 57.8 | - | 87.6 |
ArenaHard (GPT-4-1106) | 85.2 | 80.4 | 85.5 | 92.0 | - | 92.3 |
LiveCodeBench (Pass@1-COT) | 38.9 | 32.9 | 36.2 | 53.8 | 63.4 | 65.9 |
Codeforces (Percentile) | 20.3 | 23.6 | 58.7 | 93.4 | 96.6 | 96.3 |
Codeforces (Rating) | 717 | 759 | 1134 | 1820 | 2061 | 2029 |
SWE Verified (Resolved) | 50.8 | 38.8 | 42.0 | 41.6 | 48.9 | 49.2 |
Aider-Polyglot (Acc.) | 45.3 | 16.0 | 49.6 | 32.9 | 61.7 | 53.3 |
AIME 2024 (Pass@1) | 16.0 | 9.3 | 39.2 | 63.6 | 79.2 | 79.8 |
MATH-500 (Pass@1) | 78.3 | 74.6 | 90.2 | 90.0 | 96.4 | 97.3 |
CNMO 2024 (Pass@1) | 13.1 | 10.8 | 43.2 | 67.6 | - | 78.8 |
CLUEWSC (EM) | 85.4 | 87.9 | 90.9 | 89.9 | - | 92.8 |
C-Eval (EM) | 76.7 | 76.0 | 86.5 | 68.9 | - | 91.8 |
C-SimpleQA (Correct) | 55.4 | 58.7 | 68.0 | 40.3 | - | 63.7 |
Table: Comparison between DeepSeek-R1 and other representative models. (Copied from Table 4 of Guo, Daya, et al (2025).)
- Performance on Benchmarks: The table above, which was copied from the DeepSeek-R1 paper, compares the performance of DeepSeek-R1 and DeepSeek-V3 with representative models from Anthropic and OpenAI. The values reported clearly demonstrate the impressive performance of DeepSeek-R1 across various benchmarks and tasks. Most notably, DeepSeek-R1 was able to surpass OpenAI's reasoning model o1-1217 on several benchmarks.
- Distilling Reasoning Capabilities: The 800K samples that included examples generated by both DeepSeek-R1 (reasoning) and DeepSeek-V3 (non-reasoning) were used to distill other open-source models like Qwen and Llama. With only the application of SFT (i.e., no RL), some of these distilled models were not only able to outperform OpenAI's non-reasoning model GPT-4o-0513 across all benchmarks tested, but also OpenAI's o1-mini model on most benchmarks.
- RL's Potential: Pure RL empowered DeepSeek-R1-Zero to autonomously acquire robust reasoning capabilities without any SFT data. What's more, as test-time computation was increased, desirable behaviours such as reflection and re-evaluation of past trajectories emerged, making it possible for the model to have "aha moments" when solving complex tasks. This development should serve as a reminder of the great potential of RL and its place in the broader endeavour to reach new heights in AI.
Limitations
DeepSeek reported various limitations for DeepSeek-R1. Most notably, DeepSeek-R1 is inferior to DeepSeek-V3 in general capabilities such as function calling, producing structured outputs (JSON), role-playing, and multi-turn conversations. Additionally, because it is optimized for English and Chinese, the model sometimes suffers from language mixing. Lastly, DeepSeek-R1 reportedly demonstrated high sensitivity to prompts and long inference times, making it less suitable for low-latency applications such as software-engineering tasks.
References & Useful Links
- Guo, Daya, et al. "Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning." arXiv preprint arXiv:2501.12948 (2025).
- Liu, Aixin, et al. "Deepseek-v3 technical report." arXiv preprint arXiv:2412.19437 (2024).
- China's DeepSeek sets off Nvidia investor panic over US export controls (appearing in fortune.com)
- Open-R1: a fully open reproduction of DeepSeek-R1 (by HuggingFace)
- DeepSeek-R1 is available on HuggingFace
DeepSeek-V3
The DeepSeek-V3 model was introduced by DeepSeek in December of 2024. It is an LLM that leverages a Mixture of Experts (MoE) architecture in its design.
The training pipeline for DeepSeek-V3 consists of the two typical stages: pre-training and post-training. As depicted in the Figure above, the pre-training stage involves pre-training on 14.8T tokens followed by long-context extension using the YaRN methodology. Post-training of DeepSeek-V3 utilizes SFT as well as Reinforcement Learning methods.
Historical Significance
At the time of its release, open-source models had already been lessening the gap in performance with closed-source counterparts. DeepSeek-V3 was yet another open-source model that achieved high levels of performance, beating other open-source alternatives as well as some closed-source models in various benchmarks. What made DeepSeek-V3's achievement even more intriguing was that it was reportedly trained using less compute than its closest counterparts.
Architectural Highlights
DeepSeek-V3 is a transformer-based model that swaps out nearly all dense feedforward for MoE. The model has a total of 671B parameters but through its specialized variant of MoE (referred to as DeepSeekMoE), only 37B parameters are activated in both training and inference. Through a series of long-context extension fine-tuning steps, the maximum context length for this model was extended to 128K tokens.
DeepSeekMoE: Used to carry out training more efficiently, this MoE design consists of two sets of experts, namely shared and routed experts. The former set is used for every token in the input sequence, whereas usage of the routed experts is determined according to their affinity to the input token.
Auxiliary-Loss-Free Load Balancing: When using an MoE architecture, one must consider load balancing across the experts to prevent routing collapse. This has typically been addressed via the introduction of an auxiliary loss. However, if this loss has too great an influence, it can lead to model degradation. DeepSeek-V3 instead uses a technique that requires no auxiliary loss, relying on a new bias term that dynamically changes its value according to each expert's current workload.
Multi-Head Latent Attention (MLA): Used to make inference more efficient by jointly compressing attention keys and values to a lower dimension. The compression involves one linear projection matrix that compresses keys and values down and another that projects the compressed representation back up. Only the compressed joint representation of keys and values needs to be cached during inference. For more details, see MLA.
Multi-Token Prediction: In an effort to improve the training signal, DeepSeek-V3 expands the prediction scope to additional future tokens at every token position of the sequence. In other words, instead of predicting only the next immediate token and training the model on this signal, \(D\) future tokens are predicted. These tokens are predicted sequentially by \(D\) sequential multi-token prediction modules in order to maintain the causal chain. For more details, see MTP.
Parameter | Value |
---|---|
Total parameters | 671B |
Activated parameters | 37B |
Maximum context length | 128K tokens |
Number of Transformer layers | 61 |
Hidden dimension size | 7168 |
Number of attention heads | 128 |
Number of experts (MoE) | 1 (shared) & 256 (routed) |
Hidden dimension of experts | 2048 |
KV compression dimension size (MLA) | 512 |
Multi-token depth (MTP) | 1 |
Training Data
The pre-training corpus is a revised version of the one used to train an earlier version of the model, DeepSeek-V2. In this revision, more samples pertaining to mathematics and programming were included. Ultimately, the dataset comprised 14.8T tokens.
Compute Details
DeepSeek-V3 was trained on a cluster with 2048 NVIDIA H800 GPUs. Each node within the cluster consists of 8 H800 GPUs inter-connected via NVLink and NVSwitch. In total, it was reported that only 2.664M H800 GPU hours were used for pre-training while subsequent training stages required only 0.1M GPU hours. One of the main reasons for this training efficiency was their application of an FP8 mixed precision training framework.
Key Results
Benchmark (Metric) | # Shots | DeepSeek-V2 Base | Qwen2.5 72B Base | LLaMA-3.1 405B Base | DeepSeek-V3 Base |
---|---|---|---|---|---|
Pile-test (BPB) | - | 0.606 | 0.638 | 0.542 | 0.548 |
BBH (EM) | 3-shot | 78.8 | 79.8 | 82.9 | 87.5 |
MMLU (EM) | 5-shot | 78.4 | 85.0 | 84.4 | 87.1 |
MMLU-Redux (EM) | 5-shot | 75.6 | 83.2 | 81.3 | 86.2 |
MMLU-Pro (EM) | 5-shot | 51.4 | 58.3 | 52.8 | 64.4 |
DROP (F1) | 3-shot | 80.4 | 80.6 | 86.0 | 89.0 |
ARC-Easy (EM) | 25-shot | 97.6 | 98.4 | 98.4 | 98.9 |
ARC-Challenge (EM) | 25-shot | 92.2 | 94.5 | 95.3 | 95.3 |
HellaSwag (EM) | 10-shot | 87.1 | 84.8 | 89.2 | 88.9 |
PIQA (EM) | 0-shot | 83.9 | 82.1 | 85.9 | 84.7 |
WinoGrande (EM) | 5-shot | 86.3 | 82.3 | 85.2 | 84.9 |
RACE-Middle (EM) | 3-shot | 73.1 | 68.1 | 74.2 | 74.9 |
RACE-High (EM) | 5-shot | 52.6 | 50.3 | 56.8 | 51.3 |
TriviaQA (EM) | 5-shot | 80.0 | 71.9 | 82.7 | 82.9 |
NaturalQuestions (EM) | 5-shot | 38.6 | 33.2 | 41.5 | 40.0 |
AGIEval (EM) | 0-shot | 57.5 | 75.8 | 60.6 | 79.6 |
HumanEval (Pass@1) | 0-shot | 43.3 | 53.0 | 54.9 | 65.2 |
MBPP (Pass@1) | 3-shot | 65.0 | 72.6 | 68.4 | 75.4 |
LiveCodeBench-Base (Pass@1) | 3-shot | 11.6 | 12.9 | 15.1 | 19.4 |
CRUXEval-I (EM) | 2-shot | 52.5 | 59.1 | 58.5 | 67.3 |
CRUXEval-O (EM) | 2-shot | 49.8 | 59.9 | 59.9 | 69.8 |
GSM8K (EM) | 8-shot | 81.6 | 88.3 | 89.3 | 89.3 |
MATH (EM) | 4-shot | 43.4 | 54.4 | 49.0 | 61.6 |
MGSM (EM) | 8-shot | 63.6 | 76.2 | 69.9 | 79.8 |
CMath (EM) | 3-shot | 78.7 | 84.5 | 77.3 | 90.7 |
CLUEWSC (EM) | 5-shot | 82.0 | 82.5 | 83.0 | 82.7 |
C-Eval (EM) | 0-shot | 81.4 | 72.5 | 72.5 | 90.1 |
CMMLU (EM) | 5-shot | 84.0 | 89.5 | 73.7 | 88.8 |
CMRC (EM) | 1-shot | 77.4 | 75.8 | 76.0 | 76.3 |
C3 (EM) | 0-shot | 77.4 | 76.7 | 79.7 | 78.6 |
CCPM (EM) | 0-shot | 93.0 | 88.5 | 78.6 | 92.0 |
MMLU-non-English (EM) | 5-shot | 64.0 | 74.8 | 73.8 | 79.4 |
- Superior Open-Source Model: DeepSeek-V3 outperformed all other open-source models on educational benchmarks (MMLU, MMLU-Pro, GPQA), achieving performance levels that rival those of closed-source models such as GPT-4o and Claude-Sonnet-3.5. DeepSeek-V3 also achieved SOTA results on math-related benchmarks (GSM8K, MATH, MGSM, CMath).
- Efficient Training: DeepSeek-V3 was trained using only 2.664M H800 GPU hours, leveraging an FP8 mixed precision training framework. This marked, as reported by the authors, the first successful use of an FP8 scheme to train a large-scale model.
- Reasoning Distillation: As part of the post-training step, DeepSeek-V3's creators were able to distill reasoning capabilities via long CoT passages generated by DeepSeek-R1. The authors noted that this pipeline improved reasoning performance while still maintaining the ability to produce desired outputs and efficient response lengths.
Limitations
DeepSeek-V3 requires significant computational infrastructure to ensure efficient inference.
References & Useful Links
- Liu, Aixin, et al. "Deepseek-v3 technical report." arXiv preprint arXiv:2412.19437 (2024).
- DeepSeek sparks AI stock selloff; Nvidia posts record market-cap loss (appearing in reuters.com)