Retriever Fine-Tuning with LSR¶
RAG systems integrate three key components: a knowledge store, a retriever model, and a generator model. Effective fine-tuning approaches should enhance not just individual components, but the cohesive performance of the entire system.
This tutorial focuses on the LM-Supervised Retriever (LSR) fine-tuning method, which was first introduced in Shi, Weijia et al. (2023)1 and later generalized in Lin, Xi Victoria et al. (2023)2. With this method, the retriever is fine-tuned by leveraging supervision signals from the language model (i.e., the generator).
The LSR Method¶
The LSR method is applied over a training dataset of (query, response) pairs. For every pair, it computes two probability distributions: one derived from the retrieval scores of the retrieved knowledge nodes, and one derived from the LLM generator conditioned on the prompt and each knowledge node's context.
Mathematically, let \((q, r)\) represent the query and response, respectively, of a given training example. Moreover, let \(c_i\) represent the context of the \(i\)-th knowledge node retrieved by the RAG system for query \(q\), for \(i = 1, \ldots, k\).
Retrieval probabilities¶
The retrieval probability distribution is defined by applying the softmax function to the retrieval scores:

$$
p_R(c_i \mid q) = \frac{\exp\big(s(q, c_i)\big)}{\sum_{j=1}^{k} \exp\big(s(q, c_j)\big)}, \qquad i = 1, \ldots, k,
$$

where \(s(q, c_i)\) represents the similarity score between query, \(q\) and context, \(c_i\).
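As a concrete sketch (not FedRAG's internal implementation), the retrieval distribution over the top-\(k\) scores is a single softmax:

```python
import torch
import torch.nn.functional as F


def retrieval_probs(scores: torch.Tensor) -> torch.Tensor:
    """Map the similarity scores s(q, c_i) of the k retrieved contexts
    to the retrieval distribution p_R(c_i | q) via softmax."""
    return F.softmax(scores, dim=-1)


# Example: similarity scores for k = 3 retrieved knowledge nodes.
scores = torch.tensor([2.0, 1.0, 0.5])
p_R = retrieval_probs(scores)  # sums to 1; higher score -> higher probability
```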
Generation (LLM) probabilities¶
Similarly, the probabilities derived from the LLM involve another application of the softmax function, this time over the probabilities that the LLM generator produces the ground-truth response, \(r\), when given the input sequence \(q \circ c_i\) for \(i=1,\ldots,k\), where "\(\circ\)" is the concatenation operator. That is,

$$
p_{LM}(c_i \mid q, r) = \frac{\exp\big(p(r \mid q \circ c_i)/\tau\big)}{\sum_{j=1}^{k} \exp\big(p(r \mid q \circ c_j)/\tau\big)},
$$

where \(\tau\) is a temperature hyperparameter.
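Assuming the generator's per-context scores for the ground-truth response are already available as a tensor (the function name below is illustrative, not part of FedRAG's API), this distribution is a temperature-scaled softmax:

```python
import torch
import torch.nn.functional as F


def lm_probs(response_scores: torch.Tensor, tau: float = 1.0) -> torch.Tensor:
    """Map the generator's scores for producing the ground-truth response r
    under each input q ∘ c_i to the distribution p_LM(c_i | q, r).

    A lower tau sharpens the distribution; a higher tau flattens it.
    """
    return F.softmax(response_scores / tau, dim=-1)


# Example: the same scores under a sharp vs. a flat temperature.
scores = torch.tensor([0.2, 0.5, 0.1])
sharp = lm_probs(scores, tau=0.1)  # concentrates mass on the best context
flat = lm_probs(scores, tau=10.0)  # close to uniform
```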
LSR training objective¶
The LSR loss is defined as the Kullback-Leibler (KL) divergence between these two probability distributions:

$$
\mathcal{L}_{\text{LSR}} = \mathbb{E}_{(q,r)\in\mathcal{D}_{\text{train}}}\Big[\text{KL}\big(p_R(\cdot \mid q) \,\big\|\, p_{LM}(\cdot \mid q, r)\big)\Big],
$$

where \(\mathbb{E}_{(q,r)\in\mathcal{D}_{\text{train}}}(\cdot)\) is the expectation operator under the distribution defined by the (query, response) pairs from the training dataset \(\mathcal{D}_{\text{train}}\).
In minimizing the LSR loss, we adapt the retriever to assign higher scores to the knowledge nodes that increase the generator's likelihood of producing the ground-truth response, \(r\).
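Putting the two distributions together, a minimal batched version of this loss can be written in a few lines of PyTorch. This is a sketch of the underlying computation, not FedRAG's own loss class:

```python
import torch
import torch.nn.functional as F


def lsr_loss(retrieval_scores: torch.Tensor,
             lm_scores: torch.Tensor,
             tau: float = 1.0) -> torch.Tensor:
    """KL(p_R || p_LM) averaged over a batch.

    Both inputs have shape (batch_size, k): the retrieval scores s(q, c_i)
    and the generator's scores for the ground-truth response under each
    of the k contexts.
    """
    log_p_r = F.log_softmax(retrieval_scores, dim=-1)
    log_p_lm = F.log_softmax(lm_scores / tau, dim=-1)
    # KL(p_R || p_LM) = sum_i p_R(c_i) * (log p_R(c_i) - log p_LM(c_i)),
    # written explicitly so gradients flow into the retrieval scores.
    kl = (log_p_r.exp() * (log_p_r - log_p_lm)).sum(dim=-1)
    return kl.mean()


# When the two distributions agree, the loss is zero.
scores = torch.tensor([[1.0, 2.0, 3.0]])
loss = lsr_loss(scores, scores)  # -> tensor(0.)
```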
Notes on the FedRAG Implementation of LSR¶
In FedRAG, we implement the LSR method with the typical coordination of a data collator and a trainer. The DataCollatorForLSR takes a batch of (query, response) pairs and produces the PyTorch tensors of retrieval scores as well as LLM scores for each knowledge node retrieved by the RAGSystem. This batch is then passed to the TrainerForLSR, which computes the LSRLoss for the batch and performs the optimization step for the given training step.