Benchmarking RAG Systems (MMLU)¶
In this notebook, we demonstrate how one can benchmark a RAG system with the fed-rag library. Doing so involves the following steps:

- Build your RAGSystem to be benchmarked
- Create a Benchmarker object
- Choose your Benchmark and run it with the Benchmarker

In this notebook, we'll make use of the huggingface-evals extra, which allows us to utilize the benchmarks defined in the fed_rag.evals.benchmarks.huggingface module.
Install dependencies¶
# If running in Google Colab, the first attempt at installing fed-rag may fail;
# retrying the same command typically succeeds.
!pip install fed-rag[huggingface,huggingface-evals] -q
Build the RAG System¶
Knowledge Store and Retriever¶
from fed_rag.knowledge_stores.in_memory import InMemoryKnowledgeStore
from fed_rag.retrievers.huggingface.hf_sentence_transformer import (
    HFSentenceTransformerRetriever,
)

knowledge_store = InMemoryKnowledgeStore()
retriever = HFSentenceTransformerRetriever(
    query_model_name="nthakur/dragon-plus-query-encoder",
    context_model_name="nthakur/dragon-plus-context-encoder",
    load_model_at_init=False,
)
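A dual-encoder retriever such as DRAGON+ uses separate query and context encoders that map text into a shared vector space; retrieved contexts are then ranked by their similarity to the query. A minimal sketch of that ranking step with made-up embeddings (not fed-rag's actual retrieval code):

```python
# Toy illustration of dual-encoder retrieval: query and context embeddings
# live in the same vector space, and contexts are ranked by inner product
# with the query. The embeddings below are made up for illustration.
def top_k_contexts(query_emb, context_embs, k=2):
    """Return indices of the k contexts with the highest inner product."""
    scores = [
        sum(q * c for q, c in zip(query_emb, ctx)) for ctx in context_embs
    ]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]

context_embs = [
    [0.1, 0.9, 0.0, 0.0],
    [0.8, 0.1, 0.1, 0.0],
    [0.0, 0.0, 0.5, 0.5],
]
query_emb = [0.9, 0.0, 0.1, 0.0]
print(top_k_contexts(query_emb, context_embs, k=2))  # [1, 0]
```

In the real system, the `top_k` of the RAGConfig (set further below) plays the role of `k` here.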
Let's Add Some Knowledge¶
# a small sample from the Dec 2021 Wikipedia dump
text_chunks = [
    {
        "id": "140",
        "title": "History of marine biology",
        "section": "James Cook",
        "text": " James Cook is well known for his voyages of exploration for the British Navy in which he mapped out a significant amount of the world's uncharted waters. Cook's explorations took him around the world twice and led to countless descriptions of previously unknown plants and animals. Cook's explorations influenced many others and led to a number of scientists examining marine life more closely. Among those influenced was Charles Darwin who went on to make many contributions of his own. ",
    },
    {
        "id": "141",
        "title": "History of marine biology",
        "section": "Charles Darwin",
        "text": " Charles Darwin, best known for his theory of evolution, made many significant contributions to the early study of marine biology. He spent much of his time from 1831 to 1836 on the voyage of HMS Beagle collecting and studying specimens from a variety of marine organisms. It was also on this expedition where Darwin began to study coral reefs and their formation. He came up with the theory that the overall growth of corals is a balance between the growth of corals upward and the sinking of the sea floor. He then came up with the idea that wherever coral atolls would be found, the central island where the coral had started to grow would be gradually subsiding",
    },
    {
        "id": "142",
        "title": "History of marine biology",
        "section": "Charles Wyville Thomson",
        "text": " Another influential expedition was the voyage of HMS Challenger from 1872 to 1876, organized and later led by Charles Wyville Thomson. It was the first expedition purely devoted to marine science. The expedition collected and analyzed thousands of marine specimens, laying the foundation for present knowledge about life near the deep-sea floor. The findings from the expedition were a summary of the known natural, physical and chemical ocean science to that time.",
    },
]
from fed_rag.data_structures import KnowledgeNode, NodeType

# build the context strings; popping leaves only the remaining
# fields (here, "id") behind to serve as node metadata
texts = []
for c in text_chunks:
    text = c.pop("text")
    title = c.pop("title")
    section = c.pop("section")
    context_text = f"title: {title}\nsection: {section}\ntext: {text}"
    texts.append(context_text)

# batch encode
batch_embeddings = retriever.encode_context(texts)

# create knowledge nodes
nodes = []
for jx, c in enumerate(text_chunks):
    node = KnowledgeNode(
        embedding=batch_embeddings[jx].tolist(),
        node_type=NodeType.TEXT,
        text_content=texts[jx],
        metadata=c,
    )
    nodes.append(node)
# load nodes
knowledge_store.load_nodes(nodes)
knowledge_store.count
3
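Conceptually, the in-memory store simply holds nodes in process memory, and `count` reports how many have been loaded. A simplified stand-in (not the real InMemoryKnowledgeStore code) illustrating that behavior:

```python
# Simplified stand-in for an in-memory knowledge store (not the real
# InMemoryKnowledgeStore): nodes live in a process-local dict keyed by id,
# and `count` reports how many have been loaded.
class TinyKnowledgeStore:
    def __init__(self):
        self._nodes = {}

    def load_nodes(self, nodes):
        for node in nodes:
            self._nodes[node["id"]] = node

    @property
    def count(self):
        return len(self._nodes)

store = TinyKnowledgeStore()
store.load_nodes([{"id": "140"}, {"id": "141"}, {"id": "142"}])
print(store.count)  # 3
```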
Define an LLM Generator¶
from fed_rag.generators.huggingface import HFPretrainedModelGenerator
import torch
from transformers.generation.utils import GenerationConfig

generation_cfg = GenerationConfig(
    do_sample=True,
    eos_token_id=151643,
    bos_token_id=151643,
    max_new_tokens=2048,
    top_p=0.9,
    temperature=0.6,
    cache_implementation="offloaded",
    stop_strings="</response>",
)
generator = HFPretrainedModelGenerator(
    model_name="Qwen/Qwen2.5-0.5B",
    load_model_at_init=False,
    load_model_kwargs={"device_map": "auto", "torch_dtype": torch.float16},
    generation_config=generation_cfg,
)
Assemble the RAG System¶
from fed_rag import RAGSystem, RAGConfig

rag_config = RAGConfig(top_k=2)
rag_system = RAGSystem(
    knowledge_store=knowledge_store,  # knowledge store loaded with nodes above
    generator=generator,
    retriever=retriever,
    rag_config=rag_config,
)
# test a query
response = rag_system.query("Who is James Cook?")
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results. Setting `pad_token_id` to `eos_token_id`:151643 for open-end generation.
print(response)
You are a helpful assistant. Given the user's query, provide a succinct and accurate response. If context is provided, use it in your answer if it helps you to create the most accurate response. <query> Who is James Cook? </query> <context> title: History of marine biology section: James Cook text: James Cook is well known for his voyages of exploration for the British Navy in which he mapped out a significant amount of the world's uncharted waters. Cook's explorations took him around the world twice and led to countless descriptions of previously unknown plants and animals. Cook's explorations influenced many others and led to a number of scientists examining marine life more closely. Among those influenced was Charles Darwin who went on to make many contributions of his own. title: History of marine biology section: Charles Wyville Thomson text: Another influential expedition was the voyage of HMS Challenger from 1872 to 1876, organized and later led by Charles Wyville Thomson. It was the first expedition purely devoted to marine science. The expedition collected and analyzed thousands of marine specimens, laying the foundation for present knowledge about life near the deep-sea floor. The findings from the expedition were a summary of the known natural, physical and chemical ocean science to that time. </context> <response> Assistant: James Cook was a renowned British naval explorer who made significant contributions to marine biology. His voyages of exploration led to numerous descriptions of previously unknown plants and animals. His explorations influenced many scientists, including Charles Darwin, who made many contributions to marine biology.
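As the printed output shows, the system interleaves the query with the retrieved context before generation. A hypothetical sketch of that assembly step — the tag structure below is inferred from the output above, not taken from fed-rag's source:

```python
# Hypothetical sketch of RAG prompt assembly; the tag structure is inferred
# from the printed output above, not taken from fed-rag's implementation.
def assemble_prompt(query: str, contexts: list) -> str:
    """Wrap the query and the joined context chunks in query/context tags."""
    context_block = "\n".join(contexts)
    return (
        "You are a helpful assistant. Given the user's query, provide a "
        "succinct and accurate response. If context is provided, use it in "
        "your answer if it helps you to create the most accurate response.\n"
        f"<query>\n{query}\n</query>\n"
        f"<context>\n{context_block}\n</context>\n"
        "<response>\n"
    )

prompt = assemble_prompt("Who is James Cook?", ["title: History of marine biology"])
print("<query>" in prompt and "<context>" in prompt)  # True
```

Note that the generator's `stop_strings="</response>"` setting above pairs with this template: generation halts once the model closes the response tag.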
Create a Benchmarker¶
from fed_rag.evals import Benchmarker
benchmarker = Benchmarker(rag_system=rag_system)
Get the desired Benchmark (MMLU)¶
For this notebook, we'll use a HuggingFace benchmark, namely MMLU. The recommended pattern for loading benchmarks from fed_rag is illustrated in the cells below.
import fed_rag.evals.benchmarks as benchmarks
# define the mmlu benchmark
mmlu = benchmarks.HuggingFaceMMLU(streaming=True)
In the above, we set streaming to True since the underlying dataset is quite large. By doing so, we get a stream of ~fed_rag.data_structures.BenchmarkExample objects that we can process one at a time.
example_stream = mmlu.as_stream()
next(example_stream)
BenchmarkExample(query='Find the degree for the given field extension Q(sqrt(2), sqrt(3), sqrt(18)) over Q.\n\nA: 0\nB: 4\nC: 2\nD: 6', response='B', context=None)
example_stream.close() # close the stream
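Streaming means examples are produced lazily rather than materialized up front, so pulling the first few examples never touches the rest of the 14k+ example dataset. The same lazy pattern with a plain Python generator (a stand-in, not fed-rag code) and itertools.islice:

```python
from itertools import islice

# Plain-Python stand-in for the benchmark stream (not fed-rag code):
# examples are produced lazily, so taking the first three never
# materializes the remaining ~14k examples.
def example_stream():
    for i in range(14_042):
        yield {"query": f"question {i}", "response": "A"}

first_three = list(islice(example_stream(), 3))  # only 3 examples produced
print([ex["query"] for ex in first_three])
# ['question 0', 'question 1', 'question 2']
```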
Define our Evaluation Metric¶
In this notebook, we'll make use of the ExactMatchEvaluationMetric.
from fed_rag.evals.metrics import ExactMatchEvaluationMetric
metric = ExactMatchEvaluationMetric()
All BaseEvaluationMetric subclasses are directly callable (i.e., their special __call__ method is implemented). We can see the signature of this method by using the help builtin.
help(metric.__call__)
Help on method __call__ in module fed_rag.evals.metrics.exact_match: __call__(prediction: str, actual: str, *args: Any, **kwargs: Any) -> float method of fed_rag.evals.metrics.exact_match.ExactMatchEvaluationMetric instance Evaluate an example prediction against the actual response.
Exact match is case insensitive.
metric(prediction="A", actual="A") # scores 1
1.0
metric(prediction="A", actual="a") # also scores 1
1.0
metric(prediction="A", actual="b") # scores 0
0.0
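The scores above are consistent with a case-insensitive string comparison. A minimal sketch of such a metric — the real ExactMatchEvaluationMetric may differ in details such as whitespace handling:

```python
# Minimal sketch consistent with the scores above: a case-insensitive
# string comparison. The real ExactMatchEvaluationMetric may differ in
# details (e.g., whitespace handling).
def exact_match(prediction: str, actual: str) -> float:
    """Return 1.0 when the strings match ignoring case, else 0.0."""
    return float(prediction.lower() == actual.lower())

print(exact_match("A", "a"))  # 1.0
print(exact_match("A", "b"))  # 0.0
```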
Run the benchmark¶
result = benchmarker.run(
    benchmark=mmlu,
    metric=metric,
    is_streaming=True,
    num_examples=3,  # for quick testing, only run on 3 examples
    agg="avg",  # can be 'avg', 'sum', 'max', or 'min'
)
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results. Setting `pad_token_id` to `eos_token_id`:151643 for open-end generation.
print(result)
score=0.0 metric_name='ExactMatchEvaluationMetric' num_examples_used=3 num_total_examples=14042
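The reported score is the chosen aggregation applied over the per-example metric scores. A sketch of what each agg option computes, assuming the per-example scores have been collected into a list (not fed-rag's actual implementation):

```python
# Sketch of the supported aggregations over per-example scores, assuming
# the scores have been collected into a list (not fed-rag's actual code).
def aggregate(scores, agg="avg"):
    if agg == "avg":
        return sum(scores) / len(scores)
    if agg == "sum":
        return sum(scores)
    if agg == "max":
        return max(scores)
    if agg == "min":
        return min(scores)
    raise ValueError(f"unknown aggregation: {agg}")

print(aggregate([1.0, 0.0, 0.0], agg="avg"))  # 0.3333333333333333
```

With only 3 examples and the small Qwen2.5-0.5B generator, an average exact-match score of 0.0 as reported above is well within expectation; larger runs and stronger generators give more meaningful numbers.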