Benchmarking RAG Systems with FedRAG and Gemma 3n¶
In this notebook, we demonstrate how to benchmark a RAG system built with the fed-rag
library and Google’s Gemma 3n model on the MMLU dataset.
We will walk through the following steps:
- Build your RAGSystem to be benchmarked (with a retriever, knowledge store, and LLM generator)
- Create a Benchmarker object to manage evaluation runs
- Choose your benchmark (MMLU) and run it using the Benchmarker
In this notebook, we'll make use of the huggingface-evals extra, which allows us to use the benchmarks defined in the fed_rag.evals.benchmarks.huggingface module.
Requirements:
- A Hugging Face account with access to Gemma 3n, and your Hugging Face API token (see the login cell below for setup).
- (Colab) Set your runtime to a GPU instance with sufficient RAM (L4 or better recommended).
Install dependencies¶
# Install fed-rag with the huggingface and huggingface-evals extras.
!pip install fed-rag[huggingface,huggingface-evals]
# Update libraries needed for Gemma 3n.
!pip install --upgrade transformers timm
Build the RAG System¶
Knowledge Store and Retriever¶
from fed_rag.knowledge_stores.in_memory import InMemoryKnowledgeStore
from fed_rag.retrievers.huggingface.hf_sentence_transformer import (
    HFSentenceTransformerRetriever,
)
knowledge_store = InMemoryKnowledgeStore()
retriever = HFSentenceTransformerRetriever(
    query_model_name="nthakur/dragon-plus-query-encoder",
    context_model_name="nthakur/dragon-plus-context-encoder",
    load_model_at_init=False,
)
Let's Add Some Knowledge¶
# a small sample from the Dec 2021 Wikipedia dump
text_chunks = [
    {
        "id": "140",
        "title": "History of marine biology",
        "section": "James Cook",
        "text": " James Cook is well known for his voyages of exploration for the British Navy in which he mapped out a significant amount of the world's uncharted waters. Cook's explorations took him around the world twice and led to countless descriptions of previously unknown plants and animals. Cook's explorations influenced many others and led to a number of scientists examining marine life more closely. Among those influenced was Charles Darwin who went on to make many contributions of his own. ",
    },
    {
        "id": "141",
        "title": "History of marine biology",
        "section": "Charles Darwin",
        "text": " Charles Darwin, best known for his theory of evolution, made many significant contributions to the early study of marine biology. He spent much of his time from 1831 to 1836 on the voyage of HMS Beagle collecting and studying specimens from a variety of marine organisms. It was also on this expedition where Darwin began to study coral reefs and their formation. He came up with the theory that the overall growth of corals is a balance between the growth of corals upward and the sinking of the sea floor. He then came up with the idea that wherever coral atolls would be found, the central island where the coral had started to grow would be gradually subsiding",
    },
    {
        "id": "142",
        "title": "History of marine biology",
        "section": "Charles Wyville Thomson",
        "text": " Another influential expedition was the voyage of HMS Challenger from 1872 to 1876, organized and later led by Charles Wyville Thomson. It was the first expedition purely devoted to marine science. The expedition collected and analyzed thousands of marine specimens, laying the foundation for present knowledge about life near the deep-sea floor. The findings from the expedition were a summary of the known natural, physical and chemical ocean science to that time.",
    },
]
from fed_rag.data_structures import KnowledgeNode, NodeType
# build context strings from the chunks, then batch-encode and wrap them as knowledge nodes
nodes = []
texts = []
for c in text_chunks:
    text = c.pop("text")
    title = c.pop("title")
    section = c.pop("section")
    context_text = f"title: {title}\nsection: {section}\ntext: {text}"
    texts.append(context_text)

# batch encode the context strings
batch_embeddings = retriever.encode_context(texts)

# wrap each embedding and text in a KnowledgeNode (the remaining chunk fields, e.g. id, become metadata)
for jx, c in enumerate(text_chunks):
    node = KnowledgeNode(
        embedding=batch_embeddings[jx].tolist(),
        node_type=NodeType.TEXT,
        text_content=texts[jx],
        metadata=c,
    )
    nodes.append(node)

# load nodes into the knowledge store
knowledge_store.load_nodes(nodes)
print("Knowledge nodes loaded:", knowledge_store.count)
Define an LLM Generator¶
from huggingface_hub import notebook_login
# This will prompt you to login to huggingface
notebook_login()
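If you are not in a notebook environment (or prefer a non-interactive login), a minimal alternative is sketched below; HF_TOKEN is just a conventional environment variable name that you would set yourself beforehand.
import os

from huggingface_hub import login

# Non-interactive alternative to notebook_login(); assumes your Hugging Face
# token was exported as the HF_TOKEN environment variable.
hf_token = os.environ.get("HF_TOKEN")
if hf_token:
    login(token=hf_token)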
from fed_rag.generators.huggingface.hf_multimodal_model import (
    HFMultimodalModelGenerator,
)

generator = HFMultimodalModelGenerator(
    model_name="google/gemma-3n-e2b-it",  # can be any HF model name or path
    load_model_at_init=True,  # defaults to True, loads the model on init
)
If you are running on CPU or you see OOM errors, switch the device accordingly (an L4 GPU or better is recommended if available).
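Before assembling the system, it can help to confirm a GPU is actually visible. In the sketch below, generator.model is an assumption about how the generator exposes the underlying transformers model; verify the attribute name against your fed-rag version.
import torch

# Confirm PyTorch can see a GPU (recommended for Gemma 3n).
print("CUDA available:", torch.cuda.is_available())

# Assumption: the generator exposes the loaded transformers model as `model`.
print("Generator device:", generator.model.device)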
Assemble the RAG System¶
from fed_rag import RAGSystem, RAGConfig
rag_config = RAGConfig(top_k=2)
rag_system = RAGSystem(
    knowledge_store=knowledge_store,  # knowledge store built above
    generator=generator,
    retriever=retriever,
    rag_config=rag_config,
)
# test a query
response = rag_system.query("Who is James Cook?")
print(response)
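To see which knowledge chunks were retrieved for this answer, you can inspect the response object. The source_nodes attribute below is an assumption about the RAGResponse data structure; check fed_rag.data_structures for the exact field names.
# Assumption: the returned RAGResponse exposes the retrieved chunks as `source_nodes`.
for source in response.source_nodes:
    print(source)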
Perform Benchmarking¶
Create Benchmarker¶
from fed_rag.evals import Benchmarker
benchmarker = Benchmarker(rag_system=rag_system)
Get the desired Benchmark (MMLU)¶
For this notebook, we'll use a HuggingFace benchmark, namely MMLU. The recommended pattern for loading benchmarks from fed_rag is illustrated in the cells below.
import fed_rag.evals.benchmarks as benchmarks
# define the mmlu benchmark
mmlu = benchmarks.HuggingFaceMMLU(streaming=True)
In the above, we set streaming to True since the underlying dataset is quite large. By doing so, we get a stream of fed_rag.data_structures.BenchmarkExample objects that we can process one at a time.
example_stream = mmlu.as_stream()
next(example_stream)
example_stream.close() # close the stream
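If you want to peek at a few examples without pulling the whole dataset, you can slice a fresh stream with itertools; this sketch only relies on the as_stream() and close() calls shown above.
from itertools import islice

# Take the first two examples from a fresh stream, then close it.
stream = mmlu.as_stream()
for example in islice(stream, 2):
    print(example)
stream.close()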
Define our Evaluation Metric¶
In this notebook, we'll make use of the ExactMatchEvaluationMetric
.
from fed_rag.evals.metrics import ExactMatchEvaluationMetric
metric = ExactMatchEvaluationMetric()
All BaseEvaluationMetric subclasses are directly callable (i.e., their special __call__ method is implemented). We can see the signature of this method by using the help builtin.
help(metric.__call__)
Exact match is case insensitive.
metric(prediction="A", actual="A") # scores 1
1.0
metric(prediction="A", actual="a") # also scores 1
1.0
metric(prediction="A", actual="b") # scores 0
0.0
Multiple-Choice Prompt: Default and Customization¶
fed-rag sets a default prompt for MMLU so that LLMs like Gemma 3n answer with only a single letter (A, B, C, or D). This keeps the evaluation automatic and reliable.
mmlu.prompt_template = """
{question}
A. {A}
B. {B}
C. {C}
D. {D}
You MUST answer ONLY with a single letter (A, B, C, or D). Do NOT provide any explanation, reasoning, or extra text. Answers with any additional text are considered incorrect.
Example of a correct answer:
B
Example of an incorrect answer:
B. Because...
Answer:
"""
Customizing the prompt:
If you want to experiment with different prompting strategies, just assign your own prompt to mmlu.prompt_template, as shown above, to override the default.
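For example, here is a hypothetical alternative template (shown but not applied, so the run below still uses the template set above); it reuses the same {question} and {A}-{D} placeholders.
# Hypothetical alternative template; uncomment the last line to apply it.
custom_template = """Answer the following multiple-choice question.

{question}

A. {A}
B. {B}
C. {C}
D. {D}

Reply with exactly one letter (A, B, C, or D) and nothing else.
Answer:"""

# mmlu.prompt_template = custom_template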
Run the benchmark!¶
result = benchmarker.run(
    benchmark=mmlu,
    metric=metric,
    is_streaming=True,
    num_examples=3,  # for quick testing, only run on 3 examples
    agg="avg",  # can be 'avg', 'sum', 'max', 'min'
    save_evaluations=True,  # needs fed-rag v0.0.23 or above
)
print(result)
Load evaluated examples¶
from fed_rag.evals.utils import load_evaluations
evaluations = load_evaluations(result.evaluations_file)
print("Score for first example:", evaluations[0].score)
print("Model output for second example:", evaluations[1].rag_response.response)
print("Prompt template used:\n", mmlu.prompt_template)