Benchmarking RAG Systems with FedRAG and Gemma 3n¶
In this notebook, we demonstrate how to benchmark a RAG system built with the fed-rag
library and Google’s Gemma 3n model on the MMLU dataset.
We will walk through the following steps:
- Build your RAGSystem to be benchmarked (with a retriever, knowledge store, and LLM generator)
- Create a Benchmarker object to manage evaluation runs
- Choose your benchmark (MMLU) and run it using the Benchmarker
In this notebook, we'll make use of the huggingface-evals extra, which allows us to use the benchmarks defined in the fed_rag.evals.benchmarks.huggingface module.
Requirements:
- A Hugging Face account with access to Gemma 3n, and your Hugging Face API token (see the login cell below for setup).
- (Colab) Set your runtime to a GPU instance with sufficient RAM (L4 or better recommended).
Install dependencies¶
# Install fed-rag with the huggingface and huggingface-evals extras.
!pip install fed-rag[huggingface,huggingface-evals]
# Update libraries needed for Gemma 3n.
!pip install --upgrade transformers timm
Build the RAG System¶
Knowledge Store and Retriever¶
from fed_rag.knowledge_stores.in_memory import InMemoryKnowledgeStore
from fed_rag.retrievers.huggingface.hf_sentence_transformer import (
    HFSentenceTransformerRetriever,
)
knowledge_store = InMemoryKnowledgeStore()
retriever = HFSentenceTransformerRetriever(
    query_model_name="nthakur/dragon-plus-query-encoder",
    context_model_name="nthakur/dragon-plus-context-encoder",
    load_model_at_init=False,
)
Let's Add Some Knowledge¶
# a small sample from the Dec 2021 Wikipedia dump
text_chunks = [
    {
        "id": "140",
        "title": "History of marine biology",
        "section": "James Cook",
        "text": " James Cook is well known for his voyages of exploration for the British Navy in which he mapped out a significant amount of the world's uncharted waters. Cook's explorations took him around the world twice and led to countless descriptions of previously unknown plants and animals. Cook's explorations influenced many others and led to a number of scientists examining marine life more closely. Among those influenced was Charles Darwin who went on to make many contributions of his own. ",
    },
    {
        "id": "141",
        "title": "History of marine biology",
        "section": "Charles Darwin",
        "text": " Charles Darwin, best known for his theory of evolution, made many significant contributions to the early study of marine biology. He spent much of his time from 1831 to 1836 on the voyage of HMS Beagle collecting and studying specimens from a variety of marine organisms. It was also on this expedition where Darwin began to study coral reefs and their formation. He came up with the theory that the overall growth of corals is a balance between the growth of corals upward and the sinking of the sea floor. He then came up with the idea that wherever coral atolls would be found, the central island where the coral had started to grow would be gradually subsiding",
    },
    {
        "id": "142",
        "title": "History of marine biology",
        "section": "Charles Wyville Thomson",
        "text": " Another influential expedition was the voyage of HMS Challenger from 1872 to 1876, organized and later led by Charles Wyville Thomson. It was the first expedition purely devoted to marine science. The expedition collected and analyzed thousands of marine specimens, laying the foundation for present knowledge about life near the deep-sea floor. The findings from the expedition were a summary of the known natural, physical and chemical ocean science to that time.",
    },
]
from fed_rag.data_structures import KnowledgeNode, NodeType
# build context strings from the chunks, then batch-encode and wrap them as knowledge nodes
nodes = []
texts = []
for c in text_chunks:
    text = c.pop("text")
    title = c.pop("title")
    section = c.pop("section")
    context_text = f"title: {title}\nsection: {section}\ntext: {text}"
    texts.append(context_text)

# batch encode the context strings
batch_embeddings = retriever.encode_context(texts)

# wrap each embedding and text in a KnowledgeNode (the remaining chunk fields, e.g. id, become metadata)
for jx, c in enumerate(text_chunks):
    node = KnowledgeNode(
        embedding=batch_embeddings[jx].tolist(),
        node_type=NodeType.TEXT,
        text_content=texts[jx],
        metadata=c,
    )
    nodes.append(node)

# load nodes into the knowledge store
knowledge_store.load_nodes(nodes)
print("Knowledge nodes loaded:", knowledge_store.count)
Define an LLM Generator¶
from huggingface_hub import notebook_login
# This will prompt you to login to huggingface
notebook_login()
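If you are not in a notebook environment (or prefer a non-interactive login), a minimal alternative is sketched below; HF_TOKEN is just a conventional environment variable name that you would set yourself beforehand.
import os

from huggingface_hub import login

# Non-interactive alternative to notebook_login(); assumes your Hugging Face
# token was exported as the HF_TOKEN environment variable.
hf_token = os.environ.get("HF_TOKEN")
if hf_token:
    login(token=hf_token)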
from fed_rag.generators.huggingface.hf_multimodal_model import (
    HFMultimodalModelGenerator,
)

generator = HFMultimodalModelGenerator(
    model_name="google/gemma-3n-e2b-it",  # can be any HF model name or path
    load_model_at_init=True,  # defaults to True, loads the model on init
)
If you are running on CPU or you see OOM errors, switch the device accordingly (an L4 GPU or better is recommended if available).
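Before assembling the system, it can help to confirm a GPU is actually visible. In the sketch below, generator.model is an assumption about how the generator exposes the underlying transformers model; verify the attribute name against your fed-rag version.
import torch

# Confirm PyTorch can see a GPU (recommended for Gemma 3n).
print("CUDA available:", torch.cuda.is_available())

# Assumption: the generator exposes the loaded transformers model as `model`.
print("Generator device:", generator.model.device)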
Assemble the RAG System¶
from fed_rag import RAGSystem, RAGConfig
rag_config = RAGConfig(top_k=2)
rag_system = RAGSystem(
    knowledge_store=knowledge_store,  # knowledge store built above
    generator=generator,
    retriever=retriever,
    rag_config=rag_config,
)
# test a query
response = rag_system.query("Who is James Cook?")
print(response)
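To see which knowledge chunks were retrieved for this answer, you can inspect the response object. The source_nodes attribute below is an assumption about the RAGResponse data structure; check fed_rag.data_structures for the exact field names.
# Assumption: the returned RAGResponse exposes the retrieved chunks as `source_nodes`.
for source in response.source_nodes:
    print(source)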
Perform Benchmarking¶
Create Benchmarker¶
from fed_rag.evals import Benchmarker
benchmarker = Benchmarker(rag_system=rag_system)
Get the desired Benchmark (MMLU)¶
For this notebook, we'll use a HuggingFace benchmark, namely MMLU. The recommended pattern for loading benchmarks from fed_rag is illustrated in the cells below.
import fed_rag.evals.benchmarks as benchmarks
# define the mmlu benchmark
mmlu = benchmarks.HuggingFaceMMLU(streaming=True)
In the above, we set streaming to True since the underlying dataset is quite large. By doing so, we get a stream of fed_rag.data_structures.BenchmarkExample objects that we can process one at a time.
example_stream = mmlu.as_stream()
next(example_stream)
example_stream.close() # close the stream
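If you want to peek at a few examples without pulling the whole dataset, you can slice a fresh stream with itertools; this sketch only relies on the as_stream() and close() calls shown above.
from itertools import islice

# Take the first two examples from a fresh stream, then close it.
stream = mmlu.as_stream()
for example in islice(stream, 2):
    print(example)
stream.close()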
Define our Evaluation Metric¶
In this notebook, we'll make use of the ExactMatchEvaluationMetric
.
from fed_rag.evals.metrics import ExactMatchEvaluationMetric
metric = ExactMatchEvaluationMetric()
All BaseEvaluationMetric subclasses are directly callable (i.e., their special __call__ method is implemented). We can see the signature of this method by using the help builtin.
help(metric.__call__)
Exact match is case insensitive.
metric(prediction="A", actual="A") # scores 1
1.0
metric(prediction="A", actual="a") # also scores 1
1.0
metric(prediction="A", actual="b") # scores 0
0.0
Multiple-Choice Prompt: Default and Customization¶
fed-rag sets a default prompt for MMLU so that LLMs like Gemma 3n answer with only a single letter (A, B, C, or D). This keeps the evaluation automatic and reliable.
mmlu.prompt_template = """
{question}
A. {A}
B. {B}
C. {C}
D. {D}
You MUST answer ONLY with a single letter (A, B, C, or D). Do NOT provide any explanation, reasoning, or extra text. Answers with any additional text are considered incorrect.
Example of a correct answer:
B
Example of an incorrect answer:
B. Because...
Answer:
"""
Customizing the prompt:
If you want to experiment with different prompting strategies, just assign your own prompt to mmlu.prompt_template, as shown above, to override the default.
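For example, here is a hypothetical alternative template (shown but not applied, so the run below still uses the template set above); it reuses the same {question} and {A}-{D} placeholders.
# Hypothetical alternative template; uncomment the last line to apply it.
custom_template = """Answer the following multiple-choice question.

{question}

A. {A}
B. {B}
C. {C}
D. {D}

Reply with exactly one letter (A, B, C, or D) and nothing else.
Answer:"""

# mmlu.prompt_template = custom_template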
Run the benchmark!¶
result = benchmarker.run(
    benchmark=mmlu,
    metric=metric,
    is_streaming=True,
    num_examples=3,  # for quick testing, only run on 3 examples
    agg="avg",  # can be 'avg', 'sum', 'max', 'min'
    save_evaluations=True,  # needs fed-rag v0.0.23 or above
)
print(result)
Load evaluated examples¶
from fed_rag.evals.utils import load_evaluations
evaluations = load_evaluations(result.evaluations_file)
print("Score for first example:", evaluations[0].score)
print("Model output for second example:", evaluations[1].rag_response.response)
print("Prompt template used:\n", mmlu.prompt_template)