Benchmark a RAG System¶
In this quick start guide, we'll demonstrate how to leverage the `evals` module within the `fed-rag` library to benchmark your `RAGSystem`. For conciseness, we won't cover the detailed process of assembling a `RAGSystem` here; please refer to our other quick start guides for comprehensive instructions on system assembly.
The `Benchmarker` Class¶
Within the `evals` module is a core class called `Benchmarker`. It bears the responsibility of running a benchmark for your `RAGSystem`.
```python
from fed_rag.evals import Benchmarker

benchmarker = Benchmarker(rag_system=rag_system)  # (1)!
```
- Your previously assembled `RAGSystem`.
Importing a Benchmark to Run¶
The `evals` module contains various benchmarks that can be used to evaluate a `RAGSystem`. A FedRAG benchmark contains `BenchmarkExamples` that carry the query, response, and context for any given example. Inspired by how datasets are imported from familiar libraries like `torchvision`, the benchmarks can be imported as follows:
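Here is a minimal sketch of the import, assuming the benchmarks are exposed under a `fed_rag.evals.benchmarks` namespace (consistent with the `benchmarks.` prefix used in the next snippet):

```python
# assumed import path for the FedRAG benchmarks namespace
import fed_rag.evals.benchmarks as benchmarks
```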
From here, we can choose to use any of the defined benchmarks! The snippet below makes use of the `HuggingFaceMMLU` benchmark.
```python
mmlu = benchmarks.HuggingFaceMMLU(streaming=True)  # (1)!

# get the example stream
examples_stream = mmlu.as_stream()
print(next(examples_stream))  # will yield the next BenchmarkExample for MMLU
```
- The HuggingFace benchmarks integration supports the underlying streaming mechanism of `datasets.Dataset`.
Info

Using a HuggingFace-supported benchmark requires the `huggingface-evals` extra. This can be installed via `pip install fed-rag[huggingface-evals]`. Note that the more comprehensive `huggingface` extra also includes all necessary packages for `huggingface-evals`.
Choosing your Evaluation Metric¶
To run a benchmark, you must also supply an `EvaluationMetric`. The code snippet below imports the `ExactMatchEvaluationMetric`.
```python
from fed_rag.evals.metrics import ExactMatchEvaluationMetric

metric = ExactMatchEvaluationMetric()

# using the metric
metric(prediction="A", actual="a")  # case-insensitive; returns 1.0
```
Info

All subclasses of `BaseEvaluationMetric`, like `ExactMatchEvaluationMetric`, are callable. We can see the signature of this method by using the `help` built-in, i.e., `help(metric.__call__)`.
Running the Benchmark¶
We now have all the elements in place to run the benchmark. To do so, we invoke the `run()` method of the `Benchmarker` object, passing in the elements we defined in previous sections.
```python
result = benchmarker.run(
    benchmark=mmlu,
    metric=metric,
    is_streaming=True,
    num_examples=3,  # (1)!
    agg="avg",  # (2)!
)
print(result)
```
- (Optional) Useful for rapid testing of your benchmark rig.
- Can be `'avg'`, `'sum'`, `'max'`, or `'min'`; see `AggregationMode`.
A successful run of a benchmark will result in a `BenchmarkResult` object that contains summary information about the benchmark, including the final aggregated score, the number of examples used, and the total number of examples that the benchmark contains.
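As a rough illustration of inspecting that result, here is a hypothetical sketch; the attribute names used (`score`, `num_examples_used`, `num_total_examples`) are assumptions and may differ from the actual `BenchmarkResult` fields:

```python
# hypothetical attribute names -- consult the BenchmarkResult definition in fed_rag.evals
print(result.score)               # final aggregated score
print(result.num_examples_used)   # number of examples evaluated in this run
print(result.num_total_examples)  # total number of examples in the benchmark
```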