Basic Starter Example¶
In this notebook, we'll build a RAGSystem and fine-tune both the generator and retriever using the huggingface extra.
Install dependencies¶
# If running in Google Colab, the first attempt at installing fed-rag may fail;
# re-running the cell a second time typically resolves it.
!pip install fed-rag[huggingface] -q
Build the RAG System¶
Knowledge Store and Retriever¶
from fed_rag.knowledge_stores.in_memory import InMemoryKnowledgeStore
from fed_rag.retrievers.huggingface.hf_sentence_transformer import (
HFSentenceTransformerRetriever,
)
knowledge_store = InMemoryKnowledgeStore()
retriever = HFSentenceTransformerRetriever(
query_model_name="nthakur/dragon-plus-query-encoder",
context_model_name="nthakur/dragon-plus-context-encoder",
load_model_at_init=False,
)
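With load_model_at_init=False, the encoder weights are only loaded the first time they're needed. As a quick sanity check (assuming the retriever exposes an encode_query counterpart to the encode_context method used below), you can embed a test query:
# quick sanity check: embedding a test query triggers the lazy model load
query_emb = retriever.encode_query("Who was Charles Darwin?")
print(query_emb.shape)  # DRAGON+ encoders produce 768-dim embeddings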
Let's Add Some Knowledge¶
# a small sample from the Dec 2021 Wikipedia dump
text_chunks = [
{
"id": "140",
"title": "History of marine biology",
"section": "James Cook",
"text": " James Cook is well known for his voyages of exploration for the British Navy in which he mapped out a significant amount of the world's uncharted waters. Cook's explorations took him around the world twice and led to countless descriptions of previously unknown plants and animals. Cook's explorations influenced many others and led to a number of scientists examining marine life more closely. Among those influenced was Charles Darwin who went on to make many contributions of his own. ",
},
{
"id": "141",
"title": "History of marine biology",
"section": "Charles Darwin",
"text": " Charles Darwin, best known for his theory of evolution, made many significant contributions to the early study of marine biology. He spent much of his time from 1831 to 1836 on the voyage of HMS Beagle collecting and studying specimens from a variety of marine organisms. It was also on this expedition where Darwin began to study coral reefs and their formation. He came up with the theory that the overall growth of corals is a balance between the growth of corals upward and the sinking of the sea floor. He then came up with the idea that wherever coral atolls would be found, the central island where the coral had started to grow would be gradually subsiding",
},
{
"id": "142",
"title": "History of marine biology",
"section": "Charles Wyville Thomson",
"text": " Another influential expedition was the voyage of HMS Challenger from 1872 to 1876, organized and later led by Charles Wyville Thomson. It was the first expedition purely devoted to marine science. The expedition collected and analyzed thousands of marine specimens, laying the foundation for present knowledge about life near the deep-sea floor. The findings from the expedition were a summary of the known natural, physical and chemical ocean science to that time.",
},
{
"id": "143",
"title": "History of marine biology",
"section": "Later exploration",
"text": " This era of marine exploration came to a close with the first and second round-the-world voyages of the Danish Galathea expeditions and Atlantic voyages by the USS Albatross, the first research vessel purpose built for marine research. These voyages further cleared the way for modern marine biology by building a base of knowledge about marine biology. This was followed by the progressive development of more advanced technologies which began to allow more extensive explorations of ocean depths that were once thought too deep to sustain life.",
},
{
"id": "144",
"title": "History of marine biology",
"section": "Marine biology labs",
"text": " In the 1960s and 1970s, ecological research into the life of the ocean was undertaken at institutions set up specifically to study marine biology. Notable was the Woods Hole Oceanographic Institution in America, which established a model for other marine laboratories subsequently set up around the world. Their findings of unexpectedly high species diversity in places thought to be inhabitable stimulated much theorizing by population ecologists on how high diversification could be maintained in such a food-poor and seemingly hostile environment. ",
},
{
"id": "145",
"title": "History of marine biology",
"section": "Exploration technology",
"text": " In the past, the study of marine biology has been limited by a lack of technology as researchers could only go so deep to examine life in the ocean. Before the mid-twentieth century, the deep-sea bottom could not be seen unless one dredged a piece of it and brought it to the surface. This has changed dramatically due to the development of new technologies in both the laboratory and the open sea. These new technological developments have allowed scientists to explore parts of the ocean they didn't even know existed. The development of scuba gear allowed researchers to visually explore the oceans as it contains a self-contained underwater breathing apparatus allowing a person to breathe while being submerged 100 to 200 feet ",
},
{
"id": "146",
"title": "History of marine biology",
"section": "Exploration technology",
"text": " the ocean. Submersibles were built like small submarines with the purpose of taking marine scientists to deeper depths of the ocean while protecting them from increasing atmospheric pressures that cause complications deep under water. The first models could hold several individuals and allowed limited visibility but enabled marine biologists to see and photograph the deeper portions of the oceans. Remotely operated underwater vehicles are now used with and without submersibles to see the deepest areas of the ocean that would be too dangerous for humans. ROVs are fully equipped with cameras and sampling equipment which allows researchers to see and control everything the vehicle does. ROVs have become the dominant type of technology used to view the deepest parts of the ocean.",
},
{
"id": "147",
"title": "History of marine biology",
"section": "Romanticization",
"text": ' In the late 20th century and into the 21st, marine biology was "glorified and romanticized through films and television shows," leading to an influx in interested students who required a damping on their enthusiasm with the day-to-day realities of the field.',
},
{
"id": "148",
"title": "Wynthryth",
"section": "",
"text": " Wynthryth of March was an early medieval saint of Anglo Saxon England. He is known to history from the Secgan Hagiography and The Confraternity Book of St Gallen. Very little is known of his life or career. However, he was associated with the town of March, Cambridgeshire, and he may have been a relative of King Ethelstan.",
},
{
"id": "149",
"title": "James M. Safford",
"section": "",
"text": " James Merrill Safford (1822–1907) was an American geologist, chemist and university professor.",
},
{
"id": "150",
"title": "James M. Safford",
"section": "Early life",
"text": " James M. Safford was born in Putnam, Ohio on August 13, 1822. He received an M.D. and a PhD. He was trained as a chemist at Yale University. He married Catherine K. Owens in 1859, and they had two children.",
},
{
"id": "151",
"title": "James M. Safford",
"section": "Career",
"text": " Safford taught at Cumberland University in Lebanon, Tennessee from 1848 to 1873. He served as a Professor of Mineralogy, Botany, and Economical Geology at Vanderbilt University in Nashville, Tennessee from 1875 to 1900. He was a Presbyterian, and often started his lessons with a prayer. He served on the Tennessee Board of Health. Additionally, he acted as a chemist for the Tennessee Bureau of Agriculture in the 1870s and 1880s. He published fifty-four books, reports, and maps.",
},
{
"id": "152",
"title": "James M. Safford",
"section": "Death",
"text": " He died in Dallas on July 2, 1907.",
},
]
From these text chunks, we can create our KnowledgeNodes.
from fed_rag.data_structures import KnowledgeNode, NodeType
# create knowledge nodes
nodes = []
texts = []
for c in text_chunks:
    # pull out the fields used to build the context string; whatever remains
    # in the chunk dict (here, just the "id") becomes node metadata below
    text = c.pop("text")
    title = c.pop("title")
    section = c.pop("section")
    context_text = f"title: {title}\nsection: {section}\ntext: {text}"
    texts.append(context_text)

# batch encode all context strings with the context encoder
batch_embeddings = retriever.encode_context(texts)

for jx, c in enumerate(text_chunks):
    node = KnowledgeNode(
        embedding=batch_embeddings[jx].tolist(),
        node_type=NodeType.TEXT,
        text_content=texts[jx],
        metadata=c,  # the remaining fields after the pops above
    )
    nodes.append(node)
nodes[0].model_dump()
# load nodes
knowledge_store.load_nodes(nodes)
knowledge_store.count
Define an LLM Generator¶
from fed_rag.generators.huggingface import HFPretrainedModelGenerator
import torch
from transformers.generation.utils import GenerationConfig
generation_cfg = GenerationConfig(
do_sample=True,
eos_token_id=151643,
bos_token_id=151643,
max_new_tokens=2048,
top_p=0.9,
temperature=0.6,
cache_implementation="offloaded",
stop_strings="</response>",
)
generator = HFPretrainedModelGenerator(
model_name="Qwen/Qwen2.5-0.5B",
load_model_at_init=False,
load_model_kwargs={"device_map": "auto", "torch_dtype": torch.float16},
generation_config=generation_cfg,
)
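The hardcoded token id above (151643) is Qwen2.5's <|endoftext|> token. If you swap in a different base model, you can read the correct value off its tokenizer instead of hardcoding it; a small sketch using the standard transformers API:
from transformers import AutoTokenizer

# confirm the special token id used in generation_cfg
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B")
print(tokenizer.eos_token_id)  # 151643 for Qwen2.5 base models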
Assemble the RAG System¶
from fed_rag import RAGSystem, RAGConfig
rag_config = RAGConfig(top_k=2)
rag_system = RAGSystem(
knowledge_store=knowledge_store,  # the knowledge store we populated above
generator=generator,
retriever=retriever,
rag_config=rag_config,
)
# test a query
response = rag_system.query("Who is James Cook?")
print(response)
RAG Fine-tuning¶
In this part of the notebook, we demonstrate how to fine-tune the RAGSystem we just built and queried. To do so, we'll use a RetrieverTrainer and a GeneratorTrainer to fine-tune the retriever and generator, respectively.
The Train Dataset¶
Although the retriever and generator are trained independently, both follow a standardized process. The first step is building the training dataset, which is essentially a set of (query, response) pairs.
from datasets import Dataset
train_dataset = Dataset.from_dict(
# examples from Commonsense QA
{
"query": [
"The sanctions against the school were a punishing blow, and they seemed to what the efforts the school had made to change?",
"Sammy wanted to go to where the people were. Where might he go?",
"To locate a choker not located in a jewelry box or boutique where would you go?",
"Google Maps and other highway and street GPS services have replaced what?",
"The fox walked from the city into the forest, what was it looking for?",
],
"response": [
"ignore",
"populated areas",
"jewelry store",
"atlas",
"natural habitat",
],
}
)
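The five pairs above are just for illustration. To build a larger training set, one option (sketched below, untested here) is to pull Commonsense QA from the Hugging Face Hub and map it to the same query/response schema:
from datasets import load_dataset

# hypothetical larger train set: map Commonsense QA examples to (query, response) pairs
cqa = load_dataset("tau/commonsense_qa", split="train[:100]")

def to_pair(example):
    # look up the answer text matching the gold answer key (e.g. "A")
    idx = example["choices"]["label"].index(example["answerKey"])
    return {"query": example["question"], "response": example["choices"]["text"][idx]}

larger_train_dataset = cqa.map(to_pair, remove_columns=cqa.column_names)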
Retriever Fine-Tuning (LSR)¶
Here, we'll perform LM-Supervised retriever fine-tuning. For a tutorial on this trainer, see our docs.
The HuggingFaceTrainerForLSR is a container class for a custom-built sentence_transformers.SentenceTransformerTrainer that performs training of the retriever model using the LSR loss.
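Roughly speaking, LM-supervised retrieval uses the generator as a teacher: for a query $q$ with gold response $y$ and retrieved chunks $c_1, \dots, c_k$, the retriever's distribution over the chunks is pushed toward the distribution induced by how likely the generator finds $y$ given each chunk. A hedged sketch of the objective, in the style of the REPLUG LSR loss (the exact formulation used by fed-rag may differ in details):

$$\mathcal{L}_{\mathrm{LSR}} = \mathrm{KL}\big(\, P_R(c_i \mid q) \;\|\; Q_{LM}(c_i \mid q, y) \,\big)$$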
from fed_rag.trainers.huggingface.lsr import HuggingFaceTrainerForLSR
# the trainer object
retriever_trainer = HuggingFaceTrainerForLSR(
rag_system=rag_system,
train_dataset=train_dataset,
# training_arguments=...  # optional transformers.TrainingArguments
)
# raw HF trainer object
retriever_trainer.hf_trainer_obj
result = retriever_trainer.train()
result.loss
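Both trainers accept an optional training_arguments, as hinted in the commented-out line above. A minimal sketch with hypothetical hyperparameters (the same pattern applies to the RALT trainer below):
from transformers import TrainingArguments

# hypothetical hyperparameters; tune these for your own setup
training_args = TrainingArguments(
    output_dir="./lsr_checkpoints",
    num_train_epochs=1,
    per_device_train_batch_size=2,
    learning_rate=2e-5,
    logging_steps=1,
)

retriever_trainer = HuggingFaceTrainerForLSR(
    rag_system=rag_system,
    train_dataset=train_dataset,
    training_arguments=training_args,
)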
Generator Fine-tuning (RALT)¶
Here, we'll perform Retrieval-Augmented LM (Generator) fine-tuning. For a tutorial on this trainer, see our docs.
The HuggingFaceTrainerForRALT is a container class for a custom-built transformers.Trainer that performs training of the generator model using the causal language modelling (next-token prediction) task.
from fed_rag.trainers.huggingface.ralt import HuggingFaceTrainerForRALT
# the trainer object
generator_trainer = HuggingFaceTrainerForRALT(
rag_system=rag_system,
train_dataset=train_dataset,
# training_arguments=...  # optional transformers.TrainingArguments
)
# raw HF trainer object
generator_trainer.hf_trainer_obj
result = generator_trainer.train()
result.loss
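Once training completes, you can persist the fine-tuned weights through the underlying Hugging Face trainer objects (output paths below are illustrative):
# save fine-tuned weights via the wrapped HF trainers
generator_trainer.hf_trainer_obj.save_model("./ralt-generator")
retriever_trainer.hf_trainer_obj.save_model("./lsr-retriever")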
Closing Remarks¶
In this notebook, we used a simplified example to demonstrate building and fine-tuning a RAG system with HuggingFace models.