(NOTE: if running on Colab, you will need to supply a WandB API Key in addition to your HFToken. Also, you'll need to change the runtime to a T4.)

Using LlamaIndex for Inference¶

Introduction¶

After fine-tuning your RAG system to achieve desired performance, you'll want to deploy it for inference. While FedRAG's RAGSystem provides complete inference capabilities out of the box, you may need additional features for production deployments or want to leverage the ecosystem of existing RAG frameworks.

FedRAG offers a seamless integration into LlamaIndex through our bridges system, giving you the best of both worlds: FedRAG's fine-tuning capabilities combined with the extensive inference features of LlamaIndex.

In this example, we demonstrate how you can convert a RAGSystem to a ~llama_index.BaseManagedIndex from which you can obtain ~llama_index.QueryEngine as well as ~llama_index.Retriever.

NOTE: Streaming and async functionalities are not yet supported.

Install dependencies¶

In [1]:

Copied!

# If running in a Google Colab, the first attempt at installing fed-rag may fail,
# though for reasons unknown to me yet, if you try a second time, it magically works...
!pip install fed-rag[huggingface,llama-index] -q
# If running in a Google Colab, the first attempt at installing fed-rag may fail,
# though for reasons unknown to me yet, if you try a second time, it magically works...
!pip install fed-rag[huggingface,llama-index] -q

zsh:1: no matches found: fed-rag[huggingface,llama-index]

Setup — The RAG System¶

In [ ]:

Copied!





import torch
from transformers.generation.utils import GenerationConfig

from fed_rag import RAGSystem, RAGConfig
from fed_rag.generators.huggingface import HFPretrainedModelGenerator
from fed_rag.retrievers.huggingface import (
    HFSentenceTransformerRetriever,
)
from fed_rag.knowledge_stores import InMemoryKnowledgeStore
from fed_rag.data_structures import KnowledgeNode, NodeType


QUERY_ENCODER_NAME = "nthakur/dragon-plus-query-encoder"
CONTEXT_ENCODER_NAME = "nthakur/dragon-plus-context-encoder"
PRETRAINED_MODEL_NAME = "Qwen/Qwen3-0.6B"

# Retriever
retriever = HFSentenceTransformerRetriever(
    query_model_name=QUERY_ENCODER_NAME,
    context_model_name=CONTEXT_ENCODER_NAME,
    load_model_at_init=False,
)

# Generator
generation_cfg = GenerationConfig(
    do_sample=True,
    eos_token_id=151643,
    bos_token_id=151643,
    max_new_tokens=2048,
    top_p=0.9,
    temperature=0.6,
    cache_implementation="offloaded",
    stop_strings="</response>",
)
generator = HFPretrainedModelGenerator(
    model_name=PRETRAINED_MODEL_NAME,
    load_model_at_init=False,
    load_model_kwargs={"device_map": "auto", "torch_dtype": torch.float16},
    generation_config=generation_cfg,
)

# Knowledge store
knowledge_store = InMemoryKnowledgeStore()


# Create the RAG system
rag_system = RAGSystem(
    retriever=retriever,
    generator=generator,
    knowledge_store=knowledge_store,
    rag_config=RAGConfig(top_k=1),
)
import torch
from transformers.generation.utils import GenerationConfig

from fed_rag import RAGSystem, RAGConfig
from fed_rag.generators.huggingface import HFPretrainedModelGenerator
from fed_rag.retrievers.huggingface import (
    HFSentenceTransformerRetriever,
)
from fed_rag.knowledge_stores import InMemoryKnowledgeStore
from fed_rag.data_structures import KnowledgeNode, NodeType


QUERY_ENCODER_NAME = "nthakur/dragon-plus-query-encoder"
CONTEXT_ENCODER_NAME = "nthakur/dragon-plus-context-encoder"
PRETRAINED_MODEL_NAME = "Qwen/Qwen3-0.6B"

# Retriever
retriever = HFSentenceTransformerRetriever(
    query_model_name=QUERY_ENCODER_NAME,
    context_model_name=CONTEXT_ENCODER_NAME,
    load_model_at_init=False,
)

# Generator
generation_cfg = GenerationConfig(
    do_sample=True,
    eos_token_id=151643,
    bos_token_id=151643,
    max_new_tokens=2048,
    top_p=0.9,
    temperature=0.6,
    cache_implementation="offloaded",
    stop_strings="",
)
generator = HFPretrainedModelGenerator(
    model_name=PRETRAINED_MODEL_NAME,
    load_model_at_init=False,
    load_model_kwargs={"device_map": "auto", "torch_dtype": torch.float16},
    generation_config=generation_cfg,
)

# Knowledge store
knowledge_store = InMemoryKnowledgeStore()


# Create the RAG system
rag_system = RAGSystem(
    retriever=retriever,
    generator=generator,
    knowledge_store=knowledge_store,
    rag_config=RAGConfig(top_k=1),
)

Add some knowledge¶

In [2]:

Copied!





text_chunks = [
    "Retrieval-Augmented Generation (RAG) combines retrieval with generation.",
    "LLMs can hallucinate information when they lack context.",
]
knowledge_nodes = [
    KnowledgeNode(
        node_type="text",
        embedding=retriever.encode_context(ct).tolist(),
        text_content=ct,
    )
    for ct in text_chunks
]
knowledge_store.load_nodes(knowledge_nodes)
text_chunks = [
    "Retrieval-Augmented Generation (RAG) combines retrieval with generation.",
    "LLMs can hallucinate information when they lack context.",
]
knowledge_nodes = [
    KnowledgeNode(
        node_type="text",
        embedding=retriever.encode_context(ct).tolist(),
        text_content=ct,
    )
    for ct in text_chunks
]
knowledge_store.load_nodes(knowledge_nodes)

In [3]:

Copied!

rag_system.knowledge_store.count
rag_system.knowledge_store.count

Out[3]:

Using the Bridge¶

Converting your RAG system to a LlamaIndex object is seamless since the bridge functionality is already built into the RAGSystem class. The RAGSystem inherits from LlamaIndexBridgeMixin, which provides the to_llamaindex() method for effortless conversion.

NOTE: The to_llamaindex() method returns a FedRAGManagedIndex object, which is a custom implementation of the ~llama_index.BaseManagedIndex class.

In [4]:

Copied!





# Create a llamaindex object
index = rag_system.to_llamaindex()

# Use it like any other LlamaIndex object to get a query engine
query = "What happens if LLMs lack context?"
query_engine = index.as_query_engine()
response = query_engine.query(query)
print(response, "\n")

# Or, get a retriever
retriever = index.as_retriever()
results = retriever.retrieve(query)
for node in results:
    print(f"Score: {node.score}, Content: {node.node}")
# Create a llamaindex object
index = rag_system.to_llamaindex()

# Use it like any other LlamaIndex object to get a query engine
query = "What happens if LLMs lack context?"
query_engine = index.as_query_engine()
response = query_engine.query(query)
print(response, "\n")

# Or, get a retriever
retriever = index.as_retriever()
results = retriever.retrieve(query)
for node in results:
    print(f"Score: {node.score}, Content: {node.node}")

Sliding Window Attention is enabled but not implemented for `sdpa`; unexpected results may be encountered.
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:151643 for open-end generation.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.

Context information is below.
---------------------
LLMs can hallucinate information when they lack context.

Retrieval-Augmented Generation (RAG) combines retrieval with generation.
---------------------
Given the context information and not prior knowledge, answer the query.
Query: What happens if LLMs lack context?
Answer: 1. LLMs (Language Model Generators) can hallucinate information when they lack context. This means that without sufficient information or context, LLMs may generate responses that are not based on real-world facts or previous knowledge.

2. LLMs are designed to generate responses based on the input provided, but when the input is incomplete or lacks context, they may fill in the gaps with their own assumptions or generate incorrect or irrelevant information.

3. LLMs rely on their training data and the context in which it was generated to make predictions. If the context is not present or incomplete, the LLM may generate responses that are not relevant or accurate.

4. LLMs can be trained on large datasets, but they still require context to understand the meaning of the input. Without context, LLMs may generate responses that are not relevant or accurate.

5. It is important to provide context when using LLMs to ensure that the generated responses are accurate and relevant. This can be done by providing the necessary information or context to the LLM, or by using a combination of LLMs and other tools to generate responses.

Score: 0.5453173113645673, Content: Node ID: 8864707f-9fce-49f3-aa34-7370b41bfc4f
Text: LLMs can hallucinate information when they lack context.
Score: 0.5065647593667755, Content: Node ID: eb4b722f-1e0c-4434-915c-4cd9db604dba
Text: Retrieval-Augmented Generation (RAG) combines retrieval with
generation.

Modifying Knowledge¶

In addition to querying the bridged index, you can also make changes to the underlying KnowledgeStore using LlamaIndex's API:

In [5]:

Copied!





from llama_index.core.schema import Node, MediaResource

llama_nodes = [
    Node(
        embedding=[1, 1, 1],
        text_resource=MediaResource(text="some arbitrary text"),
    ),
    Node(
        embedding=[2, 2, 2],
        text_resource=MediaResource(text="some more arbitrary text"),
    ),
]
index.insert_nodes(llama_nodes)
from llama_index.core.schema import Node, MediaResource

llama_nodes = [
    Node(
        embedding=[1, 1, 1],
        text_resource=MediaResource(text="some arbitrary text"),
    ),
    Node(
        embedding=[2, 2, 2],
        text_resource=MediaResource(text="some more arbitrary text"),
    ),
]
index.insert_nodes(llama_nodes)

In [6]:

Copied!

# confirm that what we added above is indeed in the knowledge store
rag_system.knowledge_store.count
# confirm that what we added above is indeed in the knowledge store
rag_system.knowledge_store.count

Out[6]:

In [7]:

Copied!

# you can also delete nodes
index.delete_nodes(node_ids=[node.node_id for node in llama_nodes])
# you can also delete nodes
index.delete_nodes(node_ids=[node.node_id for node in llama_nodes])

In [8]:

Copied!

# confirm that what we deleted above is indeed removed from the knowledge store
rag_system.knowledge_store.count
# confirm that what we deleted above is indeed removed from the knowledge store
rag_system.knowledge_store.count

Out[8]:

Advanced Usage¶

You can combine your bridged index with LlamaIndex's advanced features:

In [9]:

Copied!





from llama_index.core.postprocessor import SentenceTransformerRerank

rerank = SentenceTransformerRerank(
    model="cross-encoder/ms-marco-MiniLM-L-2-v2", top_n=3
)

query_engine = index.as_query_engine(
    similarity_top_k=2, node_postprocessors=[rerank]
)

# Execute the query with the advanced configuration
response = query_engine.query("Explain the benefits of RAG systems")
print(response)
from llama_index.core.postprocessor import SentenceTransformerRerank

rerank = SentenceTransformerRerank(
    model="cross-encoder/ms-marco-MiniLM-L-2-v2", top_n=3
)

query_engine = index.as_query_engine(
    similarity_top_k=2, node_postprocessors=[rerank]
)

# Execute the query with the advanced configuration
response = query_engine.query("Explain the benefits of RAG systems")
print(response)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:151643 for open-end generation.

Context information is below.
---------------------
Retrieval-Augmented Generation (RAG) combines retrieval with generation.

LLMs can hallucinate information when they lack context.
---------------------
Given the context information and not prior knowledge, answer the query.
Query: Explain the benefits of RAG systems
Answer: 1. Retrieval-Augmented Generation (RAG) systems combine retrieval and generation to enhance the performance of LLMs. 2. RAG systems can provide more accurate and relevant results by leveraging external knowledge sources. 3. RAG systems can handle complex queries by breaking them down into smaller subqueries and combining the results. 4. RAG systems can improve the quality of generated text by providing context and guidance from external sources. 5. RAG systems can be used in various applications, such as search engines, chatbots, and virtual assistants, to provide better user experiences.

In [10]:

Copied!

response.source_nodes
response.source_nodes

Out[10]:

[NodeWithScore(node=Node(id_='eb4b722f-1e0c-4434-915c-4cd9db604dba', embedding=None, metadata={}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, metadata_template='{key}: {value}', metadata_separator='\n', text_resource=MediaResource(embeddings=None, data=None, text='Retrieval-Augmented Generation (RAG) combines retrieval with generation.', path=None, url=None, mimetype=None), image_resource=None, audio_resource=None, video_resource=None, text_template='{metadata_str}\n\n{content}'), score=-2.525338),
 NodeWithScore(node=Node(id_='8864707f-9fce-49f3-aa34-7370b41bfc4f', embedding=None, metadata={}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, metadata_template='{key}: {value}', metadata_separator='\n', text_resource=MediaResource(embeddings=None, data=None, text='LLMs can hallucinate information when they lack context.', path=None, url=None, mimetype=None), image_resource=None, audio_resource=None, video_resource=None, text_template='{metadata_str}\n\n{content}'), score=-11.860242)]

Bridge Metadata¶

To view the metadata of the LlamaIndex bridge, you can access the class attribute bridge of the RAGSystem class, which is a dictionary object that contains the BridgeMetadata for all of the installed bridges.

In [ ]:

Copied!

# see available bridges
print(RAGSystem.bridges)

# see the LlamaIndex bridge metadata
print(RAGSystem.bridges["llama-index-core"])
# see available bridges
print(RAGSystem.bridges)

# see the LlamaIndex bridge metadata
print(RAGSystem.bridges["llama-index-core"])

{'llama-index': {'bridge_version': '0.1.0', 'framework': 'llama-index', 'compatible_versions': ['0.12.35'], 'method_name': 'to_llamaindex'}}
{'bridge_version': '0.1.0', 'framework': 'llama-index', 'compatible_versions': ['0.12.35'], 'method_name': 'to_llamaindex'}