Build a NoEncode RAG System with an MCP Knowledge Store¶
Introduction¶
In traditional RAG systems, there are three components: a retriever, a knowledge store, and a generator. The retriever encodes a user's query and uses it to retrieve relevant knowledge chunks from the knowledge store, whose contents were previously encoded by the same retriever. The user query, along with the retrieved knowledge chunks, is then passed to the LLM generator to produce the final response to the original query.
With NoEncode RAG systems, knowledge is still kept in a knowledge store and retrieved for responses to user queries, but there is no encoding step at all. Instead of pre-computing embeddings, NoEncode RAG systems query knowledge sources directly using natural language.
Key Differences¶
Traditional RAG:
- Documents → Embed → Vector Store
- Query → Embed → Vector Search → Retrieve → Generate
NoEncode RAG:
- Knowledge Sources (MCP servers, APIs, databases)
- Query → Direct Natural Language Query → Retrieve → Generate
NOTE: Knowledge sources may themselves be traditional RAG systems, in which case encoding still happens inside those sources. However, the main RAG system does not handle encoding of queries or knowledge chunks at all.
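To make the flow concrete, here is a minimal conceptual sketch of NoEncode retrieval. The source and generator interfaces below are hypothetical placeholders (not the fed-rag API we use later); they only illustrate that the query travels to each knowledge source as plain natural language.
# conceptual sketch only -- hypothetical interfaces, not the fed-rag API
async def no_encode_rag(query: str, sources: list, generator) -> str:
    chunks: list[str] = []
    for source in sources:
        # each source (MCP server, API, database) answers a natural-language query directly;
        # there is no embedding or vector-search step in the main system
        chunks.extend(await source.retrieve(query))
    context = "\n\n".join(chunks)
    # the generator receives the raw query plus the retrieved context
    return generator.generate(query=query, context=context)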
Model Context Protocol (MCP)¶
MCP provides a standardized way for AI systems to connect to external tools and data sources. In our NoEncode RAG system, MCP servers act as live knowledge sources that can be queried directly with natural language. An MCP knowledge store acts as the MCP client host that creates connections to these servers and retrieves context from them.
Outline¶
In this cookbook, we will stand up two MCP knowledge sources, use them as part of an MCP knowledge store, and finally build an AsyncNoEncodeRAGSystem
that allows us to query these sources.
- MCP Knowledge Source 1: an AWS Kendra Index MCP Server
- MCP Knowledge Source 2: a LlamaCloud MCP Server
- Create an MCP Knowledge Store (using our two built sources)
- Assemble a NoEncode RAG System
MCP Knowledge Source 1: an AWS Kendra Index MCP Server¶
Here, we make use of one of the many officially supported AWS MCP servers offered by AWS Labs, namely their AWS Kendra Index MCP Server.
AWS Kendra is an enterprise search service powered by machine learning. It can search across various data sources including documents, FAQs, knowledge bases, and websites, providing intelligent answers to natural language queries.
Pre-requisite Steps¶
Create a Kendra Index¶
To use this MCP server, you need to create a new Kendra Index and add an S3 data connector that contains the RAFT paper. Make sure to sync your index so that it's ready to be queried. Finally, fill in the information below regarding your Kendra Index:
# info regarding your kendra index which needs to be passed to the MCP tool call
kendra_index_info = {
"indexId": "572aca26-16be-44df-84d3-4d96d778f120",
"region": "ca-central-1",
}
Build the Kendra Index MCP Server Docker image¶
With our Kendra index in hand, we can now build a local MCP server that interacts with it. To do this, we take the following steps:
- Clone the
awslabs/mcp
Github repo:
git clone https://github.com/awslabs/mcp.git
- cd into the Kendra index src directory:
cd mcp/src/amazon-kendra-index-mcp-server
- Locally build the Docker image
docker build -t awslabs/amazon-kendra-index-mcp-server .
Configure AWS Credentials¶
Create a .env
file in the same directory as this notebook, with your AWS credentials:
# .env file
AWS_ACCESS_KEY_ID=...
AWS_SECRET_ACCESS_KEY=...
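Optionally, you can sanity-check that the index is queryable before wiring it into an MCP server. The snippet below is just a sketch: it assumes boto3 and python-dotenv are installed and that the credentials in your .env file have permission to query Kendra.
# optional sanity check (assumes boto3 and python-dotenv are installed)
import boto3
from dotenv import load_dotenv

load_dotenv()  # makes AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY visible to boto3
kendra = boto3.client("kendra", region_name=kendra_index_info["region"])
resp = kendra.query(
    IndexId=kendra_index_info["indexId"], QueryText="What is RAFT?"
)
print("Total results:", resp["TotalNumberOfResults"])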
Build the MCP Stdio Knowledge Source¶
import os
from mcp import StdioServerParameters
from fed_rag.knowledge_stores.no_encode import MCPStdioKnowledgeSource
server_params = StdioServerParameters(
command="docker",
args=[
"run",
"--rm",
"--interactive",
"--init", # important to have in Jupyter Notebook
"--env-file",
f"{os.getcwd()}/.env",
"awslabs/amazon-kendra-index-mcp-server:latest",
],
)
mcp_source = MCPStdioKnowledgeSource(
name="awslabs.amazon-kendra-index-mcp-server",
server_params=server_params,
tool_name="KendraQueryTool",
query_param_name="query",
tool_call_kwargs=kendra_index_info,
)
Let's test out our new MCP source by invoking the retrieve()
method with a specific query. This will return an ~mcp.CallToolResult
object.
call_tool_result = await mcp_source.retrieve("What is RAFT?")
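Before converting anything, it can be useful to peek at the raw payload. The check below is a small sketch that assumes the first content item is a TextContent, which is the case for this server.
# inspect the raw tool result (assumes the first content item is TextContent)
print("isError:", call_tool_result.isError)
first_content = call_tool_result.content[0]
if first_content.type == "text":
    print(first_content.text[:300])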
Converting MCP Call Tool Results to Knowledge Nodes¶
MCP tool results are automatically converted to KnowledgeNode
objects using a default converter in MCPStdioKnowledgeSource
. This generic converter works for basic use cases but may not extract all valuable information from server-specific responses. Implement a custom converter to optimize knowledge extraction for your particular MCP server. Let's see the default converter in action first and determine if we need to create our own converter function.
# using the default converter function
knowledge_nodes = mcp_source.call_tool_result_to_knowledge_nodes_list(
call_tool_result
)
print("Number of knowledge nodes created: ", len(knowledge_nodes), "\n")
print(
"Text content of first created node:\n",
knowledge_nodes[0].text_content[:500],
)
Number of knowledge nodes created: 1 Text content of first created node: {"query": "What is RAFT?", "total_results_count": 4, "results": [{"id": "a7439424-ac28-4f3c-9d94-f1c041e57541-8698aea8-20ac-49f2-9a79-298992bd1bd9", "type": "ANSWER", "document_title": "raft.pdf", "document_uri": "https://fed-rag-mcp-cookbook.s3.ca-central-1.amazonaws.com/raft.pdf", "score": "HIGH", "excerpt": "In this paper, we present Retrieval Augmented\nFine Tuning (RAFT), a training recipe which improves the model\u2019s ability\nto answer questions in \"open-book\" in-domain settings. In t
As we can see, this conversion is not ideal. We should only pass the excerpt
text to the LLM generator. Thus, we should define our own converter function that extracts just that content.
According to the source code for this server, we see that a successful tool call of KendraQueryTool
will return a CallToolResult
whose text
content is a JSON string containing a results
key. The value for results
is a list of result_items
each containing an excerpt
field, which is ultimately what we want to pass to the LLM generator.
Let's create a custom converter function to do this now.
import json
import re
from typing import Any
from mcp.types import CallToolResult
from fed_rag.data_structures import KnowledgeNode
# signature of a converter function
def kendra_index_converter_fn(
result: CallToolResult, metadata: dict[str, Any] | None = None
) -> list[KnowledgeNode]:
nodes = []
for c in result.content:
if c.type == "text": # only use ~mcp.TextContent
data = json.loads(c.text)
for res in data["results"]:
# take only the content in the "excerpt" key
text_content = re.sub(r"\s+", " ", res["excerpt"].strip())
nodes.append(
# create the knowledge node
KnowledgeNode(
node_type="text",
text_content=text_content,
metadata=metadata,
)
)
return nodes
Let's test out our custom converter as a standalone function on the previously obtained call_tool_result
.
knowledge_nodes = kendra_index_converter_fn(call_tool_result)
print("Number of knowledge nodes created: ", len(knowledge_nodes), "\n")
print(
"Text content of first created node:\n",
knowledge_nodes[0].text_content[:500],
"\n",
)
print(
"Text content of second created node:\n",
knowledge_nodes[1].text_content[:500],
)
Number of knowledge nodes created: 4 Text content of first created node: In this paper, we present Retrieval Augmented Fine Tuning (RAFT), a training recipe which improves the model’s ability to answer questions in "open-book" in-domain settings. In training RAFT, given a question, and a set of retrieved documents, we train the model to ignore those documents that don’t Text content of second created node: 3 RAFT In this section, we present RAFT, a novel way of training LLMs for domain-specific open- book exams. We first introduce the classical technique of supervised fine-tuning, followed with the key takeaways from our experiments. Then, we introduce RAFT , a modified version of general instructio
This is much improved and should work better when passed down as context to the LLM generator. We can easily update our mcp_source
to use this converter function.
# update the converter function
mcp_source = mcp_source.with_converter(kendra_index_converter_fn)
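As a quick check, re-running the conversion on the earlier call_tool_result should now yield one KnowledgeNode per Kendra result item (four in our case), since with_converter returns a source that uses our custom function.
# quick check: the updated source now applies our custom converter
nodes = mcp_source.call_tool_result_to_knowledge_nodes_list(call_tool_result)
print("Nodes from updated source:", len(nodes))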
MCP Knowledge Source 2: a LlamaCloud MCP Server¶
In this part of the cookbook, we'll stand up an MCP server using LlamaCloud—an enterprise solution by LlamaIndex—by following their MCP demo.
LlamaCloud provides document parsing, indexing, and retrieval capabilities. By exposing these through an MCP server, we can query processed documents directly using natural language without managing our own document processing pipeline.
Pre-requisite Steps¶
The steps below follow from the setup instructions listed in the Github repo for the llamacloud-mcp demo.
This requires a LlamaCloud account. If you don't have one, you can create one by visiting https://cloud.llamaindex.ai/.
Create a LlamaCloud Index¶
Log in to LlamaCloud with your account and navigate to "Tool" > "Index" in the left sidebar. Click the "Create Index" button to create a new index. After creating the new index, upload the RA-DIT paper.
NOTE: You'll need to supply information on your new index in the next step.
Create a local MCP server to connect with the LlamaCloud Index¶
- Clone the
llamacloud-mcp
demo Github repo:
git clone https://github.com/run-llama/llamacloud-mcp.git
- cd into the
llamacloud-mcp
directory:
cd llamacloud-mcp
- Update the
mcp-server.py
file with the following contents:
from dotenv import load_dotenv
from mcp.server.fastmcp import FastMCP
from llama_index.indices.managed.llama_cloud import LlamaCloudIndex
import os
load_dotenv()
mcp = FastMCP('llama-index-server')
@mcp.tool(name='LlamaCloudQueryTool')
def llama_index_query(query: str) -> str:
"""Search the llama-index documentation for the given query."""
index = LlamaCloudIndex(
name="<your-llamacloud-index-name>",
project_name="Default", # change this if you didn't use default project name
organization_id="<your-llamacloud-org-id>",
api_key=os.getenv("LLAMA_CLOUD_API_KEY"),
)
response = index.as_query_engine().query(query + " Be verbose and include code examples.")
return str(response)
if __name__ == "__main__":
mcp.run(transport="stdio")
- Create a
.env
file in the
llamacloud-mcp
directory
# .env
LLAMA_CLOUD_API_KEY=<your-llamacloud-api-key>
OPENAI_API_KEY=<your-openai-api-key>
NOTE: This llamacloud-mcp demo builds a ~llama_index.QueryEngine()
which by default uses an OpenAI LLM.
Build the MCP Stdio Knowledge Source¶
# change this to your actual path
path_to_llamacloud_mcp = "/home/nerdai/OSS"
llama_cloud_server_params = StdioServerParameters(
command="sh",
args=[
"-c",
f"cd {path_to_llamacloud_mcp}/llamacloud-mcp && poetry install && exec poetry run python mcp-server.py",
],
)
llama_cloud_mcp_source = MCPStdioKnowledgeSource(
name="llama-index-server",
server_params=llama_cloud_server_params,
tool_name="LlamaCloudQueryTool",
query_param_name="query",
)
res = await llama_cloud_mcp_source.retrieve("What is RALT?")
res
CallToolResult(meta=None, content=[TextContent(type='text', text='RALT stands for Retrieval-Augmented Language Model. It refers to a type of language model that incorporates a retrieval mechanism to enhance its performance on various natural language processing tasks. By retrieving relevant information from a knowledge source, such as a large text corpus or a database, the language model can better understand and generate responses to queries.\n\nHere is an example of how a Retrieval-Augmented Language Model (RALT) can be implemented using Python with the Hugging Face Transformers library:\n\n```python\nfrom transformers import RagTokenizer, RagRetriever, RagTokenForGeneration\n\n# Initialize the RAG tokenizer\ntokenizer = RagTokenizer.from_pretrained("facebook/rag-token-base")\n\n# Initialize the RAG retriever\nretriever = RagRetriever.from_pretrained("facebook/rag-token-base")\n\n# Initialize the RAG token model for generation\nmodel = RagTokenForGeneration.from_pretrained("facebook/rag-token-base", retriever=retriever)\n\n# Query input\nquery = "What is the capital of France?"\n\n# Encode the query\ninput_dict = tokenizer(query, return_tensors="pt")\n\n# Generate a response using the RALT model\noutput = model.generate(input_ids=input_dict["input_ids"], attention_mask=input_dict["attention_mask"])\n\n# Decode and print the generated response\nresponse = tokenizer.decode(output[0], skip_special_tokens=True)\nprint(response)\n```\n\nIn this code snippet, we first load the RAG tokenizer, retriever, and token model for generation. We then provide a query, encode it using the tokenizer, generate a response using the RALT model, and finally decode and print the generated response. This demonstrates how a Retrieval-Augmented Language Model can be used to answer questions by leveraging retrieved information.', annotations=None)], isError=False)
knowledge_nodes = (
llama_cloud_mcp_source.call_tool_result_to_knowledge_nodes_list(res)
)
print("Number of results returned: ", len(knowledge_nodes), "\n")
print(
"Text content of first returned node:\n",
knowledge_nodes[0].text_content[:500],
)
Number of results returned: 1 Text content of first returned node: RALT stands for Retrieval-Augmented Language Model. It refers to a type of language model that incorporates a retrieval mechanism to enhance its performance on various natural language processing tasks. By retrieving relevant information from a knowledge source, such as a large text corpus or a database, the language model can better understand and generate responses to queries. Here is an example of how a Retrieval-Augmented Language Model (RALT) can be implemented using Python with the Huggin
Create an MCP Knowledge Store¶
from fed_rag.knowledge_stores.no_encode import MCPKnowledgeStore
from sentence_transformers import CrossEncoder
Define a ReRanker¶
When MCPKnowledgeStore
retrieves knowledge from multiple MCP sources, you can provide a reranker_callback
function to rank and filter results by relevance. This optimization step ensures downstream components receive only the highest-quality, most contextually relevant information for a given query.
Below we'll use a sentence_transformers.CrossEncoder
to rerank the nodes from our two MCP sources.
def reranker_callback(
nodes: list[KnowledgeNode], query: str
) -> list[tuple[float, KnowledgeNode]]:
model = CrossEncoder("cross-encoder/ms-marco-TinyBERT-L2")
# Concatenate the query and all passages and predict the scores for the pairs [query, passage]
model_inputs = [[query, n.text_content] for n in nodes]
scores = model.predict(model_inputs)
# Sort the scores in decreasing order
results = [(score, node) for score, node in zip(scores, nodes)]
return sorted(results, key=lambda x: x[0], reverse=True)
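Before attaching it to the knowledge store, we can exercise the reranker on a couple of hand-made nodes to confirm that on-topic text scores higher. The two example nodes below are hypothetical and exist only for this check.
# hypothetical standalone check of the reranker callback
example_nodes = [
    KnowledgeNode(
        node_type="text",
        text_content="RAFT is a recipe for training LLMs to answer open-book, in-domain questions.",
    ),
    KnowledgeNode(
        node_type="text",
        text_content="The weather in Toronto is mild this time of year.",
    ),
]
for score, node in reranker_callback(example_nodes, query="What is RAFT?"):
    print(round(float(score), 4), "-", node.text_content)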
# adding the re-ranker to the knowledge store is easy to do!
knowledge_store = (
MCPKnowledgeStore()
.add_source(mcp_source)
.add_source(llama_cloud_mcp_source)
.with_reranker(reranker_callback)
)
res = await knowledge_store.retrieve("What is RAFT?", top_k=2)
res
[(np.float32(0.91412175), KnowledgeNode(node_id='a4509097-bc9c-4a85-bbbc-85b327e07bdf', embedding=None, node_type=<NodeType.TEXT: 'text'>, text_content='RAFT stands for Retrieval-Augmented Fine-Tuning, a technique used to enhance the performance of large language models on knowledge-intensive natural language processing tasks. It involves incorporating retrieved information from external sources into the training process of language models to improve their understanding and performance on complex tasks.\n\nHere is an example of how RAFT can be implemented using Python code:\n\n```python\nfrom transformers import RAFT, AutoTokenizer\n\n# Load the RAFT model and tokenizer\nmodel = RAFT.from_pretrained(\'model_name\')\ntokenizer = AutoTokenizer.from_pretrained(\'model_name\')\n\n# Define your input text\ninput_text = "Your input text here."\n\n# Tokenize the input text\ninput_ids = tokenizer(input_text, return_tensors=\'pt\')[\'input_ids\']\n\n# Retrieve relevant information using the RAFT model\nretrieved_info = model.retrieve(input_ids)\n\n# Incorporate the retrieved information into the fine-tuning process\n# Further train your model using the retrieved information to improve performance on specific tasks\n```\n\nIn summary, RAFT is a method that leverages retrieved information to fine-tune language models, enhancing their ability to tackle knowledge-intensive tasks effectively.', image_content=None, metadata={'name': 'llama-index-server', 'tool_name': 'LlamaCloudQueryTool', 'query_param_name': 'query', 'tool_call_kwargs': {}, 'server_params': {'command': 'sh', 'args': ['-c', 'cd /home/nerdai/OSS/llamacloud-mcp && poetry install && exec poetry run python mcp-server.py'], 'env': None, 'cwd': None, 'encoding': 'utf-8', 'encoding_error_handler': 'strict'}})), (np.float32(0.7772403), KnowledgeNode(node_id='7ac7451f-0307-465f-8c25-113f2e43e9f3', embedding=None, node_type=<NodeType.TEXT: 'text'>, text_content='...Fine Tuning (RAFT), a training recipe which improves the model’s ability to answer questions in "open-book" in-domain settings. In training RAFT, given a question, and a set of retrieved documents, we train the model to ignore those documents that don’t help in answering the question, which...', image_content=None, metadata={'name': 'awslabs.amazon-kendra-index-mcp-server', 'tool_name': 'KendraQueryTool', 'query_param_name': 'query', 'tool_call_kwargs': {'indexId': '572aca26-16be-44df-84d3-4d96d778f120', 'region': 'ca-central-1'}, 'server_params': {'command': 'docker', 'args': ['run', '--rm', '--interactive', '--init', '--env-file', '/home/nerdai/Projects/fed-rag/docs/notebooks/.env', 'awslabs/amazon-kendra-index-mcp-server:latest'], 'env': None, 'cwd': None, 'encoding': 'utf-8', 'encoding_error_handler': 'strict'}}))]
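The store returns (score, node) pairs already sorted by the reranker. For a more compact view of what came back and from which MCP source, we can print just the score, the source name stored in each node's metadata, and a short text preview.
# compact view of the reranked results: score, originating MCP source, text preview
for score, node in res:
    print(f"{float(score):.4f} | {node.metadata['name']} | {node.text_content[:80]}")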
Assemble a NoEncode RAG System¶
Now that we have built our MCPKnowledgeStore
, we can assemble our NoEncode RAG system. Recall that with NoEncode RAG systems, we forego the encoding step, and thus, we don't require a retriever model as we did with traditional RAG systems—all we need is a generator!
from fed_rag.generators.huggingface import HFPretrainedModelGenerator
import torch
from transformers.generation.utils import GenerationConfig
generation_cfg = GenerationConfig(
do_sample=True,
eos_token_id=151643,
bos_token_id=151643,
max_new_tokens=2048,
top_p=0.9,
temperature=0.6,
cache_implementation="offloaded",
stop_strings="</response>",
)
generator = HFPretrainedModelGenerator(
model_name="Qwen/Qwen2.5-3B",
load_model_at_init=False,
load_model_kwargs={"device_map": "auto", "torch_dtype": torch.float16},
generation_config=generation_cfg,
)
from fed_rag import AsyncNoEncodeRAGSystem, RAGConfig
rag_config = RAGConfig(top_k=3)
rag_system = AsyncNoEncodeRAGSystem(
knowledge_store=knowledge_store,
generator=generator,
rag_config=rag_config,
)
res = await rag_system.query(query="What is RAFT?")
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results. Setting `pad_token_id` to `eos_token_id`:151643 for open-end generation. The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
# final RAG response
print(res)
RAFT stands for Retrieval-Augmented Fine-Tuning, a technique used to enhance the performance of large language models on knowledge-intensive natural language processing tasks. It involves fine-tuning a language model with in-context retrieval augmentation, where relevant information is retrieved from external sources to assist the model in generating accurate responses. </response>
# a peek at the retrieved source nodes from the MCP knowledge store
for ix, sn in enumerate(res.source_nodes):
print(
f"SOURCE NODE {ix}:\nSCORE: {sn.score}\nSOURCE: {sn.metadata['name']}\nTEXT: {sn.text_content[:500]}\n\n"
)
SOURCE NODE 0: SCORE: 0.7772402763366699 SOURCE: awslabs.amazon-kendra-index-mcp-server TEXT: ...Fine Tuning (RAFT), a training recipe which improves the model’s ability to answer questions in "open-book" in-domain settings. In training RAFT, given a question, and a set of retrieved documents, we train the model to ignore those documents that don’t help in answering the question, which... SOURCE NODE 1: SCORE: 0.12683217227458954 SOURCE: awslabs.amazon-kendra-index-mcp-server TEXT: In this paper, we present Retrieval Augmented Fine Tuning (RAFT), a training recipe which improves the model’s ability to answer questions in "open-book" in-domain settings. In training RAFT, given a question, and a set of retrieved documents, we train the model to ignore those documents that don’t SOURCE NODE 2: SCORE: 0.0769425481557846 SOURCE: llama-index-server TEXT: RAFT stands for Retrieval-Augmented Fine-Tuning, a technique used to enhance the performance of large language models on knowledge-intensive natural language processing tasks. It involves fine-tuning a language model with in-context retrieval augmentation, where relevant information is retrieved from external sources to assist the model in generating accurate responses. Here is an example of how RAFT can be implemented using Python code with the Hugging Face Transformers library: ```python fro
In Summary¶
In this comprehensive notebook, we covered how to bring in context from MCP servers (or sources). More specifically, we went through:
- How to build and interact with an MCPStdioKnowledgeSource
- How to build and interact with an MCPKnowledgeStore that is connected to these sources
- How to then assemble a NoEncode RAG system that combines the MCPKnowledgeStore with a chosen generator LLM
- How to define and use a reranker callback to better prioritize the retrieved knowledge nodes from the multiple MCP sources
What's Next¶
After assembling the NoEncodeRAGSystem
, you can use it with any GeneratorTrainers
(e.g., HuggingFaceTrainerForRALT) to fine-tune the RAG system to better adapt to your MCP knowledge store.
from datasets import Dataset
from fed_rag.trainers.huggingface import HuggingFaceTrainerForRALT
# define a train dataset
train_dataset = Dataset.from_dict(
# examples from Commonsense QA
{
"query": [
"The sanctions against the school were a punishing blow, and they seemed to what the efforts the school had made to change?",
"Sammy wanted to go to where the people were. Where might he go?",
],
"response": [
"ignore",
"populated areas",
],
}
)
# use a smaller generator to avoid OOM
generator = HFPretrainedModelGenerator(
model_name="Qwen/Qwen2.5-0.5B",
load_model_at_init=False,
load_model_kwargs={"device_map": "auto", "torch_dtype": torch.float16},
generation_config=generation_cfg,
)
rag_system.generator = generator
# the trainer object
generator_trainer = HuggingFaceTrainerForRALT(
rag_system=rag_system.to_sync(), # trainers only work with sync objects
train_dataset=train_dataset,
)
train_result = generator_trainer.train()
train_result.loss
0.8815449078877767