Build a NoEncode RAG System with an MCP Knowledge Store¶
Introduction¶
In traditional RAG systems, there are three components: a retriever, a knowledge store, and a generator. A user's query is encoded by the retriever and used to retrieve relevant knowledge chunks from the knowledge store, whose contents were previously encoded by that same retriever. The user query, along with the retrieved knowledge chunks, is then passed to the LLM generator, which produces the final response to the original query.
With NoEncode RAG systems, knowledge is still kept in a knowledge store and retrieved for responses to user queries, but there is no encoding step at all. Instead of pre-computing embeddings, NoEncode RAG systems query knowledge sources directly using natural language.
Key Differences¶
Traditional RAG:
- Documents → Embed → Vector Store
- Query → Embed → Vector Search → Retrieve → Generate
NoEncode RAG:
- Knowledge Sources (MCP servers, APIs, databases)
- Query → Direct Natural Language Query → Retrieve → Generate
NOTE: Knowledge sources may be traditional RAG systems themselves, and thus, these would involve encoding. However, the main RAG system does not handle encoding of queries or knowledge chunks at all.
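To make the contrast concrete, here is a minimal sketch of the two query flows. The embed, vector_search, retrieve, and generate callables are hypothetical placeholders used purely for illustration; they are not fed-rag APIs.
from typing import Awaitable, Callable, Sequence

def traditional_rag(
    query: str,
    embed: Callable[[str], Sequence[float]],
    vector_search: Callable[[Sequence[float]], list[str]],
    generate: Callable[[str, list[str]], str],
) -> str:
    # the query (and, offline, every document) must be encoded before retrieval
    query_vec = embed(query)
    chunks = vector_search(query_vec)
    return generate(query, chunks)

async def no_encode_rag(
    query: str,
    retrieve: Callable[[str], Awaitable[list[str]]],
    generate: Callable[[str, list[str]], str],
) -> str:
    # no encoding step: the knowledge source answers a natural-language query directly
    chunks = await retrieve(query)
    return generate(query, chunks)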
Model Context Protocol (MCP)¶
MCP provides a standardized way for AI systems to connect to external tools and data sources. In our NoEncode RAG system, MCP servers act as live knowledge sources that can be queried directly with natural language. An MCP knowledge store acts as the MCP client host that creates connections to these servers and retrieves context from them.
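For intuition, the sketch below shows the raw protocol flow that such an MCP client host wraps: open a stdio session to a server and call one of its tools with a natural-language query. The server command (my-mcp-server) and tool name (SomeQueryTool) are placeholders rather than real servers or tools; in the sections that follow, fed-rag's MCP knowledge source handles these steps for us.
from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

async def query_mcp_server(query: str) -> None:
    # placeholder stdio command; swap in a real MCP server binary or container
    params = StdioServerParameters(command="my-mcp-server", args=[])
    async with stdio_client(params) as (read_stream, write_stream):
        async with ClientSession(read_stream, write_stream) as session:
            await session.initialize()
            tools = await session.list_tools()
            print("Available tools:", [t.name for t in tools.tools])
            # query a tool directly with natural language
            result = await session.call_tool("SomeQueryTool", arguments={"query": query})
            print(result.content)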
Outline¶
In this cookbook, we will stand up two MCP knowledge sources, use them as part of an MCP knowledge store, and finally build an AsyncNoEncodeRAGSystem
that allows us to query these sources.
- MCP Knowledge Source 1: an AWS Kendra Index MCP Server
- MCP Knowledge Source 2: a LlamaCloud MCP Server
- Create an MCP Knowledge Store (using our two built sources)
- Assemble a NoEncode RAG System
MCP Knowledge Source 1: an AWS Kendra Index MCP Server¶
Here, we make use of one of the many officially supported AWS MCP servers offered by AWS Labs, namely their AWS Kendra Index MCP Server.
AWS Kendra is an enterprise search service powered by machine learning. It can search across various data sources including documents, FAQs, knowledge bases, and websites, providing intelligent answers to natural language queries.
Pre-requisite Steps¶
Create a Kendra Index¶
To use this MCP server, you need to create a new Kendra Index. Add an S3 data connector to it that contains the RAFT paper, and make sure to sync your index so that it's ready to be queried with the RAFT paper. Finally, fill in the information below regarding your Kendra Index:
# info regarding your kendra index which needs to be passed to the MCP tool call
kendra_index_info = {
    "indexId": "572aca26-16be-44df-84d3-4d96d778f120",
    "region": "ca-central-1",
}
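Optionally, you can sanity-check that the index is synced and queryable before standing up the MCP server by querying Kendra directly with boto3. This step is not part of the MCP flow and assumes boto3 is installed and your AWS credentials are configured.
import boto3

# query the Kendra index directly to confirm it returns results for the RAFT paper
kendra_client = boto3.client("kendra", region_name=kendra_index_info["region"])
response = kendra_client.query(
    IndexId=kendra_index_info["indexId"],
    QueryText="What is RAFT?",
)
for item in response["ResultItems"][:3]:
    excerpt = item.get("DocumentExcerpt", {}).get("Text", "")
    print(item["Type"], "-", excerpt[:120])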
Build the Kendra Index MCP Server Docker image¶
With our Kendra index in hand, we can now build a local MCP server that interacts with it. To do this, we take the following steps:
- Clone the awslabs/mcp GitHub repo:
git clone https://github.com/awslabs/mcp.git
- cd into the Kendra index src directory:
cd mcp/src/amazon-kendra-index-mcp-server
- Locally build the Docker image:
docker build -t awslabs/amazon-kendra-index-mcp-server .
Configure AWS Credentials¶
Create a .env
file in the same directory as this notebook, with your AWS credentials:
# .env file
AWS_ACCESS_KEY_ID=...
AWS_SECRET_ACCESS_KEY=...
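Optionally, verify that the credentials file is readable before handing it to Docker. This check is just a convenience and assumes python-dotenv is installed.
from dotenv import dotenv_values

# confirm both AWS keys are present in the .env file
creds = dotenv_values(".env")
assert "AWS_ACCESS_KEY_ID" in creds and "AWS_SECRET_ACCESS_KEY" in creds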
Build the MCP Stdio Knowledge Source¶
import os
from mcp import StdioServerParameters
from fed_rag.knowledge_stores.no_encode import MCPStdioKnowledgeSource
server_params = StdioServerParameters(
    command="docker",
    args=[
        "run",
        "--rm",
        "--interactive",
        "--init",  # important to have in Jupyter Notebook
        "--env-file",
        f"{os.getcwd()}/.env",
        "awslabs/amazon-kendra-index-mcp-server:latest",
    ],
)
mcp_source = MCPStdioKnowledgeSource(
    name="awslabs.amazon-kendra-index-mcp-server",
    server_params=server_params,
    tool_name="KendraQueryTool",
    query_param_name="query",
    tool_call_kwargs=kendra_index_info,
)
Let's test out our new MCP source by invoking the retrieve()
method with a specific query. This will return an ~mcp.CallToolResult
object.
call_tool_result = await mcp_source.retrieve("What is RAFT?")
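Before converting anything, it can be helpful to peek at the structure of the result: a CallToolResult carries a list of content blocks along with an isError flag.
# quick inspection of the raw tool result
print("Is error: ", call_tool_result.isError)
print("Number of content blocks: ", len(call_tool_result.content))
print("Type of first content block: ", call_tool_result.content[0].type)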
Converting MCP Call Tool Results to Knowledge Nodes¶
MCP tool results are automatically converted to KnowledgeNode
objects using a default converter in MCPStdioKnowledgeSource
. This generic converter works for basic use cases but may not extract all valuable information from server-specific responses. Implement a custom converter to optimize knowledge extraction for your particular MCP server. Let's see the default converter in action first and determine if we need to create our own converter function.
# using the default converter function
knowledge_nodes = mcp_source.call_tool_result_to_knowledge_nodes_list(
call_tool_result
)
print("Number of knowledge nodes created: ", len(knowledge_nodes), "\n")
print(
"Text content of first created node:\n",
knowledge_nodes[0].text_content[:500],
)
Number of knowledge nodes created: 1 Text content of first created node: {"query": "What is RAFT?", "total_results_count": 4, "results": [{"id": "fa859564-abe5-4a34-8624-1e0b2ae59b41-62e4b0af-d3dd-426b-9cbc-b141855e8397", "type": "ANSWER", "document_title": "raft.pdf", "document_uri": "https://fed-rag-mcp-cookbook.s3.ca-central-1.amazonaws.com/raft.pdf", "score": "HIGH", "excerpt": "In this paper, we present Retrieval Augmented\nFine Tuning (RAFT), a training recipe which improves the model\u2019s ability\nto answer questions in \"open-book\" in-domain settings. In t
As we can see, this is not the most ideal conversion. We should probably pass only the excerpt
text content to the LLM generator. Thus, we should define our own converter function that extracts just that text.
According to the source code for this server, we see that a successful tool call of KendraQueryTool
will return a CallToolResult
whose text
attribute is a JSON string containing a results
key. The value for results
is a list of result_items
each containing an excerpt
field, which is ultimately what we want to pass to the LLM generator.
Let's create a custom converter function to do this now.
import json
import re
from typing import Any
from mcp.types import CallToolResult
from fed_rag.data_structures import KnowledgeNode
# signature of a converter function
def kendra_index_converter_fn(
    result: CallToolResult, metadata: dict[str, Any] | None = None
) -> list[KnowledgeNode]:
    nodes = []
    for c in result.content:
        if c.type == "text":  # only use ~mcp.TextContent
            data = json.loads(c.text)
            for res in data["results"]:
                # take only the content in the "excerpt" key
                text_content = re.sub(r"\s+", " ", res["excerpt"].strip())
                nodes.append(
                    # create the knowledge node
                    KnowledgeNode(
                        node_type="text",
                        text_content=text_content,
                        metadata=metadata,
                    )
                )
    return nodes
Let's test out our custom converter as a standalone function on the previously obtained call_tool_result
.
knowledge_nodes = kendra_index_converter_fn(call_tool_result)
print("Number of knowledge nodes created: ", len(knowledge_nodes), "\n")
print(
"Text content of first created node:\n",
knowledge_nodes[0].text_content[:500],
"\n",
)
print(
"Text content of second created node:\n",
knowledge_nodes[1].text_content[:500],
)
Number of knowledge nodes created: 4 Text content of first created node: In this paper, we present Retrieval Augmented Fine Tuning (RAFT), a training recipe which improves the model’s ability to answer questions in "open-book" in-domain settings. In training RAFT, given a question, and a set of retrieved documents, we train the model to ignore those documents that don’t Text content of second created node: 3 RAFT In this section, we present RAFT, a novel way of training LLMs for domain-specific open- book exams. We first introduce the classical technique of supervised fine-tuning, followed with the key takeaways from our experiments. Then, we introduce RAFT , a modified version of general instructio
This is much improved and should work better when passed down as context to the LLM generator. We can easily update our mcp_source
to use this converter function.
# update the converter function
mcp_source = mcp_source.with_converter(kendra_index_converter_fn)
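As a quick check, re-running the same query through the updated source should now yield excerpt-only knowledge nodes (one per Kendra result item):
# re-run retrieval and conversion with the custom converter in place
call_tool_result = await mcp_source.retrieve("What is RAFT?")
knowledge_nodes = mcp_source.call_tool_result_to_knowledge_nodes_list(call_tool_result)
print(knowledge_nodes[0].text_content[:200])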
MCP Knowledge Source 2: a LlamaCloud MCP Server¶
In this part of the cookbook, we'll stand up an MCP server using LlamaCloud—an enterprise solution by LlamaIndex—by following their MCP demo.
LlamaCloud provides document parsing, indexing, and retrieval capabilities. By exposing these through an MCP server, we can query processed documents directly using natural language without managing our own document processing pipeline.
Pre-requisite Steps¶
The steps below follow from the setup instructions listed in the GitHub repo for the llamacloud-mcp demo.
This requires a LlamaCloud account. If you don't have one, you can create one by visiting https://cloud.llamaindex.ai/.
Create a LlamaCloud Index¶
Log in to LlamaCloud with your account and navigate to "Tool" > "Index" in the left sidebar. Click the "Create Index" button to create a new index. After creating the new index, upload the RA-DIT paper.
NOTE: You'll need to supply information on your new index in the next step.
Create a local MCP server to connect with the LlamaCloud Index¶
- Clone the llamacloud-mcp demo GitHub repo:
git clone https://github.com/run-llama/llamacloud-mcp.git
- cd into the llamacloud-mcp directory:
cd llamacloud-mcp
- Update mcp-server.py with your LlamaCloud index details:
from dotenv import load_dotenv
from mcp.server.fastmcp import FastMCP
from llama_index.indices.managed.llama_cloud import LlamaCloudIndex
import os

load_dotenv()

mcp = FastMCP('llama-index-server')


@mcp.tool(name='LlamaCloudQueryTool')
def llama_index_query(query: str) -> str:
    """Search the llama-index documentation for the given query."""
    index = LlamaCloudIndex(
        name="<your-llamacloud-index-name>",
        project_name="Default",  # change this if you didn't use the default project name
        organization_id="<your-llamacloud-org-id>",
        api_key=os.getenv("LLAMA_CLOUD_API_KEY"),
    )
    response = index.as_query_engine().query(query + " Be verbose and include code examples.")
    return str(response)


if __name__ == "__main__":
    mcp.run(transport="stdio")
- Create a .env file in the llamacloud-mcp directory:
# .env
LLAMA_CLOUD_API_KEY=<your-llamacloud-api-key>
OPENAI_API_KEY=<your-openai-api-key>
NOTE: This llamacloud-mcp demo builds a ~llama_index.QueryEngine()
which by default uses an OpenAI LLM.
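Optionally, you can smoke-test the index outside of MCP using the same LlamaCloudIndex call as in mcp-server.py above. This assumes your LLAMA_CLOUD_API_KEY and OPENAI_API_KEY are set and that you fill in the placeholders with your own index details.
import os
from llama_index.indices.managed.llama_cloud import LlamaCloudIndex

# same index configuration as in mcp-server.py, used here as a direct query check
index = LlamaCloudIndex(
    name="<your-llamacloud-index-name>",
    project_name="Default",
    organization_id="<your-llamacloud-org-id>",
    api_key=os.getenv("LLAMA_CLOUD_API_KEY"),
)
print(index.as_query_engine().query("What is RA-DIT?"))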
Build the MCP Stdio Knowledge Source¶
# change this to your actual path
path_to_llamacloud_mcp = "/home/nerdai/OSS"
llama_cloud_server_params = StdioServerParameters(
command="sh",
args=[
"-c",
f"cd {path_to_llamacloud_mcp}/llamacloud-mcp && poetry install && exec poetry run python mcp-server.py",
],
)
llama_cloud_mcp_source = MCPStdioKnowledgeSource(
name="llama-index-server",
server_params=llama_cloud_server_params,
tool_name="LlamaCloudQueryTool",
query_param_name="query",
)
res = await llama_cloud_mcp_source.retrieve("What is RALT?")
res
CallToolResult(meta=None, content=[TextContent(type='text', text='RALT stands for Retrieval-Augmented Language Model. It is a model that combines traditional language models with a retrieval mechanism to enhance performance on various natural language processing tasks. By incorporating a retrieval component, RALT models can access external knowledge sources to improve their understanding and generation of text.\n\nHere is an example of how a RALT model can be implemented using Python and the Hugging Face Transformers library:\n\n```python\nfrom transformers import RagTokenizer, RagRetriever, RagTokenForGeneration\n\n# Initialize the RAG tokenizer\ntokenizer = RagTokenizer.from_pretrained("facebook/rag-token-base")\n\n# Initialize the retriever component\nretriever = RagRetriever.from_pretrained("facebook/rag-token-base")\n\n# Initialize the generator component\ngenerator = RagTokenForGeneration.from_pretrained("facebook/rag-token-base")\n\n# Encode the input text\ninput_text = "Question: Who is the president of the United States?"\ninput_ids = tokenizer(input_text, return_tensors="pt").input_ids\n\n# Retrieve relevant passages\nretrieved_docs = retriever(input_ids)\n\n# Generate an answer based on the retrieved passages\noutput = generator(input_ids, retrieved_docs=retrieved_docs)\n\n# Decode and print the generated answer\ngenerated_text = tokenizer.decode(output["generated_tokens"][0], skip_special_tokens=True)\nprint(generated_text)\n```\n\nIn this code snippet, the RALT model is used to answer a question by retrieving relevant passages and generating a response based on the retrieved information. The model combines the power of language modeling with the ability to retrieve external knowledge, making it effective for a wide range of NLP tasks.', annotations=None)], isError=False)
knowledge_nodes = (
llama_cloud_mcp_source.call_tool_result_to_knowledge_nodes_list(res)
)
print("Number of results returned: ", len(knowledge_nodes), "\n")
print(
"Text content of first returned node:\n",
knowledge_nodes[0].text_content[:500],
)
Number of results returned: 1 Text content of first returned node: RALT stands for Retrieval-Augmented Language Model. It is a model that combines traditional language models with a retrieval mechanism to enhance performance on various natural language processing tasks. By incorporating a retrieval component, RALT models can access external knowledge sources to improve their understanding and generation of text. Here is an example of how a RALT model can be implemented using Python and the Hugging Face Transformers library: ```python from transformers import
Create an MCP Knowledge Store¶
from fed_rag.knowledge_stores.no_encode import MCPKnowledgeStore
from sentence_transformers import CrossEncoder
Define a ReRanker¶
When MCPKnowledgeStore
retrieves knowledge from multiple MCP sources, you can provide a reranker_callback
function to rank and filter results by relevance. This optimization step ensures downstream components receive only the highest-quality, most contextually relevant information for a given query.
Below, we'll use a sentence_transformers.CrossEncoder
to rerank the nodes from our two MCP sources.
def reranker_callback(
    nodes: list[KnowledgeNode], query: str
) -> list[tuple[float, KnowledgeNode]]:
    model = CrossEncoder("cross-encoder/ms-marco-TinyBERT-L2")
    # pair the query with each passage and predict scores for the [query, passage] pairs
    model_inputs = [[query, n.text_content] for n in nodes]
    scores = model.predict(model_inputs)
    # return (score, node) pairs sorted by score in decreasing order
    results = [(score, node) for score, node in zip(scores, nodes)]
    return sorted(results, key=lambda x: x[0], reverse=True)
# adding the re-ranker to the knowledge store is easy to do!
knowledge_store = (
MCPKnowledgeStore()
.add_source(mcp_source)
.add_source(llama_cloud_mcp_source)
.with_reranker(reranker_callback)
)
res = await knowledge_store.retrieve("What is RAFT?", top_k=2)
res
[(0.7772398, KnowledgeNode(node_id='32cdd16d-4a71-450b-bbc5-46f6220b0e82', embedding=None, node_type=<NodeType.TEXT: 'text'>, text_content='...Fine Tuning (RAFT), a training recipe which improves the model’s ability to answer questions in "open-book" in-domain settings. In training RAFT, given a question, and a set of retrieved documents, we train the model to ignore those documents that don’t help in answering the question, which...', image_content=None, metadata={'name': 'awslabs.amazon-kendra-index-mcp-server', 'tool_name': 'KendraQueryTool', 'query_param_name': 'query', 'tool_call_kwargs': {'indexId': '572aca26-16be-44df-84d3-4d96d778f120', 'region': 'ca-central-1'}, 'server_params': {'command': 'docker', 'args': ['run', '--rm', '--interactive', '--init', '--env-file', '/home/nerdai/Projects/fed-rag/docs/notebooks/.env', 'awslabs/amazon-kendra-index-mcp-server:latest'], 'env': None, 'cwd': None, 'encoding': 'utf-8', 'encoding_error_handler': 'strict'}})), (0.16266058, KnowledgeNode(node_id='ff999d42-3c8a-4be0-a65b-998dec7ad728', embedding=None, node_type=<NodeType.TEXT: 'text'>, text_content='RAFT stands for Retrieval-Augmented Fine-Tuning. It is a technique used in natural language processing that involves combining retrieval-based methods with fine-tuning of large language models. This approach aims to enhance the performance of language models on knowledge-intensive tasks by leveraging external knowledge sources through retrieval mechanisms.\n\nIn RAFT, a large language model is fine-tuned on a specific task while also incorporating information retrieved from external sources. By integrating retrieved passages into the training process, the model can access a broader range of information beyond what is present in the training data, leading to improved performance on tasks that require external knowledge.\n\nHere is a simplified example of how RAFT can be implemented using a Python code snippet:\n\n```python\n# Pseudo-code example of RAFT implementation\nfrom transformers import T5ForConditionalGeneration, T5Tokenizer\n\n# Load the large language model\nmodel = T5ForConditionalGeneration.from_pretrained(\'t5-large\')\ntokenizer = T5Tokenizer.from_pretrained(\'t5-large\')\n\n# Fine-tune the model with retrieval-augmentation\n# Incorporate retrieved passages into the training data\n\n# Use the fine-tuned model for inference\ninput_text = "Question: What is the capital of France?"\nretrieved_passage = retrieve_passage("France capital information")\ninput_text_with_retrieval = f"Background: {retrieved_passage} {input_text}"\n\ninput_ids = tokenizer(input_text_with_retrieval, return_tensors=\'pt\').input_ids\noutputs = model.generate(input_ids)\n\n# Decode the model output\ndecoded_output = tokenizer.decode(outputs[0], skip_special_tokens=True)\nprint(decoded_output)\n```\n\nIn this example, the RAFT process involves fine-tuning a T5 model with retrieval-augmentation and then using the model to generate answers to questions by incorporating retrieved passages into the input text. This allows the model to benefit from external knowledge during inference, leading to more accurate and informed responses.', image_content=None, metadata={'name': 'llama-index-server', 'tool_name': 'LlamaCloudQueryTool', 'query_param_name': 'query', 'tool_call_kwargs': {}, 'server_params': {'command': 'sh', 'args': ['-c', 'cd /home/nerdai/OSS/llamacloud-mcp && poetry install && exec poetry run python mcp-server.py'], 'env': None, 'cwd': None, 'encoding': 'utf-8', 'encoding_error_handler': 'strict'}}))]
Assemble a NoEncode RAG System¶
Now that we have built our MCPKnowledgeStore
, we can assemble our NoEncode RAG system. Recall that with NoEncode RAG systems, we forego the encoding step, and thus, we don't require a retriever model as we did with traditional RAG systems—all we need is a generator!
from fed_rag.generators.huggingface import HFPretrainedModelGenerator
import torch
from transformers.generation.utils import GenerationConfig
generation_cfg = GenerationConfig(
do_sample=True,
eos_token_id=151643,
bos_token_id=151643,
max_new_tokens=2048,
top_p=0.9,
temperature=0.6,
cache_implementation="offloaded",
stop_strings="</response>",
)
generator = HFPretrainedModelGenerator(
model_name="Qwen/Qwen2.5-3B",
load_model_at_init=False,
load_model_kwargs={"device_map": "auto", "torch_dtype": torch.float16},
generation_config=generation_cfg,
)
from fed_rag import AsyncNoEncodeRAGSystem, RAGConfig
rag_config = RAGConfig(top_k=3)
rag_system = AsyncNoEncodeRAGSystem(
knowledge_store=knowledge_store,
generator=generator,
rag_config=rag_config,
)
res = await rag_system.query(query="What is RAFT?")
Loading checkpoint shards: 0%| | 0/2 [00:00<?, ?it/s]
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results. Setting `pad_token_id` to `eos_token_id`:151643 for open-end generation. The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
# final RAG response
print(res)
RAFT stands for Retrieval-Augmented Fine-Tuning. It is a technique used in natural language processing that combines the power of retrieval-based methods with fine-tuning language models. This approach involves augmenting language models with retrieved information from external sources to enhance their performance on various NLP tasks. </response>
# a peek at the retrieved source nodes from the MCP knowledge store
for ix, sn in enumerate(res.source_nodes):
    print(
        f"SOURCE NODE {ix}:\nSCORE: {sn.score}\nSOURCE: {sn.metadata['name']}\nTEXT: {sn.text_content[:500]}\n\n"
    )
SOURCE NODE 0: SCORE: 0.7772397994995117 SOURCE: awslabs.amazon-kendra-index-mcp-server TEXT: ...Fine Tuning (RAFT), a training recipe which improves the model’s ability to answer questions in "open-book" in-domain settings. In training RAFT, given a question, and a set of retrieved documents, we train the model to ignore those documents that don’t help in answering the question, which... SOURCE NODE 1: SCORE: 0.1268322467803955 SOURCE: awslabs.amazon-kendra-index-mcp-server TEXT: In this paper, we present Retrieval Augmented Fine Tuning (RAFT), a training recipe which improves the model’s ability to answer questions in "open-book" in-domain settings. In training RAFT, given a question, and a set of retrieved documents, we train the model to ignore those documents that don’t SOURCE NODE 2: SCORE: 0.052663445472717285 SOURCE: llama-index-server TEXT: RAFT stands for Retrieval-Augmented Fine-Tuning. It is a technique used in natural language processing that combines the power of retrieval-based methods with fine-tuning language models. This approach involves augmenting language models with retrieved information from external sources to enhance their performance on various NLP tasks. In RAFT, the language model is fine-tuned on a specific task while also incorporating information retrieved from external knowledge sources. This retrieved infor
In Summary¶
In this notebook, we covered how to bring context from MCP servers (or sources) into a RAG system. More specifically, we went through:
- How to build and interact with an MCPStdioKnowledgeSource
- How to build and interact with an MCPKnowledgeStore that is connected to these sources
- How to then assemble a NoEncode RAG system that combines the MCPKnowledgeStore with a chosen generator LLM
- How to define and use a reranker callback to better prioritize the knowledge nodes retrieved from the multiple MCP sources
What's Next¶
After assembling the NoEncodeRAGSystem
, you can use it with any of the GeneratorTrainers
(e.g., HuggingFaceTrainerForRALT) to fine-tune the RAG system so that it better adapts to your MCP knowledge store.
from datasets import Dataset
from fed_rag.trainers.huggingface import HuggingFaceTrainerForRALT
# define a train dataset
train_dataset = Dataset.from_dict(
    # examples from Commonsense QA
    {
        "query": [
            "The sanctions against the school were a punishing blow, and they seemed to what the efforts the school had made to change?",
            "Sammy wanted to go to where the people were. Where might he go?",
            "To locate a choker not located in a jewelry box or boutique where would you go?",
            "Google Maps and other highway and street GPS services have replaced what?",
            "The fox walked from the city into the forest, what was it looking for?",
        ],
        "response": [
            "ignore",
            "populated areas",
            "jewelry store",
            "atlas",
            "natural habitat",
        ],
    }
)
# the trainer object
generator_trainer = HuggingFaceTrainerForRALT(
rag_system=rag_system.to_sync(), # trainers only work with sync objects
train_dataset=train_dataset,
)
train_result = generator_trainer.train()