Let’s delve into the step-by-step implementation of CMRAG.
Multimodal Parsing
The following libraries must be installed to run the code discussed in this article.
!pip install llama-index ipython cohere rank-bm25 pydantic nest-asyncio python-dotenv openai llama-parse
All libraries that must be imported to run the full code are listed in the GitHub notebook. For this article, I used Key Figures on Immigration in Finland (licensed under CC BY 4.0, re-use allowed), which contains several graphs, images, and text data.
LlamaParse offers multimodal parsing using a vendor multimodal model (such as gpt-4o) to handle document extraction.
parser = LlamaParse(
    use_vendor_multimodal_model=True,
    vendor_multimodal_model_name="openai-gpt-4o",
    vendor_multimodal_api_key="sk-proj-xxxxxx",
)
In this mode, a screenshot of each page of a document is taken and sent to the multimodal model with instructions to extract its content as markdown. The markdown result of each page is consolidated into the final output.
The latest LlamaParse Premium mode offers advanced multimodal document parsing, extracting text, tables, and images into well-structured markdown while significantly reducing missing content and hallucinations. It can be used by creating a free account at Llama Cloud Platform and obtaining an API key. The free plan allows parsing 1,000 pages per day.
LlamaParse premium mode is used as follows:
from typing import List
from llama_parse import LlamaParse
import os

# Function to read all files from a specified directory
def read_docs(data_dir) -> List[str]:
    files = []
    for f in os.listdir(data_dir):
        fname = os.path.join(data_dir, f)
        if os.path.isfile(fname):
            files.append(fname)
    return files

parser = LlamaParse(
    result_type="markdown",
    premium_mode=True,
    api_key=os.getenv("LLAMA_CLOUD_API_KEY")
)

files = read_docs(data_dir=DATA_DIR)
We start by reading a document from a specified directory, parse the document using the parser’s get_json_result() method, and get image dictionaries using the parser’s get_images() method. Subsequently, the nodes are extracted and sent to the LLM to assign context based on the overall document using the retrieve_nodes() function. Parsing this document (60 pages), including getting image dictionaries, took 5 minutes and 34 seconds (a one-time process).
print("Parsing...")
json_results = parser.get_json_result(files)
print("Getting image dictionaries...")
images = parser.get_images(json_results, download_path=image_dir)
print("Retrieving nodes...")
# Inspect the parsed output of the fourth page
json_results[0]["pages"][3]
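Before moving on, it can help to check what the parser returned. The following is a minimal sketch, assuming only the keys that the article's own code reads from the result ("file_path", "pages", and each page's "md"). The helper name summarize_pages and the sample data are hypothetical stand-ins, not real LlamaParse output.

```python
from typing import Any, Dict, List


def summarize_pages(json_results: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Return one summary row per parsed page: file name, page number, markdown length."""
    rows = []
    for result in json_results:
        file_name = str(result.get("file_path", "")).split("/")[-1]
        for idx, page in enumerate(result.get("pages", [])):
            rows.append(
                {
                    "file": file_name,
                    "page_num": idx + 1,  # 1-based, matching the node metadata used later
                    "md_chars": len(page.get("md", "")),
                }
            )
    return rows


# Hypothetical stand-in for the parser output (not real LlamaParse data)
sample_results = [
    {
        "file_path": "data/report.pdf",
        "pages": [{"md": "# Title"}, {"md": "Some table text"}],
    }
]

for row in summarize_pages(sample_results):
    print(row)
```

A page with a very short markdown string is often a sign that the page is mostly an image, which is where the screenshots become important.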
Contextual Retrieval
Individual nodes and the associated images (screenshots) are extracted from the parsed json_results by the retrieve_nodes() function. Each node is sent to the _assign_context() function along with the text of all the nodes (the doc variable in the code below). The _assign_context() function uses a prompt template, CONTEXT_PROMPT_TMPL (adopted and modified from this source), to add a concise context to each node. This way, we integrate metadata, markdown text, context, and raw text into the node.
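The idea of contextual chunking can be tried offline before spending any API credits. The sketch below is a simplified illustration, not the article's implementation: TEMPLATE is a shortened, hypothetical version of the prompt, assign_context is a hypothetical helper, and fake_complete stands in for a real LLM call so that the code runs without a key.

```python
from typing import Callable, List

# Shortened, hypothetical template; the article's CONTEXT_PROMPT_TMPL is longer
TEMPLATE = (
    "<doc>\n{doc}\n</doc>\n"
    "<chunk>\n{chunk}\n</chunk>\n"
    "Give a succinct context for this chunk.\nContext:"
)


def assign_context(doc: str, chunk: str, complete: Callable[[str], str]) -> str:
    """Format the prompt, call the LLM, and prepend the returned context to the chunk."""
    prompt = TEMPLATE.format(doc=doc, chunk=chunk)
    context = complete(prompt).strip()
    return f"{context}\n\n{chunk}"


def fake_complete(prompt: str) -> str:
    """Stand-in for an LLM call; returns a fixed string so the sketch runs offline."""
    return "  Stub context.  "


pages: List[str] = ["Page one text", "Page two text"]
full_doc = "\n\n".join(pages)  # every chunk is situated against the whole document
contextualized = [assign_context(full_doc, page, fake_complete) for page in pages]
print(contextualized[0])
```

Note that the whole document is sent with every chunk, so the number of input tokens grows with both the document length and the number of pages. This is why prompt caching matters for this step.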
The following code shows the implementation of the retrieve_nodes() function. The two helper functions, _get_sorted_image_files() and get_img_page_number(), get the image files sorted by page and the page number of an image, respectively. The overall aim is not to rely solely on the raw text to generate the final answer, as simple RAGs do, but to consider metadata, markdown text, context, and raw text, as well as the full images (screenshots) of the retrieved nodes (image links in the node’s metadata).
import re
from pathlib import Path

# Function to get the page number of an image using a regex on the file name
def get_img_page_number(file_name):
    match = re.search(r"-page-(\d+)\.jpg$", str(file_name))
    if match:
        return int(match.group(1))
    return 0

# Function to get image files sorted by page
def _get_sorted_image_files(image_dir):
    raw_files = [f for f in list(Path(image_dir).iterdir()) if f.is_file()]
    sorted_files = sorted(raw_files, key=get_img_page_number)
    return sorted_files
# Context prompt template for contextual chunking
CONTEXT_PROMPT_TMPL = """
You are an AI assistant specializing in document analysis. Your task is to provide brief, relevant context for a chunk of text from the given document.
Here is the document:
<doc>
{doc}
</doc>
Here is the chunk we want to situate within the whole document:
<chunk>
{chunk}
</chunk>
Provide a concise context (2-3 sentences) for this chunk, considering the following guidelines:
1. Identify the main topic or concept discussed in the chunk.
2. Mention any relevant information or comparisons from the broader document context.
3. If applicable, note how this information relates to the overall theme or purpose of the document.
4. Include any key figures, dates, or percentages that provide important context.
5. Do not use phrases like "This chunk discusses" or "This section provides". Instead, directly state the context.
Please give a short succinct context to situate this chunk within the overall document to improve search retrieval of the chunk.
Answer only with the succinct context and nothing else.
Context:
"""
CONTEXT_PROMPT = PromptTemplate(CONTEXT_PROMPT_TMPL)
# Function to generate context for each chunk
def _assign_context(doc: str, chunk: str, llm) -> str:
    prompt = CONTEXT_PROMPT.format(doc=doc, chunk=chunk)
    response = llm.complete(prompt)
    context = response.text.strip()
    return context
# Function to create text nodes with context
def retrieve_nodes(json_results, image_dir, llm) -> List[TextNode]:
    nodes = []
    for result in json_results:
        json_dicts = result["pages"]
        document_name = result["file_path"].split('/')[-1]
        docs = [doc["md"] for doc in json_dicts]  # Extract text
        image_files = _get_sorted_image_files(image_dir)  # Extract images
        # Join all docs to create the full document text
        document_text = "\n\n".join(docs)
        for idx, doc in enumerate(docs):
            # Generate context for each chunk (page)
            context = _assign_context(document_text, doc, llm)
            # Combine context with the original chunk
            contextualized_content = f"{context}\n\n{doc}"
            # Create the text node with the contextualized content
            chunk_metadata = {"page_num": idx + 1}
            chunk_metadata["image_path"] = str(image_files[idx])
            chunk_metadata["parsed_text_markdown"] = docs[idx]
            node = TextNode(
                text=contextualized_content,
                metadata=chunk_metadata,
            )
            nodes.append(node)
    return nodes
# Get text nodes
text_node_with_context = retrieve_nodes(json_results, image_dir, llm)
First page of the report (image by author)
Here is the depiction of a node corresponding to the first page of the report.
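As a rough illustration of what each node carries, the sketch below rebuilds the page-to-image pairing with plain dictionaries instead of TextNode objects. The helper names page_number and build_node_dicts and the file names are hypothetical. The only thing assumed about the real files is the "-page-N.jpg" suffix that the article's regex targets.

```python
import re
from pathlib import Path
from typing import Dict, List


def page_number(file_name) -> int:
    """Extract N from names ending in '-page-N.jpg'; return 0 when there is no match."""
    match = re.search(r"-page-(\d+)\.jpg$", str(file_name))
    return int(match.group(1)) if match else 0


def build_node_dicts(pages_md: List[str], image_files: List[Path]) -> List[Dict]:
    """Pair each page's markdown with the image at the same position after sorting."""
    sorted_images = sorted(image_files, key=page_number)
    nodes = []
    for idx, md in enumerate(pages_md):
        nodes.append(
            {
                "text": md,
                "metadata": {
                    "page_num": idx + 1,
                    "image_path": str(sorted_images[idx]),
                    "parsed_text_markdown": md,
                },
            }
        )
    return nodes


# Hypothetical file names; a plain string sort would put page-10 before page-2
images = [Path("img-page-10.jpg"), Path("img-page-2.jpg"), Path("img-page-1.jpg")]
nodes = build_node_dicts(["first", "second", "tenth"], images)
print([n["metadata"]["image_path"] for n in nodes])
```

The pairing is purely positional, so it relies on the image folder holding exactly one screenshot per page of a single document. With several documents in one folder, the images would need to be filtered by document name first.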
Enhancing Contextual Retrieval with BM25 and Re-ranking
All the nodes with metadata, raw text, markdown text, and context information are then indexed into a vector database. A BM25 index is built over the nodes, and the tokenized documents behind it are saved in a pickle file for query inference. The processed nodes are also saved for later use (text_node_with_context.pkl).
import pickle
from rank_bm25 import BM25Okapi

# Create the vector store index
index = VectorStoreIndex(text_node_with_context, embed_model=embed_model)
index.storage_context.persist(persist_dir=output_dir)

# Build BM25 index
documents = [node.text for node in text_node_with_context]
tokenized_documents = [doc.split() for doc in documents]
bm25 = BM25Okapi(tokenized_documents)

# Save the tokenized documents (used to rebuild BM25) and text_node_with_context
with open(os.path.join(output_dir, 'tokenized_documents.pkl'), 'wb') as f:
    pickle.dump(tokenized_documents, f)
with open(os.path.join(output_dir, 'text_node_with_context.pkl'), 'wb') as f:
    pickle.dump(text_node_with_context, f)
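To make the keyword side of the retrieval less of a black box, here is a minimal pure-Python sketch of Okapi BM25 scoring. It is a textbook variant (with a +1 inside the logarithm so the IDF stays positive), not the exact formula used by rank_bm25, and the function name bm25_scores, the parameter defaults, and the toy corpus are assumptions made for illustration.

```python
import math
from collections import Counter
from typing import List


def bm25_scores(
    query: List[str], corpus: List[List[str]], k1: float = 1.5, b: float = 0.75
) -> List[float]:
    """Score every tokenized document in `corpus` against the tokenized `query`."""
    n_docs = len(corpus)
    avgdl = sum(len(doc) for doc in corpus) / n_docs
    # Document frequency: number of documents that contain each term
    df = Counter(term for doc in corpus for term in set(doc))
    scores = []
    for doc in corpus:
        tf = Counter(doc)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            # Length normalization: longer-than-average documents are penalized
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


corpus = [
    "residence permits issued in 2023".split(),
    "population by language group".split(),
    "first residence permits by country".split(),
]
scores = bm25_scores("residence permits".split(), corpus)
top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:2]
print(top, scores)
```

Because tokenization here is a plain whitespace split, as in the article's code, "Permits" and "permits" count as different terms. Lowercasing both the documents and the query is a cheap improvement worth considering.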
We can now initialize a query engine to run queries through the following pipeline. Before that, the following prompt is set to guide the behavior of the LLM when generating the final response. A multimodal LLM (gpt-4o-mini) is initialized to generate the final response. This prompt can be adjusted as needed.
# Define the QA prompt template
RAG_PROMPT = """
Below we give parsed text from documents in two different formats, as well as the image.
---------------------
{context_str}
---------------------
Given the context information and not prior knowledge, answer the query. Generate the answer by analyzing parsed markdown, raw text and the related
image. Especially, carefully analyze the images to look for the required information.
Format the answer in proper format as deemed suitable (bulleted lists, sections/sub-sections, tables, etc.)
Give the page's number and the document name where you find the response based on the Context.
Query: {query_str}
Answer: """
PROMPT = PromptTemplate(RAG_PROMPT)

# Initialize the multimodal LLM
MM_LLM = OpenAIMultiModal(model="gpt-4o-mini", temperature=0.0, max_tokens=16000)
Integrating the Entire Pipeline in a Query Engine
The following QueryEngine class implements the above-mentioned workflow. The number of nodes in BM25 search (top_n_bm25) and the number of re-ranked results (top_n) returned by the re-ranker can be adjusted as required. The BM25 search and re-ranking can be selected or de-selected by toggling the best_match_25 and re_ranking variables in the GitHub code.
Here is the overall workflow implemented by the QueryEngine class.
- Find query embeddings
- Retrieve nodes from the vector database using vector-based retrieval
- Retrieve nodes with BM25 search (if selected)
- Combine nodes from both BM25 and vector-based retrieval and keep only the unique nodes (remove duplicates)
- Apply re-ranking to the combined results (if selected). Here, we use Cohere’s rerank-english-v2.0 re-ranker model. You can create an account at Cohere’s website to get the trial API keys.
- Create image nodes from the images associated with the nodes
- Create the context string from the parsed markdown text
- Send the node images to the multimodal LLM for interpretation.
- Generate the final response by sending the text nodes, image node descriptions, and metadata to the LLM.
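The combine, deduplicate, and re-rank steps above can be sketched with plain dictionaries. This is an illustration under stated assumptions: merge_unique mirrors the deduplication by node_id, while toy_rerank is a hypothetical word-overlap scorer that only stands in for a real re-ranker such as Cohere's. It does not reproduce that model's behavior.

```python
from typing import Dict, List


def merge_unique(vector_hits: List[Dict], bm25_hits: List[Dict]) -> List[Dict]:
    """Concatenate both result lists and keep the first occurrence of each node_id."""
    unique: Dict[str, Dict] = {}
    for hit in vector_hits + bm25_hits:
        unique.setdefault(hit["node_id"], hit)
    return list(unique.values())


def toy_rerank(query: str, hits: List[Dict], top_n: int = 3) -> List[Dict]:
    """Stand-in for a re-ranker: order hits by how many query words they contain."""
    words = set(query.lower().split())
    ranked = sorted(
        hits,
        key=lambda h: len(words & set(h["text"].lower().split())),
        reverse=True,
    )
    return ranked[:top_n]


# Hypothetical retrieval results; node "b" is returned by both retrievers
vector_hits = [
    {"node_id": "a", "text": "residence permits by country"},
    {"node_id": "b", "text": "asylum applications"},
]
bm25_hits = [
    {"node_id": "b", "text": "asylum applications"},
    {"node_id": "c", "text": "first residence permits 2023"},
]

merged = merge_unique(vector_hits, bm25_hits)
print([h["node_id"] for h in merged])
print([h["node_id"] for h in toy_rerank("first residence permits", merged, top_n=2)])
```

Deduplicating before re-ranking keeps the re-ranker from spending part of its top_n budget on the same page twice.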
# Define the QueryEngine integrating all methods
class QueryEngine(CustomQueryEngine):
    # Public fields
    qa_prompt: PromptTemplate
    multi_modal_llm: OpenAIMultiModal
    node_postprocessors: Optional[List[BaseNodePostprocessor]] = None

    # Private attributes using PrivateAttr
    _bm25: BM25Okapi = PrivateAttr()
    _llm: OpenAI = PrivateAttr()
    _text_node_with_context: List[TextNode] = PrivateAttr()
    _vector_index: VectorStoreIndex = PrivateAttr()

    def __init__(
        self,
        qa_prompt: PromptTemplate,
        bm25: BM25Okapi,
        multi_modal_llm: OpenAIMultiModal,
        vector_index: VectorStoreIndex,
        node_postprocessors: Optional[List[BaseNodePostprocessor]] = None,
        llm: OpenAI = None,
        text_node_with_context: List[TextNode] = None,
    ):
        super().__init__(
            qa_prompt=qa_prompt,
            retriever=None,
            multi_modal_llm=multi_modal_llm,
            node_postprocessors=node_postprocessors
        )
        self._bm25 = bm25
        self._llm = llm
        self._text_node_with_context = text_node_with_context
        self._vector_index = vector_index

    def custom_query(self, query_str: str):
        # Prepare the query bundle
        query_bundle = QueryBundle(query_str)

        bm25_nodes = []
        if best_match_25 == 1:  # if BM25 search is selected
            # Retrieve nodes using BM25
            query_tokens = query_str.split()
            bm25_scores = self._bm25.get_scores(query_tokens)
            top_n_bm25 = 5  # Adjust the number of top nodes to retrieve
            # Get indices of top BM25 scores
            top_indices_bm25 = bm25_scores.argsort()[-top_n_bm25:][::-1]
            bm25_nodes = [self._text_node_with_context[i] for i in top_indices_bm25]
            logging.info(f"BM25 nodes retrieved: {len(bm25_nodes)}")
        else:
            logging.info("BM25 not selected.")

        # Retrieve nodes using vector-based retrieval from the vector store
        vector_retriever = self._vector_index.as_query_engine().retriever
        vector_nodes_with_scores = vector_retriever.retrieve(query_bundle)
        # Specify the number of top vectors you want
        top_n_vectors = 5  # Adjust this value as needed
        # Get only the top 'n' nodes
        top_vector_nodes_with_scores = vector_nodes_with_scores[:top_n_vectors]
        vector_nodes = [node.node for node in top_vector_nodes_with_scores]
        logging.info(f"Vector nodes retrieved: {len(vector_nodes)}")

        # Combine nodes and remove duplicates
        all_nodes = vector_nodes + bm25_nodes
        unique_nodes_dict = {node.node_id: node for node in all_nodes}
        unique_nodes = list(unique_nodes_dict.values())
        logging.info(f"Unique nodes after deduplication: {len(unique_nodes)}")
        nodes = unique_nodes

        if re_ranking == 1:  # if re-ranking is selected
            # Apply Cohere re-ranking to the combined results
            documents = [node.get_content() for node in nodes]
            max_retries = 3
            for attempt in range(max_retries):
                try:
                    reranked = cohere_client.rerank(
                        model="rerank-english-v2.0",
                        query=query_str,
                        documents=documents,
                        top_n=3  # top-3 re-ranked nodes
                    )
                    break
                except CohereError as e:
                    if attempt < max_retries - 1:
                        logging.warning(f"Error occurred: {str(e)}. Waiting for 60 seconds before retry {attempt + 1}/{max_retries}")
                        time.sleep(60)  # Wait before retrying
                    else:
                        logging.error("Error occurred. Max retries reached. Proceeding without re-ranking.")
                        reranked = None
                        break
            if reranked:
                reranked_indices = [result.index for result in reranked.results]
                nodes = [nodes[i] for i in reranked_indices]
            else:
                nodes = nodes[:3]  # Fallback to top 3 nodes
            logging.info(f"Nodes after re-ranking: {len(nodes)}")
        else:
            logging.info("Re-ranking not selected.")

        # Limit and filter node content for the context string
        max_context_length = 16000  # Adjust as required
        current_length = 0
        filtered_nodes = []
        # Initialize tokenizer
        from transformers import GPT2TokenizerFast
        tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
        for node in nodes:
            content = node.get_content(metadata_mode=MetadataMode.LLM).strip()
            node_length = len(tokenizer.encode(content))
            logging.info(f"Node ID: {node.node_id}, Content Length (tokens): {node_length}")
            if not content:
                logging.warning(f"Node ID: {node.node_id} has empty content. Skipping.")
                continue
            if current_length + node_length <= max_context_length:
                filtered_nodes.append(node)
                current_length += node_length
            else:
                logging.info(f"Reached max context length with Node ID: {node.node_id}")
                break
        logging.info(f"Filtered nodes for context: {len(filtered_nodes)}")

        # Create context string
        ctx_str = "\n\n".join(
            [n.get_content(metadata_mode=MetadataMode.LLM).strip() for n in filtered_nodes]
        )

        # Create image nodes from the images associated with the nodes
        image_nodes = []
        for n in filtered_nodes:
            if "image_path" in n.metadata:
                image_nodes.append(
                    NodeWithScore(node=ImageNode(image_path=n.metadata["image_path"]))
                )
            else:
                logging.warning(f"Node ID: {n.node_id} lacks 'image_path' metadata.")
        logging.info(f"Image nodes created: {len(image_nodes)}")

        # Prepare prompt for the LLM
        fmt_prompt = self.qa_prompt.format(context_str=ctx_str, query_str=query_str)

        # Use the multimodal LLM to interpret images and generate a response
        llm_response = self.multi_modal_llm.complete(
            prompt=fmt_prompt,
            image_documents=[image_node.node for image_node in image_nodes],
            max_tokens=16000
        )
        logging.info("LLM response generated.")

        # Return the final response
        return Response(
            response=str(llm_response),
            source_nodes=filtered_nodes,
            metadata={
                "text_node_with_context": self._text_node_with_context,
                "image_nodes": image_nodes,
            },
        )

# Initialize the query engine with BM25, Cohere Re-ranking, and Query Expansion
query_engine = QueryEngine(
    qa_prompt=PROMPT,
    bm25=bm25,
    multi_modal_llm=MM_LLM,
    vector_index=index,
    node_postprocessors=[],
    llm=llm,
    text_node_with_context=text_node_with_context
)
print("All done")
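The context-length guard inside custom_query() is easy to test in isolation. The following is a minimal sketch of the same greedy idea, assuming a pluggable token counter. The names fit_to_budget and whitespace_count are hypothetical, and counting whitespace-separated words is only a crude stand-in for the GPT-2 tokenizer used in the engine.

```python
from typing import Callable, List


def fit_to_budget(
    texts: List[str], max_tokens: int, count_tokens: Callable[[str], int]
) -> List[str]:
    """Keep texts in order until adding the next one would exceed max_tokens."""
    kept: List[str] = []
    used = 0
    for text in texts:
        text = text.strip()
        if not text:
            continue  # skip empty content, as the engine does
        n_tokens = count_tokens(text)
        if used + n_tokens > max_tokens:
            break  # stop at the first text that does not fit
        kept.append(text)
        used += n_tokens
    return kept


def whitespace_count(text: str) -> int:
    """Crude token estimate (assumption): one token per whitespace-separated word."""
    return len(text.split())


chunks = ["one two three", "   ", "four five", "six seven eight nine"]
print(fit_to_budget(chunks, max_tokens=5, count_tokens=whitespace_count))
```

Because the loop stops at the first node that does not fit, the order of the nodes matters. Running it after re-ranking means the most relevant pages are the ones that survive.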
An advantage of using OpenAI models, especially gpt-4o-mini, is the much lower cost of context assignment and query inference, as well as a much shorter context assignment time. While the basic tiers of both OpenAI and Anthropic quickly hit the maximum rate limit of API calls, the retry time in Anthropic’s basic tier varies and can be too long. The context assignment process for only the first 20 pages of this document with claude-3-5-sonnet-20240620 took approximately 170 seconds with prompt caching and cost 20 cents (input + output tokens). By contrast, gpt-4o-mini is roughly 20x cheaper than Claude 3.5 Sonnet for input tokens and roughly 25x cheaper for output tokens. OpenAI claims to implement prompt caching for repetitive content, which works automatically for all API calls.
In comparison, the context assignment to the nodes of this entire document (60 pages) through gpt-4o-mini completed in approximately 193 seconds without any retry request.
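Since rate limits are what make the context assignment slow, a small retry wrapper is useful around any API call in this pipeline. The sketch below follows the same wait-and-retry pattern as the re-ranking step in QueryEngine. The helper call_with_retries, its parameters, and the flaky function are hypothetical, and the sleep function is injectable so the example runs instantly.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def call_with_retries(
    fn: Callable[[], T],
    max_retries: int = 3,
    wait_seconds: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[T]:
    """Call fn; on an exception wait and retry, returning None once retries run out."""
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception as exc:  # a real client would catch its specific error type
            if attempt < max_retries - 1:
                print(f"Attempt {attempt + 1}/{max_retries} failed: {exc}. Retrying.")
                sleep(wait_seconds)
    return None


# Hypothetical flaky call: fails twice with a simulated rate limit, then succeeds
calls = {"n": 0}


def flaky() -> str:
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("rate limit")
    return "ok"


result = call_with_retries(flaky, wait_seconds=0.0)
print(result, calls["n"])
```

Returning None instead of raising lets the caller fall back gracefully, for example by keeping the un-reranked nodes as the engine does.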
After implementing the QueryEngine class, we can run the query inference as follows:
original_query = """What are the top countries to whose citizens the Finnish Immigration Service issued the highest number of first residence permits in 2023?
Which of these countries received the highest number of first residence permits?"""
response = query_engine.query(original_query)
display(Markdown(str(response)))
Here is the markdown response to this query.
The pages cited in the query response are the following.
Now let’s compare the performance of the gpt-4o-mini based RAG (LlamaParse premium + context retrieval + BM25 + re-ranking) with the Claude based RAG (LlamaParse premium + context retrieval). I also implemented a simple, baseline RAG, which can be found in the GitHub notebook. Here are the three RAGs to be compared.
- Simple RAG in LlamaIndex using SentenceSplitter to split the documents into chunks (chunk_size = 800, chunk_overlap = 400), creating a vector index and using vector retrieval.
- CMRAG (claude-3-5-sonnet-20240620, voyage-3): LlamaParse premium mode + context retrieval
- CMRAG (gpt-4o-mini, text-embedding-3-small): LlamaParse premium mode + context retrieval + BM25 + re-ranking
For the sake of simplicity, we refer to these RAGs as RAG0, RAG1, and RAG2, respectively. Here are three pages from the report about which I asked each RAG three questions (one question per page). The areas highlighted by the purple rectangles show the ground truth, that is, the place where the right answer should come from.
Here are the responses of the three RAGs to each question.
It can be seen that RAG2 performs very well. For the first question, RAG0 gives a wrong answer because the question concerns information that appears only in an image. Both RAG1 and RAG2 provided the right answer to this question. For the other two questions, RAG0 could not provide any answer, whereas both RAG1 and RAG2 answered them correctly.
Overall, RAG2’s performance was equal to or even better than RAG1’s in many cases because of the integration of BM25, re-ranking, and better prompting. It offers a cost-effective solution for a contextual, multimodal RAG. A possible addition to this pipeline could be hypothetical document embedding (HyDE) or query extension. Similarly, open-source embedding models (such as all-MiniLM-L6-v2) and/or lightweight LLMs (such as gemma2 or phi-3-small) could be explored to make it more economical.