Impressive! One might argue about whether it actually found all of the skyscrapers here, but I feel like such a system has the potential to be quite powerful and useful, especially if we were to add the ability to crop the bounding boxes, zoom in and continue the conversation.
In the following sections, let's dive into the main steps in a bit more detail. My hope is that some of them might be informative for your own projects too.
My previous article contains a more detailed discussion of agents and LangGraph, so here I'll just touch on the agent state for this project. The AgentState is made available to all the nodes in the LangGraph graph, and it's where the information associated with a query gets stored.
Each node can be told to write to one or more variables in the state, and by default they get overwritten. This isn't the behavior we want for the plan output, which is meant to be a list of results from each step of the plan. To ensure that this list gets appended to as the agent goes about its work, we use the add reducer, which you can read more about here.
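As a minimal sketch of what that looks like (the field names here are illustrative and may not match the project's exact schema, apart from those used in the nodes below):
import operator
from typing import Annotated, TypedDict

class AgentState(TypedDict):
    plan: list           # messages produced by the planning node
    plan_structure: str  # JSON string describing each step of the plan
    current_step: int    # which step of the plan we are on
    max_steps: int       # total number of steps in the plan
    # the add reducer appends new results instead of overwriting the list
    plan_output: Annotated[list, operator.add]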
Each of the nodes in the graph above is a method in the class AgentNodes. They take in state, perform some action (usually calling an LLM) and output their updates to the state. For example, here's the node used to structure the plan, copied from the code here.
def structure_plan_node(self, state: dict) -> dict:
    messages = state["plan"]
    response = self.llm_structure.call(messages)
    final_plan_dict = self.post_process_plan_structure(response)
    final_plan = json.dumps(final_plan_dict)
    return {
        "plan_structure": final_plan,
        "current_step": 0,
        "max_steps": len(final_plan_dict),
    }
The routing node is also important because it's visited multiple times over the course of plan execution. In the current code it's very simple, just updating the current step value in the state so that other nodes know which part of the plan structure list to look at.
def routing_node(self, state: dict) -> dict:
    plan_stage = state.get("current_step", 0)
    return {"current_step": plan_stage + 1}
An extension here would be to add another LLM call in the routing node to check whether the output of the previous step of the plan warrants any modifications to the following steps, or early termination if the question has already been answered. A rough sketch of that idea is shown below.
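This isn't implemented in the repo; the following is only an illustrative sketch, where llm_router and the state keys user_question and plan_output are assumptions made for the example:
def routing_node(self, state: dict) -> dict:
    plan_stage = state.get("current_step", 0)
    # hypothetical check: ask a text LLM whether the latest step output
    # already answers the user's question
    verdict = self.llm_router.call(
        f"Question: {state.get('user_question')}\n"
        f"Latest result: {state.get('plan_output', [])[-1:]}\n"
        "Reply with CONTINUE or STOP."
    )
    if "STOP" in verdict:
        # jump past the final step so the choose_model edge routes to finalize
        return {"current_step": state.get("max_steps", 0) + 1}
    return {"current_step": plan_stage + 1}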
Finally we need to add two conditional edges, which use data stored in the AgentState to determine which node should be run next. For example, the choose_model edge looks at the name of the current step in the plan_structure object carried in AgentState and then uses a simple if statement to return the name of the corresponding node that should be called at that step.
def choose_model(state: dict) -> str:
    current_plan = json.loads(state.get("plan_structure"))
    current_step = state.get("current_step", 1)
    max_step = state.get("max_steps", 999)
    if current_step > max_step:
        return "finalize"
    else:
        step_to_execute = current_plan[str(current_step)]["tool_name"]
        return step_to_execute
The entire agent structure looks like this.
edges: AgentEdges = AgentEdges()
nodes: AgentNodes = AgentNodes()
agent: StateGraph = StateGraph(AgentState)

## Nodes
agent.add_node("planning", nodes.plan_node)
agent.add_node("structure_plan", nodes.structure_plan_node)
agent.add_node("routing", nodes.routing_node)
agent.add_node("special_vision", nodes.call_special_vision_node)
agent.add_node("general_vision", nodes.call_general_vision_node)
agent.add_node("evaluation", nodes.assessment_node)
agent.add_node("response", nodes.dump_result_node)

## Edges
agent.set_entry_point("planning")
agent.add_edge("planning", "structure_plan")
agent.add_edge("structure_plan", "routing")
agent.add_conditional_edges(
    "routing",
    edges.choose_model,
    {
        "special_vision": "special_vision",
        "general_vision": "general_vision",
        "finalize": "evaluation",
    },
)
agent.add_edge("special_vision", "routing")
agent.add_edge("general_vision", "routing")
agent.add_conditional_edges(
    "evaluation",
    edges.back_to_plan,
    {
        "good_answer": "response",
        "bad_answer": "planning",
        "timeout": "response",
    },
)
agent.add_edge("response", END)
And it can be visualized in a notebook using the tutorial here.
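Assuming the standard LangGraph API, something along these lines should render the compiled graph in a notebook:
from IPython.display import Image, display

# compile the graph and draw it as a Mermaid diagram
graph = agent.compile()
display(Image(graph.get_graph().draw_mermaid_png()))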
The planning, structuring and evaluation nodes are well suited to a text-based LLM that can reason and produce structured outputs. The most straightforward option here is to go with a large, versatile model like GPT-4o-mini, which has the benefit of excellent support for JSON output from a Pydantic schema.
With the help of some LangChain functionality, we can make a class to call such a model.
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI


class StructuredOpenAICaller:
    MODEL_NAME = "gpt-4o-mini"

    def __init__(
        self, api_key, system_prompt, output_model, temperature=0, max_tokens=1000
    ):
        self.temperature = temperature
        self.max_tokens = max_tokens
        self.system_prompt = system_prompt
        self.output_model = output_model
        self.llm = ChatOpenAI(
            model=self.MODEL_NAME,
            api_key=api_key,
            temperature=temperature,
            max_tokens=max_tokens,
        )
        self.chain = self._set_up_chain()

    def _set_up_chain(self):
        prompt = ChatPromptTemplate.from_messages(
            [
                ("system", self.system_prompt.system_template),
                ("human", "{query}"),
            ]
        )
        structured_llm = self.llm.with_structured_output(self.output_model)
        chain = prompt | structured_llm
        return chain

    def call(self, query):
        return self.chain.invoke({"query": query})
To set this up, we supply a system prompt and an output model (see here for some examples of these) and then we can simply use the call method with an input string to get a response that conforms to the structure of the output model we specified. With the code set up like this we'd need to make a new instance of StructuredOpenAICaller for every different system prompt and output model used in the agent. I personally prefer this because it keeps track of the different models being used, but as the agent becomes more complex it could be modified with another method to directly update the system prompt and output model in a single instance of the class.
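As a usage sketch, with a hypothetical Pydantic schema and prompt object standing in for the real ones from the repo:
from pydantic import BaseModel, Field

# illustrative schema, not the project's actual output model
class PlanStep(BaseModel):
    tool_name: str = Field(description="Which tool to call at this step")
    tool_input: str = Field(description="Input text for that tool")

structurer = StructuredOpenAICaller(
    api_key=api_key,                    # assumed to be loaded elsewhere
    system_prompt=PlanStructurePrompt,  # hypothetical prompt object with a system_template attribute
    output_model=PlanStep,
)
step = structurer.call("Call generalist vision with the question 'What city is this?'")
print(step.tool_name, step.tool_input)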
Can we do this with local models too? On Apple Silicon, we can use the MLX library and the MLX community on Hugging Face to easily experiment with open source models like Llama3.2. LangChain also has support for MLX integration, so we can follow the structure of the class above to set up a local model.
from typing import Any

from langchain_core.output_parsers import PydanticOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_community.llms.mlx_pipeline import MLXPipeline
from langchain_community.chat_models.mlx import ChatMLX


class StructuredLlamaCaller:
    MODEL_PATH = "mlx-community/Llama-3.2-3B-Instruct-4bit"

    def __init__(
        self,
        system_prompt: Any,
        output_model: Any,
        temperature: float = 0,
        max_tokens: int = 1000,
    ) -> None:
        self.system_prompt = system_prompt
        # this is the name of the Pydantic model that defines
        # the structure we want to output
        self.output_model = output_model
        self.loaded_model = MLXPipeline.from_model_id(
            self.MODEL_PATH,
            pipeline_kwargs={"max_tokens": max_tokens, "temp": temperature, "do_sample": False},
        )
        self.llm = ChatMLX(llm=self.loaded_model)
        self.temperature = temperature
        self.max_tokens = max_tokens
        self.chain = self._set_up_chain()

    def _set_up_chain(self) -> Any:
        # Set up a parser
        parser = PydanticOutputParser(pydantic_object=self.output_model)
        # Prompt
        prompt = ChatPromptTemplate.from_messages(
            [
                (
                    "system",
                    self.system_prompt.system_template,
                ),
                ("human", "{query}"),
            ]
        ).partial(format_instructions=parser.get_format_instructions())
        chain = prompt | self.llm | parser
        return chain

    def call(self, query: str) -> Any:
        return self.chain.invoke({"query": query})
There are a few interesting points here. For a start, we can just download the weights and config for Llama3.2 as we would for any other Hugging Face model, and under the hood they are loaded into MLX using the MLXPipeline tool from LangChain. When the models are first downloaded they are automatically placed in the Hugging Face cache. Sometimes it's useful to list the models and their cache locations, for example if you want to copy a model to a new environment. The util scan_cache_dir can help here, and is used to produce a convenient summary in the function below.
import pandas as pd
from huggingface_hub import scan_cache_dir


def fetch_downloaded_model_details():
    hf_cache_info = scan_cache_dir()
    repo_paths = []
    size_on_disk = []
    repo_ids = []
    for repo in sorted(
        hf_cache_info.repos, key=lambda repo: repo.repo_path
    ):
        repo_paths.append(str(repo.repo_path))
        size_on_disk.append(repo.size_on_disk)
        repo_ids.append(repo.repo_id)
    repos_df = pd.DataFrame({
        "local_path": repo_paths,
        "size_on_disk": size_on_disk,
        "model_name": repo_ids
    })
    repos_df.set_index("model_name", inplace=True)
    return repos_df.to_dict(orient="index")
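Calling this returns a dictionary keyed by model name, so a quick way to inspect what is cached locally might look like this:
details = fetch_downloaded_model_details()
for model_name, info in details.items():
    # print each cached repo's local path and size on disk (in bytes)
    print(model_name, info["local_path"], info["size_on_disk"])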
Llama3.2 doesn't have built-in support for structured output like GPT-4o-mini, so we need to use the prompt to force it to generate JSON. LangChain's PydanticOutputParser can help, though it is also possible to implement your own version of this as shown here.
In my experience, the version of Llama that I'm using here, namely Llama-3.2-3B-Instruct-4bit, is not reliable for structured output beyond the simplest schemas. It's reasonably good at the "plan generation" stage of our agent given a prompt with a few examples, but even with the help of the instructions supplied by PydanticOutputParser, it often fails to turn that plan into JSON. Larger and/or less quantized versions of Llama will likely do better, but they may run into RAM issues if run alongside the other models in our agent. Therefore, going forwards in the project, the orchestration model is set to be GPT-4o-mini.
To be able to answer questions like "What's going on in this image?" or "What city is this?", we need a multimodal LLM. Arguably Florence2 in image captioning mode might be able to give good responses to this sort of question, but it's not really designed for conversational output.
The field of multimodal models small enough to run on a laptop is still in its infancy (a recently compiled list can be found here), but the Qwen2-VL series from Alibaba is a promising development. Furthermore, we can make use of MLX-VLM, an extension of MLX specifically designed for tuning and inference of vision models, to set up one of these models within our agent framework.
from mlx_vlm import load, apply_chat_template, generate


class QwenCaller:
    MODEL_PATH = "mlx-community/Qwen2-VL-2B-Instruct-4bit"

    def __init__(self, max_tokens=1000, temperature=0):
        self.model, self.processor = load(self.MODEL_PATH)
        self.config = self.model.config
        self.max_tokens = max_tokens
        self.temperature = temperature

    def call(self, query, image):
        # ImageInterpretationPrompt is a prompt class defined elsewhere in the project
        messages = [
            {
                "role": "system",
                "content": ImageInterpretationPrompt.system_template,
            },
            {"role": "user", "content": query},
        ]
        prompt = apply_chat_template(self.processor, self.config, messages)
        output = generate(
            self.model,
            self.processor,
            image,
            prompt,
            max_tokens=self.max_tokens,
            temperature=self.temperature,
        )
        return output
This class will load the smallest version of Qwen2-VL and then call it with an input image and prompt to get a textual response. For more detail about the capabilities of this model and others that could be used in the same way, check out this list of examples on the MLX-VLM GitHub page. Qwen2-VL is also apparently capable of producing bounding boxes and object pointing coordinates, so this functionality could be explored and compared with Florence2.
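Usage then looks something like the following sketch, assuming we have a PIL image loaded (the file path is just a placeholder):
from PIL import Image

image = Image.open("test_image.jpg")  # hypothetical path
qwen = QwenCaller()
answer = qwen.call("Are there trees in this image?", image)
print(answer)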
Of course GPT-4o-mini also has vision capabilities and is likely more reliable than smaller local models. Therefore when building these sorts of applications it's useful to add the ability to call a cloud based alternative, if nothing else as a backup in case one of the local models fails. Note that input images must be converted to base64 before they can be sent to the model, but once that's done we can also use the LangChain framework as shown below.
import base64
from io import BytesIO

from PIL import Image
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI
from langchain_core.output_parsers import StrOutputParser


def convert_PIL_to_base64(image: Image.Image, format="jpeg"):
    buffer = BytesIO()
    # Save the image to this buffer in the specified format
    image.save(buffer, format=format)
    # Get binary data from the buffer
    image_bytes = buffer.getvalue()
    # Encode binary data to Base64
    base64_encoded = base64.b64encode(image_bytes)
    # Convert Base64 bytes to string (optional)
    return base64_encoded.decode("utf-8")


class OpenAIVisionCaller:
    MODEL_NAME = "gpt-4o-mini"

    def __init__(self, api_key, system_prompt, temperature=0, max_tokens=1000):
        self.temperature = temperature
        self.max_tokens = max_tokens
        self.system_prompt = system_prompt
        self.llm = ChatOpenAI(
            model=self.MODEL_NAME,
            api_key=api_key,
            temperature=temperature,
            max_tokens=max_tokens,
        )
        self.chain = self._set_up_chain()

    def _set_up_chain(self):
        prompt = ChatPromptTemplate.from_messages(
            [
                ("system", self.system_prompt.system_template),
                (
                    "user",
                    [
                        {"type": "text", "text": "{query}"},
                        {
                            "type": "image_url",
                            "image_url": {"url": "data:image/jpeg;base64,{image_data}"},
                        },
                    ],
                ),
            ]
        )
        chain = prompt | self.llm | StrOutputParser()
        return chain

    def call(self, query, image):
        base64image = convert_PIL_to_base64(image)
        return self.chain.invoke({"query": query, "image_data": base64image})
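A quick usage sketch, where the system prompt object is assumed to come from the repo's prompt definitions and the image path is a placeholder:
from PIL import Image

image = Image.open("test_image.jpg")  # hypothetical path
gpt_vision = OpenAIVisionCaller(
    api_key=api_key,                          # assumed to be loaded elsewhere
    system_prompt=ImageInterpretationPrompt,  # assumed prompt object with a system_template attribute
)
print(gpt_vision.call("What city is this?", image))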
Florence2 is seen as a specialist model in the context of our agent because, while it has many capabilities, its inputs must be chosen from a list of predefined task prompts. Of course the model could be fine tuned to accept new prompts, but for our purposes the version downloaded directly from Hugging Face works well. The beauty of this model is that it uses a single training process and set of weights, yet achieves high performance on multiple image tasks that previously would have demanded their own models. The key to this success lies in its large and carefully curated training dataset, FLD-5B. To learn more about the dataset, model and training I recommend this excellent article.
In our context, we use the orchestration model to turn the query into a sequence of Florence task prompts, which we then call in sequence. The options available to us include captioning, object detection, phrase grounding, OCR and segmentation. For some of these options (i.e. phrase grounding and region to segmentation) an input phrase is required, so the orchestration model generates that too. In contrast, tasks like captioning need only the image. There are many use cases for Florence2, which are explored in code here. We restrict ourselves to object detection, phrase grounding, captioning and OCR, though it would be straightforward to add more by updating the prompts associated with plan generation and structuring.
Florence2 appears to be supported by the MLX-VLM package, but at the time of writing I couldn't find any examples of its use and so opted for an approach that uses Hugging Face transformers as shown below.
from typing import Any, Optional
from unittest.mock import patch

import torch
from transformers import AutoModelForCausalLM, AutoProcessor


def get_device_type():
    if torch.cuda.is_available():
        return "cuda"
    else:
        if torch.backends.mps.is_available() and torch.backends.mps.is_built():
            return "mps"
        else:
            return "cpu"


class FlorenceCaller:
    MODEL_PATH: str = "microsoft/Florence-2-base-ft"
    # See https://huggingface.co/microsoft/Florence-2-base-ft for other modes
    # for Florence2
    TASK_DICT: dict[str, str] = {
        "general object detection": "<OD>",
        "specific object detection": "<CAPTION_TO_PHRASE_GROUNDING>",
        "image captioning": "<MORE_DETAILED_CAPTION>",
        "OCR": "<OCR_WITH_REGION>",
    }

    def __init__(self) -> None:
        self.device: str = (
            get_device_type()
        )  # Function to determine the device type (e.g., 'cpu', 'mps' or 'cuda').
        # fixed_get_imports is a helper defined in the project code
        with patch("transformers.dynamic_module_utils.get_imports", fixed_get_imports):
            self.model: AutoModelForCausalLM = AutoModelForCausalLM.from_pretrained(
                self.MODEL_PATH, trust_remote_code=True
            )
        self.processor: AutoProcessor = AutoProcessor.from_pretrained(
            self.MODEL_PATH, trust_remote_code=True
        )
        self.model.to(self.device)

    def translate_task(self, task_name: str) -> str:
        return self.TASK_DICT.get(task_name, "<DETAILED_CAPTION>")

    def call(
        self, task_prompt: str, image: Any, text_input: Optional[str] = None
    ) -> Any:
        # Get the corresponding task code for the given prompt
        task_code: str = self.translate_task(task_prompt)
        # Prevent text_input for tasks that don't require it
        if task_code in [
            "<OD>",
            "<MORE_DETAILED_CAPTION>",
            "<OCR_WITH_REGION>",
            "<DETAILED_CAPTION>",
        ]:
            text_input = None
        # Construct the prompt based on whether text_input is supplied
        prompt: str = task_code if text_input is None else task_code + text_input
        # Preprocess inputs for the model
        inputs = self.processor(text=prompt, images=image, return_tensors="pt").to(
            self.device
        )
        # Generate predictions using the model
        generated_ids = self.model.generate(
            input_ids=inputs["input_ids"],
            pixel_values=inputs["pixel_values"],
            max_new_tokens=1024,
            early_stopping=False,
            do_sample=False,
            num_beams=3,
        )
        # Decode and process generated output
        generated_text: str = self.processor.batch_decode(
            generated_ids, skip_special_tokens=False
        )[0]
        parsed_answer: dict[str, Any] = self.processor.post_process_generation(
            generated_text, task=task_code, image_size=(image.width, image.height)
        )
        return parsed_answer[task_code]
On Apple Silicon, the device becomes mps and the latency of these model calls is tolerable. This code should also work on GPU and CPU, though this has not been tested.
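Calling the model then looks like this sketch, where the image path is just a placeholder:
from PIL import Image

image = Image.open("test_image.jpg")  # hypothetical path
florence = FlorenceCaller()
# phrase grounding needs a text input; captioning and OCR do not
print(florence.call("specific object detection", image, text_input="tree"))
print(florence.call("image captioning", image))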
Let's run through another example to see the agent outputs from each step. To call the agent on an input query and image we can use the Agent.invoke method, which follows the same process as described in my previous article, adding each node output to a list of results in addition to saving outputs in a LangGraph InMemoryStore object.
We'll be using the following image, which presents an interesting challenge if we ask a tricky question like "Are there trees in this image? If so, find them and describe what they're doing".
from image_agent.agent.Agent import Agent
from image_agent.utils import load_secrets

secrets = load_secrets()

# use GPT-4o-mini for general vision mode
full_agent_gpt_vision = Agent(
    openai_api_key=secrets["OPENAI_API_KEY"], vision_mode="gpt"
)
# use local model for general vision
full_agent_qwen_vision = Agent(
    openai_api_key=secrets["OPENAI_API_KEY"], vision_mode="local"
)
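We can then pass the question and image to each agent. The exact signature of Agent.invoke lives in the repo; the sketch below assumes it takes the query string and a PIL image, and the image path is a placeholder:
from PIL import Image

image = Image.open("dogs_image.jpg")  # placeholder path for the image above
query = "Are there trees in this image? If so, find them and describe what they're doing"

qwen_result = full_agent_qwen_vision.invoke(query, image)
gpt_result = full_agent_gpt_vision.invoke(query, image)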
In an ideal world the answer is simple: there are no trees.
However this turns out to be a difficult question for the agent, and it's interesting to compare the responses when using GPT-4o-mini vs. Qwen2 as the general vision model.
When we call full_agent_qwen_vision with this query, we get a bad result: both Qwen2 and Florence2 fall for the trick and report that trees are present (interestingly, if we change "trees" to "dogs", we get the right answer).
Plan:
Call generalist vision with the question 'Are there trees in this image? If so, what are they doing?'. Then call specialist vision in object specific mode with the phrase 'cat'.

Plan_structure:
{
  "1": {"tool_name": "general_vision", "tool_mode": "conversation", "tool_input": "Are there trees in this image? If so, what are they doing?"},
  "2": {"tool_name": "special_vision", "tool_mode": "specific object detection", "tool_input": "tree"}
}

Plan output:
[
  {1: 'Yes, there are trees in the image. They appear to be part of a tree line against a pink background.'}
]
[
  {2: '{"bboxes": [[235.77601623535156, 427.864501953125, 321.7920227050781, 617.2275390625]], "labels": ["tree"]}'}
]
Evaluation:
The result adequately answers the user's question by confirming the presence of trees in the image and providing a description of their appearance and context. The output from both the generalist and specialist vision tools is consistent and informative.
Qwen2 seems prone to blindly following the prompt's hint that there might be trees present. Florence2 also fails here, reporting a bounding box when it shouldn't.
If we call full_agent_gpt_vision with the same query, GPT-4o-mini doesn't fall for the trick, but the call to Florence2 hasn't changed so it still fails. We then see the answer assessment step in action because the generalist and specialist models have produced conflicting results.
Node : general_vision
Task : plan_output
[
  {1: 'There are no trees in this image. It features a group of dogs sitting in front of a pink wall.'}
]

Node : special_vision
Task : plan_output
[
  {2: '{"bboxes": [[235.77601623535156, 427.864501953125, 321.7920227050781, 617.2275390625]], "labels": ["tree"]}'}
]

Node : evaluation
Task : answer_assessment
The result contains conflicting information.
The first part states that there are no trees in the image, while the second part provides a bounding box and label indicating that a tree is present.
This inconsistency means the user's question is not adequately answered.
The agent then tries several times to restructure the plan, but Florence2 insists on producing a bounding box for "tree", which the answer assessment node always catches as inconsistent. This is a better result than the Qwen2 agent, but it points to a broader issue of false positives with Florence2. This could be addressed by having the routing node evaluate the plan after every step and then only call Florence2 if absolutely necessary.
With the basic building blocks in place, this system is ripe for experimentation, iteration and improvement, and I'll continue to add to the repo over the coming weeks. For now though, this article is long enough!
Thanks for making it to the end, and I hope the project here sparks some inspiration for your own projects! The orchestration of multiple specialist models within agent frameworks is a powerful and increasingly accessible approach to putting LLMs to work on complex tasks. Clearly there's still a lot of room for improvement, and I for one look forward to seeing how ideas in this field develop over the coming year.