Large language models (LLMs) have marked a fundamental shift in AI research and development. Yet, despite their broad impact, they remain limited in an important way.
Namely, LLMs can only process and generate text, making them blind to other modalities such as images, video, and audio. This is a major limitation, since some tasks depend on non-text data, e.g., analyzing engineering blueprints, reading body language or speech tonality, and interpreting plots and infographics.
This has sparked efforts toward expanding LLM functionality to include multiple modalities.
A multimodal model (MM) is an AI system that can process multiple data modalities as input or output (or both) [1]. Below are a few examples.
- GPT-4o — Input: text, images, and audio. Output: text.
- FLUX — Input: text. Output: images.
- Suno — Input: text. Output: audio.
While there are several ways to build models that can process multiple data modalities, a recent line of research uses LLMs as the core reasoning engine of a multimodal system [2]. Such models are called multimodal large language models (or large multimodal models) [2][3].
One benefit of using an existing LLM as the starting point for an MM is that LLMs have demonstrated a strong ability to acquire world knowledge through large-scale pre-training, which can be leveraged to process concepts appearing in non-textual representations.
Here, I will focus on multimodal models developed from an LLM. Three popular approaches are described below.
- LLM + Tools: Augment LLMs with pre-built components
- LLM + Adapters: Augment LLMs with multimodal encoders or decoders, which are aligned via adapter fine-tuning
- Unified Models: Expand the LLM architecture to fuse modalities at pre-training
The simplest way to make an LLM multimodal is by adding external modules that can readily translate between text and an arbitrary modality. For example, a transcription model (e.g., Whisper) can be connected to an LLM to translate input speech into text, or a text-to-image model can generate images based on LLM outputs.
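As a rough sketch of this pattern, the snippet below chains a speech-to-text tool into a text-only LLM. It assumes the openai-whisper package and a local model served by Ollama; the model names and audio path are illustrative placeholders, not part of the original example.

```python
# A minimal sketch of the "LLM + Tools" pattern: a speech-to-text model
# (openai-whisper) produces a transcript, which a text-only LLM then reasons over.
# Model names and the audio path are placeholder assumptions.
import whisper
import ollama

# 1) Tool: transcribe input speech to text
stt = whisper.load_model("base")
transcript = stt.transcribe("meeting_audio.mp3")["text"]

# 2) LLM: process the transcript as ordinary text
response = ollama.chat(
    model="llama3.1",  # any locally available text-only LLM would work here
    messages=[{
        "role": "user",
        "content": f"Summarize the key decisions in this meeting:\n{transcript}",
    }],
)
print(response["message"]["content"])
```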
The key benefit of this approach is simplicity. Tools can be assembled quickly without any additional model training.
The downside, however, is that the quality of such a system may be limited. Just like in a game of telephone, messages mutate as they are passed from person to person; information may degrade as it moves from one module to another via text descriptions alone.
One way to mitigate the "telephone problem" is to optimize the representations of new modalities so they align with the LLM's internal concept space. For example, ensuring that an image of a dog and a text description of one look similar to the LLM.
This is possible through adapters, a relatively small set of parameters that appropriately translate a dense vector representation for a downstream model [2][4][5].
Adapters can be trained using, for example, image-caption pairs, where the adapter learns to translate an image encoding into a representation compatible with the LLM [2][4][6]. One way to achieve this is via contrastive learning [2], which I will discuss more in the next article of this series.
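To make this concrete, below is a minimal PyTorch sketch of an adapter: a small projection MLP that maps frozen image-encoder features into the LLM's embedding space. The dimensions, class name, and training setup are illustrative assumptions, not a specific published architecture.

```python
# A minimal adapter sketch (illustrative shapes and names): a small MLP that
# projects frozen image-encoder features into the LLM's token-embedding space.
import torch
import torch.nn as nn

class VisionAdapter(nn.Module):
    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, image_features: torch.Tensor) -> torch.Tensor:
        # (batch, num_patches, vision_dim) -> (batch, num_patches, llm_dim)
        return self.proj(image_features)

# During adapter training, only these parameters are updated; the image encoder
# and the LLM stay frozen, and image-caption pairs supply the training signal.
adapter = VisionAdapter()
fake_image_features = torch.randn(2, 256, 1024)  # stand-in for encoder output
soft_tokens = adapter(fake_image_features)        # fed to the LLM as embeddings
print(soft_tokens.shape)  # torch.Size([2, 256, 4096])
```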
The benefit of using adapters to augment LLMs is better alignment between novel modality representations and the LLM's concept space in a data-efficient way. Since many pre-trained embedding, language, and diffusion models are available in today's AI landscape, one can readily fuse models based on their needs. Notable examples from the open-source community are LLaVA, LLaMA 3.2 Vision, Flamingo, MiniGPT4, Janus, Mini-Omni2, and IDEFICS [3][5][7][8].
However, this data efficiency comes at a cost. Just as adapter-based fine-tuning approaches (e.g., LoRA) can only nudge an LLM so far, the same holds in this context. Additionally, bolting various encoders and decoders onto an LLM may result in overly complicated model architectures.
The final way to make an LLM multimodal is by incorporating multiple modalities at the pre-training stage. This works by adding modality-specific tokenizers (rather than pre-trained encoder/decoder models) to the model architecture and expanding the embedding layer to accommodate the new modalities [9].
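To illustrate the idea, here is a minimal sketch (with made-up vocabulary sizes and token IDs) of how discrete image tokens from a modality-specific tokenizer can share a single embedding table with text tokens, so both modalities flow through the same transformer backbone.

```python
# A minimal sketch of the unified-model idea (illustrative sizes): discrete image
# tokens from a modality-specific tokenizer share one embedding table with text.
import torch
import torch.nn as nn

text_vocab_size = 32_000   # e.g., the LLM's original BPE vocabulary
image_vocab_size = 8_192   # e.g., codebook size of a VQ-style image tokenizer
d_model = 4096

# One embedding table covers both modalities; image token IDs are simply
# offset to live after the text IDs.
embedding = nn.Embedding(text_vocab_size + image_vocab_size, d_model)

text_ids = torch.tensor([[101, 2054, 2003]])                   # made-up text token IDs
image_ids = torch.tensor([[17, 944, 3021]]) + text_vocab_size  # offset image codes

# Interleave modalities into a single sequence for the transformer backbone.
sequence = torch.cat([text_ids, image_ids], dim=1)
hidden = embedding(sequence)  # (1, 6, 4096), processed like any other token sequence
print(hidden.shape)
```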
While this approach comes with significantly greater technical challenges and computational requirements, it enables the seamless integration of multiple modalities into a shared concept space, unlocking better reasoning capabilities and efficiencies [10].
The preeminent example of this unified approach is (arguably) GPT-4o, which processes text, image, and audio inputs to enable expanded reasoning capabilities at faster inference times than previous versions of GPT-4. Other models that follow this approach include Gemini, Emu3, BLIP, and Chameleon [9][10].
Training these models typically involves multi-step pre-training on a set of (multimodal) tasks, such as language modeling, text-image contrastive learning, text-to-video generation, and others [7][9][10].
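For intuition, below is a minimal sketch of one such objective, a CLIP-style text-image contrastive loss. The randomly generated embeddings stand in for encoder outputs, and the temperature value is an illustrative assumption.

```python
# A minimal sketch of text-image contrastive learning (CLIP-style, with made-up
# batch data): matching image/caption pairs are pulled together and mismatched
# pairs pushed apart via a symmetric cross-entropy loss over similarities.
import torch
import torch.nn.functional as F

batch, dim = 8, 512
image_emb = F.normalize(torch.randn(batch, dim), dim=-1)  # stand-in image encoder output
text_emb = F.normalize(torch.randn(batch, dim), dim=-1)   # stand-in text encoder output
temperature = 0.07

logits = image_emb @ text_emb.T / temperature  # pairwise similarity matrix
targets = torch.arange(batch)                  # i-th image matches i-th caption
loss = (F.cross_entropy(logits, targets) +
        F.cross_entropy(logits.T, targets)) / 2
print(loss.item())
```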
With a basic understanding of how LLM-based multimodal models work under the hood, let's see what we can do with them. Here, I will use LLaMA 3.2 Vision to perform various image-to-text tasks.
To run this example, download Ollama and its Python library. This enables the model to run locally, i.e., no need for external API calls.
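As a quick preview, a single image-to-text call looks roughly like the snippet below. It assumes the llama3.2-vision model has already been pulled via Ollama, and the image path is a placeholder.

```python
# A short sketch of an image-to-text call with LLaMA 3.2 Vision through the
# Ollama Python library (run `ollama pull llama3.2-vision` first).
# The image path is a placeholder.
import ollama

response = ollama.chat(
    model="llama3.2-vision",
    messages=[{
        "role": "user",
        "content": "Describe what is shown in this image.",
        "images": ["example_plot.png"],
    }],
)
print(response["message"]["content"])
```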
The full example code is freely available on GitHub.