VLM Architectures
repo: gokayfem/awesome-vlm-architectures
category: Theory
related: Deep Learning · Computer Vision
👁️‍🗨️ Awesome VLM Architectures
Vision-Language Models (VLMs) feature a multimodal architecture that processes image and text data simultaneously. They can perform tasks such as Visual Question Answering (VQA), image captioning, and text-to-image search. VLMs utilize techniques like multimodal fusion with cross-attention, masked-language modeling, and image-text matching to relate visual semantics to textual representations. This repository contains information on notable Vision Language Models (VLMs), including details about their architectures, training procedures, and the datasets used for training. Click to expand each architecture for further details.
- <a href="https://github.com/gokayfem/ComfyUI_VLM_nodes">Visit my other repo to try Vision Language Models on ComfyUI</a>
Contents
Models
LLaVA | LLaVA 1.5 | LLaVA 1.6 | PaliGemma | PaliGemma 2 | AIMv2 | Apollo | ARIA | EVE | EVEv2 | Janus-Pro | LLaVA-CoT | LLM2CLIP | Maya | MiniMax-01 | NVLM | OmniVLM | Pixtral 12B | Sa2VA | Tarsier2 | UI-TARS | VideoChat-Flash | VideoLLaMA 3 | Llama 3.2-Vision | SmolVLM | IDEFICS | IDEFICS2 | IDEFICS3-8B | InternLM-XComposer2 | InternLM-XComposer2-4KHD | InternLM-XComposer-2.5 | InternVL 2.5 | DeepSeek-VL | DeepSeek-VL2 | MANTIS | Qwen-VL | Qwen2-VL | Qwen2.5-VL | moondream1 | moondream2 | Moondream-next | SPHINX-X | BLIP | BLIP-2 | xGen-MM (BLIP-3) | InstructBLIP | KOSMOS-1 | KOSMOS-2 | ConvLLaVA | Parrot | OMG-LLaVA | EVLM | SlowFast-LLaVA | Nous-Hermes-2-Vision - Mistral 7B | TinyGPT-V | CoVLM | GLaMM | COSMO | FireLLaVA | u-LLaVA | MoE-LLaVA | BLIVA | MobileVLM | FROZEN | Flamingo | OpenFlamingo | PaLI | PaLI-3 | PaLM-E | MiniGPT-4 | MiniGPT-v2 | LLaVA-Plus | BakLLaVA | CogVLM | CogVLM2 | Ferret | Fuyu-8B | OtterHD | SPHINX | Eagle 2 | EAGLE | VITA | LLaVA-OneVision | MiniCPM-o-2.6 | MiniCPM-V | INF-LLaVA | Florence-2 | MULTIINSTRUCT | MouSi | LaVIN | CLIP | MetaCLIP | Alpha-CLIP | GLIP | ImageBind | SigLIP | ViT
Tools
| Tool | Description |
|---|---|
| DualView | Free side-by-side comparison tool for VLM outputs, images, videos, and AI prompts |
Architectures
LLaVA: Large Language and Vision Assistant - Visual Instruction Tuning
LLaVA seamlessly integrates a pre-trained language model (Vicuna) with a visual encoder (CLIP) using a simple linear layer, creating a robust architecture capable of effectively processing and understanding language-image instructions.
Haotian Liu, Chunyuan Li, Qingyang Wu, Yong Jae Lee
<p align="center">
<img src="https://github.com/gokayfem/Awesome-VLM-Architectures/assets/88277926/722f0fbb-ea52-4a8a-ab1e-bec45ca7d04f" />
</p>
<details>
<summary>ℹ️ <i>More Information</i></summary>
LLaVA: At the heart of LLaVA's architecture is the fusion of a pre-trained language model with a visual model, specifically designed to process and understand language-image instruction data effectively. This integration enables LLaVA to leverage the distinct strengths of both models, employing the CLIP visual encoder for robust image feature extraction and the Vicuna language model for intricate language instruction processing. A noteworthy feature of this architecture is the use of a simple linear layer that bridges image features to the word embedding space, facilitating a seamless alignment between visual and linguistic representations. The training methodology of LLaVA is meticulously structured into a two-stage instruction-tuning procedure. Initially, the model undergoes pre-training focused on feature alignment, utilizing a carefully filtered dataset to synchronize image features with LLM word embeddings. Subsequently, the model is fine-tuned end-to-end on tailored tasks such as multimodal chatbot functionalities and Science QA, with the aim of refining its instruction-following prowess. This sophisticated training regimen is underpinned by the use of multimodal instruction-following data generated via GPT-4, converting image-text pairs into formats conducive to instruction-following tasks. The alignment of text and image data is innovatively achieved through a trainable projection matrix, converting visual features into language embedding tokens within a unified dimensional space, thereby enhancing the model's ability to encode vision and text cohesively. The datasets deployed for LLaVA's training and evaluation are strategically selected to bolster its multimodal capabilities. The Filtered CC3M dataset serves as the foundation for pre-training, aligning visual and language features, while the LLaVA-Instruct-158K dataset generated using GPT-4 is pivotal for fine-tuning the model on diverse multimodal tasks. 
Additionally, the ScienceQA dataset plays a critical role in assessing LLaVA's proficiency in multimodal reasoning tasks, demonstrating the model's comprehensive training and its potential to significantly advance the field of multimodal interaction and understanding. </details>
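The trainable projection described above can be sketched in a few lines: one matrix maps frozen CLIP patch features into the LLM's word-embedding space, and the resulting visual tokens are prepended to the text embeddings. A minimal sketch with illustrative dimensions and random stand-in features, not the released weights.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: CLIP ViT-L/14 at 224px yields 256 patch features of dim
# 1024; Vicuna's embedding dim is 4096. Values here are random placeholders.
num_patches, vision_dim, llm_dim = 256, 1024, 4096
W = rng.normal(0, 0.02, (vision_dim, llm_dim))      # trainable projection matrix

image_features = rng.normal(size=(num_patches, vision_dim))  # from frozen CLIP
visual_tokens = image_features @ W                  # into word-embedding space

text_embeddings = rng.normal(size=(32, llm_dim))    # embedded instruction tokens
llm_input = np.concatenate([visual_tokens, text_embeddings], axis=0)
print(llm_input.shape)                              # (288, 4096)
```

During stage-one feature alignment only `W` is trained; everything else stays frozen.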
LLaVA 1.5: Improved Baselines with Visual Instruction Tuning
LLaVA 1.5 enhances its multimodal understanding by replacing its initial linear projection with a more powerful multi-layer perceptron (MLP), enabling a deeper integration of visual features from CLIP-ViT-L-336px and linguistic data.
Haotian Liu, Chunyuan Li, Yuheng Li, Yong Jae Lee
<p align="center">
<img src="https://github.com/gokayfem/Awesome-VLM-Architectures/assets/88277926/c7112b75-3b86-48a2-9c0f-f1dc1dc6ee06" />
</p>
<details>
<summary>ℹ️ <i>More Information</i></summary>
LLaVA 1.5: This iteration introduces a refined architecture that incorporates a CLIP-ViT-L-336px vision encoder alongside a multi-layer perceptron (MLP) projection layer. This combination not only boosts the model's data efficiency but also its performance across various benchmarks, showcasing a leap in multimodal understanding. The architecture's core components, the CLIP-ViT-L for visual encoding and the MLP for vision-language cross-modal connection, work synergistically to enhance the model's capacity to integrate and interpret visual and linguistic inputs. Training methods have been optimized in LLaVA 1.5 to achieve unprecedented performance on 11 benchmarks, utilizing a two-stage approach that emphasizes efficient feature alignment and fine-tuning with VQA data specifically tailored for academic tasks. The paper highlights a shift towards more sophisticated multimodal alignment techniques, replacing the original linear projection with a more powerful MLP vision-language connector. This strategic improvement facilitates a deeper and more nuanced integration of visual and linguistic data. Moreover, the adoption of an MLP-based vision-language connector for alignment fusion methods further strengthens the model's ability to merge visual and textual representations effectively, ensuring closer alignment in the embedding space. The utilization of datasets such as VQA-v2, GQA, and other academic-task-oriented VQA datasets, enriched with OCR and region-level perception data, underscores the model's enhanced visual understanding and reasoning capabilities. These datasets play a crucial role in elevating LLaVA 1.5's performance, enabling it to set new standards with academic-task-oriented data. Through these advancements, LLaVA 1.5 not only pushes the boundaries of multimodal learning but also sets a new benchmark for future research in the field. </details>
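The connector upgrade amounts to replacing the single linear map with a two-layer MLP and a nonlinearity. A shape-level sketch, assuming illustrative dimensions and a tanh-approximated GELU; the real connector is a trained PyTorch module.

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU, as commonly used in transformer code
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

rng = np.random.default_rng(0)
vision_dim, llm_dim = 1024, 4096                  # illustrative sizes
W1 = rng.normal(0, 0.02, (vision_dim, llm_dim))   # first connector layer
W2 = rng.normal(0, 0.02, (llm_dim, llm_dim))      # second connector layer

patches = rng.normal(size=(576, vision_dim))      # CLIP-ViT-L-336px: 576 patches
visual_tokens = gelu(patches @ W1) @ W2           # deeper vision-language bridge
print(visual_tokens.shape)                        # (576, 4096)
```

The extra layer lets the connector learn a nonlinear alignment rather than a single affine map.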
LLaVA 1.6: LLaVA-NeXT Improved reasoning, OCR, and world knowledge
LLaVA-NeXT advances on LLaVA-1.5 by incorporating high-resolution image processing, enhancing visual reasoning and OCR capabilities, while maintaining a data-efficient design through knowledge transfer from its predecessor and a refined training process.
Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, Yong Jae Lee
<p align="center">
<img src="https://github.com/gokayfem/Awesome-VLM-Architectures/assets/88277926/032ef144-ec10-41da-80a1-2cecd66c86ee" />
</p>
<details>
<summary>ℹ️ <i>More Information</i></summary>
LLaVA-NeXT: Represents a significant step forward in the evolution of large language models with visual capabilities, building upon the foundations laid by LLaVA-1.5. This model introduces several enhancements aimed at improving image resolution, visual reasoning, optical character recognition (OCR), and the integration of world knowledge, all while retaining the minimalist and data-efficient design of its predecessor. The architecture of LLaVA-NeXT is optimized for high performance, supporting input image resolutions up to 672x672, 336x1344, and 1344x336 pixels. This improvement facilitates a more detailed visual perception, which, coupled with an enhanced visual instruction tuning data mixture, significantly bolsters the model's reasoning and OCR capabilities. Furthermore, LLaVA-NeXT achieves efficient deployment through the use of SGLang, a feature that underscores its design's focus on performance and data efficiency. Training LLaVA-NeXT requires less than 1 million visual instruction tuning samples, leveraging the pre-trained connector from LLaVA-1.5 for efficient knowledge transfer. The training process, remarkably swift, utilizes 32 A100 GPUs and completes in approximately one day, a testament to the model's efficient design and deployment strategy. The alignment techniques in LLaVA-NeXT are particularly noteworthy, utilizing high-resolution images and a high-quality data mixture to enhance the model's capabilities in visual conversation and instruction following. The model's use of dynamic high-resolution techniques, known as 'AnyRes', allows for effective handling of images with varying resolutions, improving the model's overall visual understanding. The datasets employed in training LLaVA-NeXT, including LAION-GPT-V, ShareGPT-4V, DocVQA, SynDog-EN, ChartQA, DVQA, and AI2D, are meticulously chosen to augment the model's visual reasoning, OCR capabilities, and comprehension of charts and diagrams. 
This strategic selection aims to elevate the model's performance across a wide range of multimodal tasks, emphasizing its enhanced ability to process and understand complex visual information. Through these improvements, LLaVA-NeXT sets a new benchmark for models at the intersection of language and vision, offering unprecedented capabilities in visual reasoning, OCR, and the application of world knowledge in multimodal contexts. </details>
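The 'AnyRes' idea above can be illustrated as tiling a high-resolution image into base-resolution crops plus a downscaled global view, with each view encoded independently. The `anyres_tiles` helper, the grid choices, and the strided-subsampling "global view" are illustrative simplifications, not LLaVA-NeXT's exact implementation.

```python
import numpy as np

def anyres_tiles(image, base=336):
    # Split the image into base x base crops (assumes dims are multiples of base)
    h, w, _ = image.shape
    tiles = [image[i:i + base, j:j + base]
             for i in range(0, h, base)
             for j in range(0, w, base)]
    # Crude global view: strided subsampling down to base x base
    global_view = image[::h // base, ::w // base][:base, :base]
    return tiles + [global_view]

img = np.zeros((672, 672, 3))          # one of the supported input shapes
views = anyres_tiles(img)
print(len(views))                      # 4 local tiles + 1 global view = 5
```

Each view is then passed through the same vision encoder, and the resulting token sets are concatenated for the LLM.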
PaliGemma: A Versatile and Transferable 3B Vision-Language Model
PaliGemma is a compact, open-source vision-language model designed to be easily transferable to a diverse range of tasks. It combines a powerful SigLIP image encoder with the Gemma-2B language model, achieving strong performance on over 40 diverse tasks, including standard VLM benchmarks, remote-sensing, and segmentation. PaliGemma is pretrained using a multi-stage approach, focusing on maximizing the density of learning signal and providing different checkpoints with varying image resolutions. This versatile foundation model is easily fine-tuned for specific tasks and serves as a valuable tool for researchers and practitioners exploring the capabilities of VLMs.
Lucas Beyer, Andreas Steiner, André Susano Pinto, Alexander Kolesnikov, Xiao Wang, Daniel Salz, Maxim Neumann, Ibrahim Alabdulmohsin, Michael Tschannen, Emanuele Bugliarello, Thomas Unterthiner, Daniel Keysers, Skanda Koppula, Fangyu Liu, Adam Grycner, Alexey Gritsenko, Neil Houlsby, Manoj Kumar, Keran Rong, Julian Eisenschlos, Rishabh Kabra, Matthias Bauer, Matko Bošnjak, Xi Chen, Matthias Minderer, Paul Voigtlaender, Ioana Bica, Ivana Balazevic, Joan Puigcerver, Pinelopi Papalampidi, Olivier Henaff, Xi Xiong, Radu Soricut, Jeremiah Harmsen, Xiaohua Zhai
<p align="center"> <img src="https://github.com/user-attachments/assets/186371d0-6861-4b68-b32e-fee77cc75ef2" /> </p>
<details> <summary>ℹ️ <i>More Information</i></summary>
PaliGemma stands out as a highly versatile and transferable 3-billion parameter Vision-Language Model (VLM) meticulously designed for broad applicability across a wide spectrum of visual-language tasks. Its foundation lies in the integration of two powerful components: a SigLIP-So400m vision encoder, known for its exceptional performance despite its compact size, and the Gemma-2B language model, a pretrained autoregressive decoder-only model from the Gemma family. This combination enables PaliGemma to effectively process and understand both visual and textual information, making it adept at handling tasks ranging from image captioning and visual question answering to more specialized tasks like remote-sensing and segmentation. PaliGemma's architecture is streamlined and efficient. It uses a simple linear projection to align the visual features extracted by the SigLIP encoder with the vocabulary tokens of the Gemma language model, enabling seamless fusion of the two modalities. A key aspect of PaliGemma's training is the emphasis on "density of learning signal," prioritizing a broad range of skills and knowledge over achieving high zero-shot performance. This approach involves a multi-stage pretraining process that starts with unimodal pretraining of individual components using publicly available checkpoints, followed by extensive multimodal pretraining on a diverse mixture of large-scale vision-language tasks. Notably, PaliGemma deviates from the common practice of freezing the image encoder during pretraining, allowing it to learn spatial and relational understanding from complex tasks like captioning. To further enhance its capabilities, PaliGemma undergoes a resolution increase stage, where it is trained on higher-resolution images, enabling it to handle tasks that benefit from finer visual details. 
This multi-stage pretraining process results in a family of three PaliGemma checkpoints at varying image resolutions (224px, 448px, and 896px), each pretrained with broad visual knowledge. These checkpoints serve as strong base models that can be easily transferred to specific downstream tasks. PaliGemma's transferability is demonstrated through its impressive performance on over 30 academic benchmarks, including those involving multiple images, such as NLVR2 and short-video understanding tasks. The model's ability to adapt quickly to new tasks with minimal fine-tuning highlights its versatility and makes it a valuable tool for exploring and advancing the capabilities of VLMs. Furthermore, the model's open-source nature, along with its straightforward architecture and training recipe, encourages further research and experimentation within the VLM community, driving progress towards more powerful and general-purpose multimodal AI systems. </details>
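PaliGemma processes the image tokens and the text prompt as a "prefix" with full bidirectional attention, while the answer is generated causally (a prefix-LM). A sketch of that attention mask (1 = may attend); the token counts are illustrative.

```python
import numpy as np

def prefix_lm_mask(n_prefix, n_suffix):
    # Causal baseline: position i may attend to positions j <= i
    n = n_prefix + n_suffix
    mask = np.tril(np.ones((n, n), dtype=int))
    # Prefix (image + prompt) tokens attend to each other bidirectionally,
    # and suffix (answer) tokens see the whole prefix
    mask[:, :n_prefix] = 1
    return mask

m = prefix_lm_mask(n_prefix=4, n_suffix=3)
print(m)
```

Note the prefix never attends to the suffix, so generation remains autoregressive over the answer only.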
PaliGemma 2: A Family of Versatile VLMs for Transfer
PaliGemma 2 is an upgraded family of open Vision-Language Models (VLMs) based on Gemma 2 language models, combined with the SigLIP-So400m vision encoder. It offers models in three sizes (3B, 10B, 28B) and three resolutions (224px², 448px², 896px²), trained in multiple stages for broad knowledge transfer. PaliGemma 2 achieves state-of-the-art results on various tasks, including OCR-related challenges like table/molecular/music score recognition, and long-form captioning.
Andreas Steiner, André Susano Pinto, Michael Tschannen, Daniel Keysers, Xiao Wang, Yonatan Bitton, Alexey Gritsenko, Matthias Minderer, Anthony Sherbondy, Shangbang Long, Siyang Qin, Reeve Ingle, Emanuele Bugliarello, Sahar Kazemzadeh, Thomas Mesnard, Ibrahim Alabdulmohsin, Lucas Beyer and Xiaohua Zhai
<p align="center"> <img src="https://github.com/user-attachments/assets/4ce402be-d644-4143-a57c-9e7f4d811d95" width="600"/> </p>
<details> <summary>ℹ️ <i>More Information</i></summary>
PaliGemma 2 closely follows the architecture of its predecessor, PaliGemma. It uses a pre-trained SigLIP-So400m vision encoder. The embeddings from this encoder are mapped to the input space of the Gemma 2 language model using a linear projection. The combined visual and text embeddings are then fed into the Gemma 2 model, which autoregressively generates the output. The model comes in three size variants (2B, 9B, and 27B parameters in the Gemma 2 component, corresponding to 3B, 10B, and 28B total parameters) and is trained at three resolutions (224x224, 448x448, and 896x896 pixels). This allows for analysis of the interplay between model size, resolution, and transfer performance. The input image tokens are concatenated with the input text tokens, and Gemma 2 autoregressively completes this prefix with an answer. PaliGemma 2's training follows a three-stage approach, similar to the original PaliGemma: Stage 1: The pre-trained SigLIP-So400m and Gemma 2 checkpoints are combined and trained jointly on a multimodal task mixture of 1 billion examples. The image resolution is 224px². Stage 2: Training continues for 50 million examples at 448px² resolution, then for 10 million examples at 896px². Tasks benefiting from higher resolution are upweighted. Stage 3: Fine-tuning the checkpoints from stage 1 or 2 on the target tasks. The training data mixture includes captioning, grounded captioning, OCR, visual question answering (VQA), detection, and instance segmentation. Notably, the training data relies heavily on machine-generated labels from publicly available specialist models, avoiding the use of large commercial VLMs for label generation. Gemma 2 Language Models: The core upgrade is the use of the more recent and capable Gemma 2 family of language models, replacing the original Gemma model in PaliGemma. Resolution and Model Size Scaling: PaliGemma 2 systematically explores the impact of both image resolution and language model size on transfer performance. 
This is a key contribution, as most prior work did not jointly study these factors with consistent training recipes. </details>
AIMv2: Multimodal Autoregressive Pre-training of Large Vision Encoders
AIMv2 is a family of generalist vision encoders that autoregressively generates both image patches and text tokens, achieving state-of-the-art performance in multimodal image understanding and strong results in vision benchmarks like localization, grounding, and classification, demonstrating scalability and efficiency.
Enrico Fini, Mustafa Shukor, David Haldimann, Sai Aitharaju, Alexander Toshev, Marcin Eichner, Moin Nabi, Xiujun Li, Philipp Dufter, Michal Klein, Victor G. Turrisi da Costa, Louis Béthune, Zhe Gan, Alaaeldin El-Nouby
<p align="center"> <img src="https://github.com/user-attachments/assets/c89a0be9-8743-4800-8d3c-ec51a4c99f4d" width="600"/> </p>
<details> <summary>ℹ️ <i>More Information</i></summary>
AIMv2 (Autoregressive Image Models v2) introduces a novel pre-training method for large-scale vision encoders that extends autoregressive pre-training to a multimodal setting, encompassing both images and text. The core architecture pairs a Vision Transformer (ViT) encoder with a causal multimodal decoder. The vision encoder processes raw image patches (using prefix attention), while the multimodal decoder autoregressively generates both image patches (using pixel MSE loss) and text tokens (using cross-entropy loss). Crucially, image patches and text tokens are treated as a single, unified sequence. This allows the model to learn a joint representation of visual and textual information. The image is always prepended to the beginning of the text sequence. The training process is streamlined and efficient. It resembles that of AIM and LLMs, relying solely on the autoregressive objective. No specialized inter-batch communication methods or excessively large batch sizes are required. This contrasts with contrastive methods (e.g., CLIP, SigLIP), which are often more challenging to train and scale. The training data consists of a mixture of publicly available (DFN-2B, COYO) and proprietary datasets (HQITP), comprising both alt-text and synthetic captions. AIMv2 demonstrates strong scaling properties, consistently improving performance with increased data or model parameters. The model family includes variants ranging from 300 million to 3 billion parameters. A key optimization is the use of prefix attention within the vision encoder, enabling bidirectional attention during inference without fine-tuning. Other architectural choices include the incorporation of SwiGLU and RMSNorm, inspired by recent successes in language modeling. AIMv2 excels in a variety of tasks. It performs favorably on multimodal understanding benchmarks compared to state-of-the-art vision-language pre-trained methods.
It also exhibits strong performance on open-vocabulary object detection and referring expression comprehension, surpassing DINOv2. Additionally, it achieves impressive recognition performance with a frozen trunk. The model supports native image resolution and adaptation to zero-shot recognition, demonstrating its flexibility. Post-training strategies, including high-resolution adaptation, further enhance the model's capabilities. Ablation studies demonstrate the importance of joint image and text modeling, validate design choices, and explore scaling characteristics. </details>
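The two-part objective above (pixel MSE on image patches, cross-entropy on text tokens, over one unified sequence) can be sketched with toy random "predictions"; shapes and the equal loss weighting are illustrative, not the paper's exact settings.

```python
import numpy as np

rng = np.random.default_rng(0)

# Image half of the sequence: regress raw pixel values of the next patch
patch_true = rng.normal(size=(16, 48))             # 16 patches, 48 pixels each
patch_pred = rng.normal(size=(16, 48))
mse = np.mean((patch_pred - patch_true) ** 2)      # autoregressive pixel loss

# Text half of the sequence: predict the next token id
vocab, text_len = 100, 8
logits = rng.normal(size=(text_len, vocab))
targets = rng.integers(0, vocab, size=text_len)
logp = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
ce = -np.mean(logp[np.arange(text_len), targets])  # autoregressive text loss

loss = mse + ce                                    # single joint objective
```

Because both terms are plain next-element prediction, training needs none of the large-batch machinery of contrastive objectives.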
Apollo: An Exploration of Video Understanding in Large Multimodal Models
Apollo is a state-of-the-art family of Large Multimodal Models (LMMs) designed for video understanding, achieving superior performance across different model sizes by leveraging "Scaling Consistency" and exploring video-specific aspects like sampling, architectures, data composition, and training schedules. Apollo-7B is state of the art among models of its size, and Apollo-3B outperforms most existing 7B models.
Orr Zohar, Xiaohan Wang, Yann Dubois, Nikhil Mehta, Tong Xiao, Philippe Hansen-Estruch, Licheng Yu, Xiaofang Wang, Felix Juefei-Xu, Ning Zhang, Serena Yeung-Levy, Xide Xia
<p align="center"> <img src="https://github.com/user-attachments/assets/9222064a-d7a3-4e6b-a14d-bc9a5c679450" width="600" /> </p>
<details> <summary>ℹ️ <i>More Information</i></summary>
Apollo leverages the Qwen2.5 series of Large Language Models (LLMs) with 1.5B, 3B, and 7B parameters. The key architectural innovation is the combination of a SigLIP-SO400M image encoder and an InternVideo2 video encoder. Features from both encoders are interpolated and concatenated channel-wise before being fed into a Perceiver Resampler, which outputs 32 tokens per frame. This combination was empirically found to be superior to other encoder choices. The model uses a 3-stage training approach. Critically, the paper introduces the concept of "Scaling Consistency," demonstrating that design decisions made on smaller models and datasets (up to a critical size) effectively transfer to larger models. This allows for more efficient experimentation. The paper also advocates for frames-per-second (fps) sampling during training, as opposed to uniform frame sampling, and demonstrates its superiority. The optimal number of tokens is 8-32 per frame. It also includes a curated benchmark, ApolloBench, that reduces evaluation time by 41x compared to existing benchmarks while maintaining high correlation and focusing on temporal reasoning and perception. The exploration also covers token resampling, showing that Perceiver resampling performs well. Token integration is also discussed: adding tokens (text, learned, etc.) between the video tokens derived from different frames or clips is sufficient for efficient token integration. Training stages are also discussed, concluding that progressively unfreezing the different components in different stages leads to superior model training dynamics. Finally, training the video encoder is discussed: the paper concludes that fine-tuning video encoders on video-only data further improves overall performance, especially on reasoning and domain-specific tasks. Data composition is also studied. 
It concludes that data mixture matters: including a moderate amount of text data and maintaining a slightly video-heavy mix leads to optimal performance.
</details>
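The encoder fusion described above (resize both token grids to a common length, concatenate channel-wise, resample to 32 tokens per frame) can be sketched as follows. Shapes are illustrative, `interpolate_tokens` is a stand-in for real interpolation, and mean-pooling replaces the actual Perceiver Resampler.

```python
import numpy as np

rng = np.random.default_rng(0)

siglip = rng.normal(size=(729, 1152))       # SigLIP-SO400M tokens for one frame
internvideo = rng.normal(size=(256, 768))   # InternVideo2 tokens for one frame

def interpolate_tokens(x, n):
    # Nearest-neighbor resize of a token sequence to length n
    idx = np.linspace(0, len(x) - 1, n).round().astype(int)
    return x[idx]

n = 512
fused = np.concatenate([interpolate_tokens(siglip, n),
                        interpolate_tokens(internvideo, n)], axis=1)
# Stand-in for the Perceiver Resampler: pool down to 32 tokens per frame
frame_tokens = fused.reshape(32, n // 32, -1).mean(axis=1)
print(frame_tokens.shape)                   # (32, 1920)
```

Channel-wise concatenation keeps both encoders' features for every spatial position, unlike token-wise interleaving.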
ARIA: An Open Multimodal Native Mixture-of-Experts Model
ARIA is an open-source, multimodal native Mixture-of-Experts (MoE) model designed to seamlessly integrate and understand diverse modalities like text, code, images, and video, achieving state-of-the-art performance in its class. It features a fine-grained MoE decoder for efficient parameter utilization, a lightweight visual encoder, and a 4-stage training pipeline that builds capabilities in language understanding, multimodal comprehension, long context handling, and instruction following.
Dongxu Li, Yudong Liu, Haoning Wu, Yue Wang, Zhiqi Shen, Bowen Qu, Xinyao Niu, Fan Zhou, Chengen Huang, Yanpeng Li, Chongyan Zhu, Xiaoyi Ren, Chao Li, Yifan Ye, Peng Liu, Lihuan Zhang, Hanshu Yan, Guoyin Wang, Bei Chen, Junnan Li
<p align="center"> <img src="https://github.com/user-attachments/assets/efe4a7ba-756a-4da8-b261-5a0093f28b03" width="600"/> </p>
<details> <summary>ℹ️ <i>More Information</i></summary>
ARIA's architecture is centered around a fine-grained Mixture-of-Experts (MoE) decoder, which is more efficient than traditional dense decoders. This MoE approach activates 3.5B parameters per text token and 3.9B per visual token, out of a total of 24.9B parameters. The model uses 66 experts in each MoE layer, with 2 shared across all inputs for common knowledge, and 6 activated per token by a router. The visual encoder is a lightweight (438M parameter) Vision Transformer (ViT) combined with a projection module. The ViT processes images at various resolutions (medium, high, and ultra-high), preserving aspect ratios. The projection module uses cross-attention and an FFN layer to convert image embeddings into visual tokens, which are then integrated with text tokens by the MoE. ARIA's training uses a 4-stage pipeline: (1) Language pre-training (6.4T text tokens, 8K context window); (2) Multimodal pre-training (400B multimodal tokens, including interleaved image-text, synthetic image captions, document transcriptions and QA, video captions and QA); (3) Multimodal long-context pre-training (extending context to 64K tokens); and (4) Multimodal post-training (instruction following with 20B tokens). The data curation process is rigorous, incorporating techniques like de-duplication, quality filtering, and data clustering. The training infrastructure avoids pipeline parallelism, using a combination of expert parallelism and ZeRO-1 data parallelism, which contributes to efficient training without the need for tensor parallelism. A load-balancing loss and z-loss are used to stabilize training. The paper demonstrates that, despite having modality-generic experts, ARIA naturally develops expert specialization during pre-training. Analysis of expert activation shows distinct visual specialization in several layers, particularly for image, video, and PDF content. 
ARIA also shows excellent performance in handling long-context multimodal data, surpassing other open models and competing favorably with proprietary models in tasks like long video and document understanding. </details>
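The routing scheme above (66 experts per MoE layer, 2 shared experts always active, 6 more chosen per token by a router) reduces to simple top-k bookkeeping. A sketch of the selection logic only; the expert networks themselves are omitted, and the router scores are random placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

n_experts, n_shared, top_k = 66, 2, 6
# Router produces one score per non-shared expert for this token
router_logits = rng.normal(size=n_experts - n_shared)

# Top-6 routed experts, offset past the shared experts' indices
routed = np.argsort(router_logits)[-top_k:] + n_shared
# Shared experts 0 and 1 always fire; 8 experts are active per token
active = sorted([0, 1] + routed.tolist())
print(len(active))
```

Only these 8 experts' parameters are exercised for the token, which is how 24.9B total parameters yield ~3.5B activated per text token.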
EVE: Unveiling Encoder-Free Vision-Language Models
EVE is an encoder-free vision-language model (VLM) that directly processes images and text within a unified decoder-only architecture, eliminating the need for a separate vision encoder. It achieves competitive performance with encoder-based VLMs of similar size on multiple vision-language benchmarks using only 35M publicly accessible data, with the model efficiently handling high-resolution images with arbitrary aspect ratios.
Haiwen Diao, Yufeng Cui, Xiaotong Li, Yueze Wang, Huchuan Lu, Xinlong Wang
<p align="center">
<img src="https://github.com/user-attachments/assets/c10e987d-9e11-41d7-968c-617b60d3b0bd" width="600"/>
</p>
<details> <summary>ℹ️ <i>More Information</i></summary>
EVE (Encoder-free Vision-language modEl): This model distinguishes itself by completely removing the vision encoder component typically found in VLMs. Instead, it directly integrates visual information into a decoder-only architecture (based on Vicuna-7B). This is achieved through a novel Patch Embedding Layer (PEL) that processes image patches directly, combined with a Patch Aligning Layer (PAL) that facilitates learning from a pre-trained vision encoder (CLIP-ViT-L/14) without updating the encoder itself. Crucially, EVE does not use a traditional image encoder during inference. The PEL uses a convolution layer and average pooling to create 2D feature maps from the input image. It then employs cross-attention (CA1) within a limited receptive field to enhance these features. A special <CLS> token provides a holistic view of each patch feature, and a learnable newline token <SPL> is inserted after each row of patch features to represent the 2D structure. The PAL aligns EVE's patch features with those from a frozen, pre-trained vision encoder (CLIP-ViT-L/14). This is done hierarchically, aggregating features across multiple layers of the decoder and using a layer-wise cross-attention (CA3) mechanism. A Mean Squared Error (MSE) loss between EVE's features and the vision encoder's features encourages alignment. This "implicit" supervision from the vision encoder improves visual understanding. Importantly, PAL is only used during training, not inference. The training process occurs in three stages: LLM-guided Pre-training: Only the PEL and PAL are trained, aligning the visual features with the frozen LLM (Vicuna-7B). This stage uses a subset (16M) of the total training data. Generative Pre-training: The entire model (including the LLM) is trained, using the full 33M dataset. Both text prediction (cross-entropy loss) and visual alignment (MSE loss) are used. 
Supervised Fine-tuning: The entire model is fine-tuned on instruction-following datasets (LLaVA-mix-665K and others). The key innovations that allow EVE to work well without a vision encoder are: LLM-Centric Pre-alignment: Stage 1 is critical for preventing model collapse and accelerating convergence. Aligning visual features before fully training the LLM is essential. Vision Recognition Capability via Extra Supervision: The PAL provides supervision from a pre-trained vision encoder during training, which enhances visual understanding without requiring the encoder during inference. Flexible Input Handling: The architecture naturally handles images of arbitrary aspect ratios and resolutions, without needing resizing, padding, or partitioning. No reliance on a vision encoder: images are fed directly into the LLM.
EVE uses a curated dataset of 33M publicly available image-text pairs for pre-training, with captions generated by Emu2 and LLaVA-1.5. Supervised fine-tuning utilizes datasets like LLaVA-mix-665K, AI2D, DocVQA, and others.
</details>
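The `<SPL>` newline trick above, which lets a 1D decoder see the 2D patch layout, can be sketched by appending a learnable "newline" token after each row of patch features before flattening. Grid size and feature dim are illustrative, and the conv/cross-attention steps of the PEL are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

grid_h, grid_w, dim = 4, 4, 64
patch_grid = rng.normal(size=(grid_h, grid_w, dim))  # after conv + pooling
spl_token = rng.normal(size=(1, dim))                # learnable <SPL> newline

# Flatten row by row, inserting the newline token at each row boundary
rows = [np.concatenate([patch_grid[r], spl_token]) for r in range(grid_h)]
sequence = np.concatenate(rows, axis=0)
print(sequence.shape)                                # (20, 64): 4 * (4 + 1)
```

Because row width can vary per image, this encoding handles arbitrary aspect ratios without resizing or padding.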
EVEv2: Improved Baselines for Encoder-Free Vision-Language Models
EVEv2 represents a significant advancement in encoder-free vision-language models (VLMs), addressing limitations of previous approaches by introducing a "Divide-and-Conquer" architecture that maximizes scaling efficiency, reduces inter-modality interference, and achieves strong performance with superior data efficiency.
Haiwen Diao, Xiaotong Li, Yufeng Cui, Yueze Wang, Haoge Deng, Ting Pan, Wenxuan Wang, Huchuan Lu, Xinlong Wang
<p align="center">
<img src="https://github.com/user-attachments/assets/23a33fe6-d4c5-4a9d-b45f-f5612f7848a5" width="600"/>
</p>
<details> <summary>ℹ️ <i>More Information</i></summary>
EVEv2 departs from the traditional encoder-based VLM approach. Instead of relying on a pre-trained vision encoder (like CLIP), it builds visual perception directly within a decoder-only Large Language Model (LLM). Key architectural features include: Divide-and-Conquer: This is the core innovation. Instead of mixing visual and textual information throughout the entire LLM, EVEv2 introduces modality-specific components. This means separate attention matrices (query, key, value), Layer Normalization layers, and Feed-Forward Networks for visual and textual tokens. This reduces interference and allows for more efficient learning. It's a fully sparse, decoder-only architecture. Patch Embedding Layer: A minimalist patch embedding layer is learned from scratch. This avoids the inductive biases of pre-trained vision encoders. It uses two convolutional layers (Conv1 and Conv2) to process image patches. Lossless Encoding: Unlike some encoder-free models that use discrete tokenization (which can lose information), EVEv2 aims for lossless encoding of visual information. LLM Adaptation: The architecture is designed for seamless adaptation to existing LLMs. The paper experiments with Vicuna-7B and Qwen2-7B. Multi-Stage Training: A four-stage training process is used: LLM-guided Pre-aligning: Only the patch embedding layer is trained, using re-captioned web data (EVE-recap-10M). The LLM is frozen. This establishes a basic alignment between visual and textual representations. Vision Perception Learning: Vision layers within the LLM are trained, using progressively larger datasets and image resolutions. The LLM weights are still frozen. Vision-Text Fully Aligning: The entire network is updated. Supervised Fine-tuning (SFT): The entire model is fine-tuned on question-answering and instruction-following datasets. DenseFusion++: A new, efficient captioning engine is introduced to generate high-quality image-text pairs for training. 
This is crucial for building strong visual perception from scratch. It leverages multiple vision experts. Data Efficiency: A key focus of the research is demonstrating that EVEv2 can achieve strong performance with less data than comparable encoder-based models, thanks to its efficient architecture. </details>
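The divide-and-conquer layout described above can be sketched in a few lines: visual and textual tokens share one sequence, but each modality gets its own normalization and feed-forward weights. This is an illustrative NumPy sketch under assumed shapes, not the paper's implementation (names like `modality_specific_block` are invented here).

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def ffn(x, w1, w2):
    return np.maximum(x @ w1, 0.0) @ w2  # simple ReLU MLP, stand-in for the real FFN

def modality_specific_block(tokens, is_visual, params):
    """Apply per-modality norm + FFN to a [seq, dim] token matrix (sketch)."""
    out = np.empty_like(tokens)
    for name, mask in [("vis", is_visual), ("txt", ~is_visual)]:
        w1, w2 = params[name]          # separate weights per modality
        h = layer_norm(tokens[mask])   # separate LayerNorm per modality
        out[mask] = tokens[mask] + ffn(h, w1, w2)  # residual connection
    return out

rng = np.random.default_rng(0)
dim, hidden = 8, 16
params = {m: (rng.normal(size=(dim, hidden)) * 0.1,
              rng.normal(size=(hidden, dim)) * 0.1) for m in ("vis", "txt")}
tokens = rng.normal(size=(6, dim))
is_visual = np.array([True, True, True, False, False, False])
out = modality_specific_block(tokens, is_visual, params)
```

Because each modality owns its weights, updating the visual branch cannot perturb textual representations, which is the interference reduction the paper describes.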
Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling
Janus-Pro significantly improves upon the original Janus model by optimizing the training strategy, expanding the training data, and scaling up the model size, resulting in enhanced multimodal understanding, text-to-image instruction-following, and generation stability.
Xiaokang Chen, Zhiyu Wu, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, Chong Ruan
<p align="center"> <img src="https://github.com/user-attachments/assets/657b0f2a-7a0e-4aed-a214-a33485990790" width="600"/> </p>
<details> <summary>βΉοΈ <i>More Information</i></summary>
Janus-Pro maintains the core architecture of Janus, which decouples visual encoding for multimodal understanding and generation. It uses a unified autoregressive transformer but employs separate encoders for understanding (SigLIP) and generation (VQ tokenizer). The understanding encoder extracts semantic features, flattened and mapped to the LLM's input space via an "understanding adaptor." The generation encoder converts images to discrete IDs, flattened and mapped via a "generation adaptor." These feature sequences are concatenated and fed to the LLM. The model includes a built-in prediction head (from the LLM) and a randomly initialized prediction head for image generation. The key improvements in Janus-Pro lie in three areas: Optimized Training Strategy: Janus-Pro uses a three-stage training process. Stage I: Focuses on training the adaptors and image head with longer training on ImageNet, improving parameter initialization. Stage II: Unified pretraining, updating all components except the understanding and generation encoders. Crucially, it removes ImageNet data from this stage and uses only "normal" text-to-image data, improving efficiency. Stage III: Supervised fine-tuning, further updating the understanding encoder. The data ratio (multimodal:text:text-to-image) is adjusted from 7:3:10 to 5:1:4, improving multimodal understanding without sacrificing generation. Data Scaling: Janus-Pro significantly expands the training data. Multimodal Understanding: Adds ~90 million samples from sources like DeepSeek-VL2, including image captions (YFCC), table/chart/document understanding (Docmatix), MEME understanding, and Chinese conversational data. Visual Generation: Adds ~72 million synthetic aesthetic data samples, balancing real and synthetic data 1:1 during unified pretraining. This improves generation stability and aesthetic quality. Model Scaling: Janus-Pro scales up from 1.5B to 7B LLM parameters (DeepSeek-LLM). 
This significantly improves convergence speed for both understanding and generation. The training uses a sequence length of 4096, SigLIP-Large-Patch16-384 for understanding, and a VQ tokenizer with a codebook of 16,384 for generation. Adaptors are two-layer MLPs. Training is performed with HAI-LLM, a distributed training framework. Evaluation is conducted on benchmarks like GQA, MME, SEED, MMB, MM-Vet, MMMU (for understanding) and GenEval, DPG-Bench (for generation). Janus-Pro achieves state-of-the-art results among unified multimodal models, demonstrating significant improvements in both multimodal understanding and text-to-image generation. </details>
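The decoupled-encoder design can be illustrated with a toy sketch: separate two-layer adaptors map understanding features (SigLIP) and generation features (VQ codebook embeddings) into a shared LLM embedding space, after which the token sequences are concatenated. Dimensions and names here are invented for illustration.

```python
import numpy as np

def mlp2(x, w1, w2):
    # two-layer adaptor, as used for both the understanding and generation paths
    return np.tanh(x @ w1) @ w2

rng = np.random.default_rng(1)
llm_dim = 32
und_feats = rng.normal(size=(256, 24))    # semantic features (understanding encoder)
gen_embs = rng.normal(size=(576, 16))     # VQ-token embeddings (generation encoder)
und_adaptor = (rng.normal(size=(24, 64)), rng.normal(size=(64, llm_dim)))
gen_adaptor = (rng.normal(size=(16, 64)), rng.normal(size=(64, llm_dim)))
text_emb = rng.normal(size=(20, llm_dim)) # text token embeddings

# map both visual streams into the LLM space and concatenate with text tokens
seq = np.concatenate([mlp2(und_feats, *und_adaptor),
                      mlp2(gen_embs, *gen_adaptor),
                      text_emb], axis=0)
```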
LLaVA-CoT: Let Vision Language Models Reason Step-by-Step
LLaVA-CoT is a novel Vision-Language Model (VLM) designed to perform autonomous, multi-stage reasoning, enabling it to tackle complex visual question-answering tasks by independently engaging in sequential stages of summarization, visual interpretation, logical reasoning, and conclusion generation.
Guowei Xu, Peng Jin, Hao Li, Yibing Song, Lichao Sun, Li Yuan
<p align="center"> <img src="https://github.com/user-attachments/assets/1a5e32f0-4ffc-4514-8401-25777c2fac10" width="600"/> </p>
<details> <summary>βΉοΈ <i>More Information</i></summary>
LLaVA-CoT builds upon the Llama-3.2-Vision model and introduces a structured, four-stage reasoning process: Summary (briefly outlines the task), Caption (describes relevant image parts), Reasoning (detailed analysis), and Conclusion (provides the final answer). Each stage is marked with specific tags (<SUMMARY>, <CAPTION>, <REASONING>, <CONCLUSION>) to maintain clarity. Unlike traditional Chain-of-Thought (CoT) prompting, LLaVA-CoT promotes structured thinking by first organizing the problem and known information, then performing detailed reasoning, and finally deriving a conclusion. The model is trained on the newly compiled LLaVA-CoT-100k dataset, which integrates samples from various visual question answering sources and provides structured reasoning annotations. The dataset contains 99k image–question–answer pairs, with the structured reasoning details generated by GPT-4o. Data is gathered from general VQA datasets (ShareGPT4V, ChartQA, A-OKVQA, DocVQA, PISC, CLEVR) and science-targeted VQA datasets (AI2D, GeoQA+, ScienceQA, CLEVR-Math). The paper also proposes a novel inference-time stage-level beam search method: multiple candidate results are generated at each stage of the reasoning process, and the best is selected to continue, improving performance and scalability. This contrasts with traditional best-of-N or sentence-level beam search. The entire model is trained using supervised fine-tuning. </details>
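The stage-level beam search can be sketched as a simple loop: sample several candidates per reasoning stage, keep the best-scoring one, and condition the next stage on it. `generate` and `score` below are hypothetical stand-ins for the VLM and its candidate judge.

```python
import random

STAGES = ["SUMMARY", "CAPTION", "REASONING", "CONCLUSION"]

def generate(context, stage, rng):
    # placeholder for sampling one tagged stage output from the VLM
    return f"<{stage}>candidate-{rng.randint(0, 9)}</{stage}>"

def score(candidate):
    # placeholder quality metric; the paper selects the best via the model itself
    return len(candidate)

def stage_level_beam_search(question, n=4, seed=0):
    """At each stage, generate n candidates, keep the best, and continue."""
    rng = random.Random(seed)
    context = question
    for stage in STAGES:
        candidates = [generate(context, stage, rng) for _ in range(n)]
        context += max(candidates, key=score)  # keep best, discard the rest
    return context

out = stage_level_beam_search("What is in the image?")
```

This differs from best-of-N (which scores only whole responses) by pruning at every stage boundary.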
LLM2CLIP: Powerful Language Model Unlocks Richer Visual Representation
LLM2CLIP is a fine-tuning approach that integrates Large Language Models (LLMs) with pre-trained CLIP visual encoders. It improves the model by exploiting the LLM's ability to process and understand long captions, together with its open-world knowledge.
Weiquan Huang, Aoqi Wu, Yifan Yang, Xufang Luo, Yuqing Yang, Liang Hu, Qi Dai, Xiyang Dai, Dongdong Chen, Chong Luo, Lili Qiu
<p align="center"> <img src="https://github.com/user-attachments/assets/44d6952e-98ea-4875-bd9c-0a09a683bcbb" width="600"/> </p>
<details> <summary>βΉοΈ <i>More Information</i></summary>
LLM2CLIP is a fine-tuning approach that integrates an LLM with an already pretrained CLIP visual encoder. The problem it targets is that the LLM's text-understanding capability is not reflected in CLIP models. The authors highlight that directly incorporating LLMs into CLIP often fails due to the poor separability of LLM output features. To tackle this, they introduce a two-stage approach. Stage 1: Caption Contrastive (CC) Fine-tuning: The LLM (specifically Llama-3 8B) is fine-tuned using a contrastive learning framework on a dataset of image captions (CC3M). This stage does not train autoregressive capabilities; instead, the causal attention is made bidirectional so the LLM functions as an encoder. The goal is to improve the discriminative power of the LLM's output space, making it easier to distinguish between different captions, using a supervised SimCSE loss. Stage 2: CLIP Vision Encoder Fine-tuning: The pre-trained CLIP visual encoder is fine-tuned using the CC-fine-tuned LLM, now acting as a "super" text encoder. The LLM's weights are frozen during this stage to preserve its acquired knowledge and reduce computational cost. Learnable adapters (linear layers) are added after the LLM to facilitate alignment with the CLIP visual encoder; these adapters bridge the gap between the frozen LLM and the CLIP visual encoder. Instead of the typical image-text contrastive loss, a caption-to-caption contrastive framework is used during LLM fine-tuning, forcing the LLM to produce distinct representations for different captions describing the same image. Freezing the LLM during CLIP fine-tuning is crucial for efficiency and for preserving the LLM's knowledge. The method is surprisingly efficient, requiring only a small amount of open-source data (15M or even 3M image-text pairs) and a single epoch of training in some cases.
It leverages LoRA (Low-Rank Adaptation) for efficient fine-tuning. LLM2CLIP can effectively leverage dense captions (detailed image descriptions), a known limitation of standard CLIP. It uses ShareCaptioner-modified CC-3M (for CC fine-tuning), Wikitext-103, and a combination of CC-3M, CC-12M, YFCC-15M, and Recaption-1B for CLIP fine-tuning. The paper demonstrates that, after fine-tuning the LLM's output space, using an LLM as the text encoder has a significant impact, substantially improving performance on downstream tasks. </details>
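The caption-to-caption contrastive objective (supervised SimCSE-style) can be sketched as an InfoNCE loss over caption embeddings, where the positive for each anchor caption is another caption of the same image. A hedged NumPy sketch with toy embeddings, not the authors' code:

```python
import numpy as np

def simcse_loss(anchors, positives, temperature=0.05):
    """Contrastive loss: anchor caption i should match positive caption i
    (a caption of the same image) against all other captions in the batch."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / temperature                            # [batch, batch]
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.diag(log_probs).mean()                         # matches on diagonal

caps = np.eye(4)                                   # toy caption embeddings
loss_matched = simcse_loss(caps, caps)             # perfectly aligned pairs
loss_shuffled = simcse_loss(caps, np.roll(caps, 1, axis=0))  # mismatched pairs
```

Minimizing this loss spreads caption representations apart, which is exactly the "separability" property the paper says raw LLM features lack.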
Maya: An Instruction Finetuned Multilingual Multimodal Model
Maya is an open-source Multimodal Multilingual Vision Language Model (mVLM) designed to address the limitations of current VLMs in handling low-resource languages and diverse cultural contexts, achieved by creating a new multilingual image-text pretraining dataset, performing toxicity analysis and mitigation, and fine-tuning for enhanced cultural and linguistic comprehension.
Nahid Alam, Karthik Reddy Kanjula, Bala Krishna S Vegesna, S M Iftekhar Uddin, Drishti Sharma, Abhipsha Das, Shayekh Bin Islam, Surya Guthikonda, Timothy Chung, Anthony Susevski, Ryan Sze-Yin Chan, Roshan Santhosh, Snegha A, Chen Liu, Isha Chaturvedi, Ashvanth.S, Snehanshu Mukherjee, Alham Fikri Aji
<p align="center">
<img src="https://github.com/user-attachments/assets/f413afd9-3eee-4a5e-940a-b148fdf3189b" width="600"/>
</p>
<details> <summary>βΉοΈ <i>More Information</i></summary>
Architecture: Maya builds upon the LLaVA 1.5 framework. It uses the Aya-23 8B model as its Large Language Model (LLM) due to Aya's strong multilingual capabilities (trained on 23 languages). Critically, it replaces the CLIP vision encoder used in LLaVA with SigLIP. This is motivated by SigLIP's superior performance, multilingual support, and ability to handle variable-length image patches (allowing for more flexible input sizes). The visual features from SigLIP (Zv = g(Xv)) are passed through a trainable projection matrix (W, a 2-layer MLP with GELU activation) to align them with the LLM's embedding space, producing visual features Hv. The architecture is fairly standard for this type of model, concatenating visual and textual features for input to the LLM. The training process involves two main phases: pretraining and finetuning. Pretraining: The model is pretrained on a newly created multilingual image-text dataset. This dataset is derived from the English-only LLaVA pretraining dataset (558k image-text pairs) and translated into seven additional languages (Chinese, French, Spanish, Russian, Hindi, Japanese, and Arabic) using a sophisticated translation pipeline. This pipeline uses the Aya 35B model, optimized prompt engineering (determined empirically using BLEU and N-gram scores), and a batch processing approach with quality checks. Crucially, this dataset undergoes toxicity filtering. LLaVAGuard and Toxic-BERT are used to identify and remove toxic image-caption pairs, creating a "toxicity-free" version of the dataset (removing 7,531 toxic images). The pretraining uses a learning rate of 1e-3 and a cosine scheduler. Only the projection matrix is trained during pretraining. Finetuning: The pretrained model is then instruction-tuned using the PALO 150K instruction-tuning dataset (which covers 10 languages). Full finetuning is performed (as opposed to LoRA), with frozen vision encoder and LLM. 
The core alignment technique is the trainable projection matrix (the 2-layer MLP) that maps the SigLIP visual features into the embedding space of the Aya-23 LLM. This is a simple but effective method, common in many VLMs. The paper explicitly states they did not use more complex alignment techniques like gated soft-attention (Flamingo) or Q-Former (BLIP-2) in this phase, reserving those for future work. Pretraining Dataset: A new multilingual dataset created by translating and filtering the LLaVA pretraining dataset. This dataset is a key contribution of the paper. The translation process and toxicity filtering are described in detail. Instruction Tuning Dataset: PALO 150K instruction-tuning dataset. Evaluation Datasets: PALO multilingual evaluation, VizWiz, GQA, ScienceQA, TextVQA, POPE, MMBench, MM-Vet, MME. Multilingual Image-Text Pretraining Dataset: A new dataset of 558,000 images in eight languages. Toxicity Analysis and Mitigation: A thorough analysis of toxicity in the original LLaVA dataset and the creation of a toxicity-free version. This is a significant and novel aspect. Multilingual Model: A model (Maya) that shows improved performance in understanding cultural and linguistic nuances, especially in comparison to models trained primarily on English data. The results show that Maya performs comparably to or better than models of similar size (LLaVA-7B) and often approaches the performance of larger models (PALO-13B) on multilingual benchmarks. The toxicity filtering has a minimal impact on overall performance, suggesting that valuable information isn't lost by removing toxic content. The paper includes both quantitative benchmark results and qualitative examples demonstrating the model's capabilities.
</details>
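The alignment module itself is just a 2-layer GELU MLP mapping SigLIP features Zv into the LLM embedding space to produce Hv. A minimal sketch with reduced, illustrative dimensions (the real model maps SigLIP features into Aya-23's much larger embedding space):

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def project(Zv, W1, b1, W2, b2):
    """Hv: visual features mapped into the LLM embedding space (2-layer MLP)."""
    return gelu(Zv @ W1 + b1) @ W2 + b2

rng = np.random.default_rng(0)
vis_dim, llm_dim = 64, 128            # toy sizes; real dims are much larger
Zv = rng.normal(size=(729, vis_dim))  # one patch-token grid from SigLIP
W1, b1 = rng.normal(size=(vis_dim, llm_dim)) * 0.02, np.zeros(llm_dim)
W2, b2 = rng.normal(size=(llm_dim, llm_dim)) * 0.02, np.zeros(llm_dim)
Hv = project(Zv, W1, b1, W2, b2)      # visual tokens, ready to concatenate with text
```

During Maya's pretraining only these projector weights are updated; the vision encoder and LLM stay frozen.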
MiniMax-01: Scaling Foundation Models with Lightning Attention
MiniMax-01 is a series of large foundation models, including MiniMax-Text-01 and MiniMax-VL-01, that achieve performance comparable to top-tier models (like GPT-4o and Claude-3.5-Sonnet) while offering significantly longer context windows (up to 4 million tokens). It achieves this through a novel architecture incorporating lightning attention (a highly efficient linear attention variant), Mixture of Experts (MoE), and optimized training/inference frameworks.
MiniMax, Aonian Li, Bangwei Gong, Bo Yang, Boji Shan, Chang Liu, Cheng Zhu, Chunhao Zhang, Congchao Guo, Da Chen, Dong Li, Enwei Jiao, Gengxin Li, Guojun Zhang, Haohai Sun, Houze Dong, Jiadai Zhu, Jiaqi Zhuang, Jiayuan Song, Jin Zhu, Jingtao Han, Jingyang Li, Junbin Xie, Junhao Xu, Junjie Yan, Kaishun Zhang, Kecheng Xiao, Kexi Kang, Le Han, Leyang Wang, Lianfei Yu, Liheng Feng, Lin Zheng, Linbo Chai, Long Xing, Meizhi Ju, Mingyuan Chi, Mozhi Zhang, Peikai Huang, Pengcheng Niu, Pengfei Li, Pengyu Zhao, Qi Yang, Qidi Xu, Qiexiang Wang, Qin Wang, Qiuhui Li, Ruitao Leng, Shengmin Shi, Shuqi Yu, Sichen Li, Songquan Zhu, Tao Huang, Tianrun Liang, Weigao Sun, Weixuan Sun, Weiyu Cheng, Wenkai Li, Xiangjun Song, Xiao Su, Xiaodong Han, Xinjie Zhang, Xinzhu Hou, Xu Min, Xun Zou, Xuyang Shen, Yan Gong, Yingjie Zhu, Yipeng Zhou, Yiran Zhong, Yongyi Hu, Yuanxiang Fan, Yue Yu, Yufeng Yang, Yuhao Li, Yunan Huang, Yunji Li, Yunpeng Huang, Yunzhi Xu, Yuxin Mao, Zehan Li, Zekang Li, Zewei Tao, Zewen Ying, Zhaoyang Cong, Zhen Qin, Zhenhua Fan, Zhihang Yu, Zhuo Jiang, Zijia Wu
<details> <summary>βΉοΈ <i>More Information</i></summary>
Hybrid Attention: The core innovation is the hybrid attention mechanism. It primarily uses "lightning attention" (an I/O-aware implementation of TransNormer linear attention) for efficiency. However, to maintain strong retrieval capabilities, it strategically inserts a standard transformer block with softmax attention after every seven transnormer blocks (with lightning attention). This is a key differentiator from purely linear attention models, which often struggle with retrieval tasks. Mixture of Experts (MoE): To scale the model efficiently, MiniMax-01 employs a Mixture of Experts (MoE) architecture in the feed-forward layers. It has a massive 456 billion total parameters, but only 45.9 billion are activated for each token, using 32 experts with a top-2 routing strategy. This allows for a large model capacity without a corresponding increase in computational cost per token. Vision-Language Model (MiniMax-VL-01): The vision-language model (MiniMax-VL-01) builds upon MiniMax-Text-01 by integrating a lightweight Vision Transformer (ViT) module. It uses a dynamic resolution strategy, resizing input images to various sizes (from 336x336 to 2016x2016) and concatenating features from both resized patches and a standard thumbnail. It does not use pooling or downsampling on the visual features, relying instead on the long-context capabilities of the architecture. Demonstrates the viability of linear attention at a massive scale, achieving performance comparable to top-tier models while significantly extending the context window. Long-Context Capability: Supports context inputs of up to 4 million tokens, with strong performance in long-context evaluations. Efficient Training and Inference Framework: Introduces several novel algorithmic and engineering optimizations to handle the hybrid architecture, MoE, and long contexts efficiently. Pre-training: A meticulously curated corpus incorporating academic literature, books, web content, and programming code. 
Vision-Language Pre-training (VL-01): A substantial image-caption dataset (694 million unique pairs) and a dataset of 100 million images with fine-grained descriptions. Vision-Language Instruction Data (VL-01): A comprehensive and diverse instruction-based dataset synthesized from a wide array of image-related tasks. Alignment datasets are also mentioned but not detailed in the paper. Hybrid Attention: The core fusion method is the hybrid attention mechanism, which combines the efficiency of lightning attention (linear) with the retrieval capabilities of softmax attention. MoE Routing: The MoE architecture with its top-2 routing strategy allows for selective activation of experts, enhancing model capacity without increasing computational cost per token. A global router is used for load balancing. Vision-Language Fusion (VL-01): Visual features from the ViT are projected into the embedding space of the LLM using a two-layer MLP. The raw, high-dimensional visual features are directly used without pooling or downsampling, leveraging the long-context capabilities of the architecture. Varlen Ring Attention and LASP+: These algorithms enable efficient handling of long, variable-length sequences and data packing during both training and inference. Post-Training and Alignment: Various techniques are used for alignment.
</details>
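Two of the ingredients above are easy to sketch: the hybrid layer schedule (a softmax-attention block after every seven lightning-attention blocks) and the top-2 expert routing. Illustrative Python with invented function names:

```python
import numpy as np

def layer_types(num_layers, period=8):
    """Hybrid schedule: blocks 1..7 use lightning (linear) attention,
    block 8 uses softmax attention, and the pattern repeats."""
    return ["softmax" if (i + 1) % period == 0 else "lightning"
            for i in range(num_layers)]

def top2_route(router_logits):
    """Top-2 MoE routing: pick the two highest-scoring experts and
    renormalize their gate weights (load balancing not shown)."""
    idx = np.argsort(router_logits)[::-1][:2]
    gates = np.exp(router_logits[idx])
    return idx, gates / gates.sum()

schedule = layer_types(16)
experts, gates = top2_route(np.array([0.1, 2.0, 1.0, -1.0]))
```

Only the two routed experts run per token, which is how 456B total parameters cost roughly 45.9B activated parameters per token.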
NVLM: Open Frontier-Class Multimodal LLMs
NVLM 1.0 is a family of multimodal large language models (LLMs) achieving state-of-the-art results on vision-language tasks, rivaling proprietary and open-access models. It demonstrates improved text-only performance after multimodal training and offers a comprehensive comparison of decoder-only and cross-attention-based architectures, introducing a novel hybrid architecture and a 1-D tile-tagging design for high-resolution images.
Wenliang Dai, Nayeon Lee, Boxin Wang, Zhuolin Yang, Zihan Liu, Jon Barker, Tuomas Rintamaki, Mohammad Shoeybi, Bryan Catanzaro, Wei Ping
<p align="center">
<img src="https://github.com/user-attachments/assets/da882643-ac1d-4566-8287-cd8da3897a88" width="600"/>
</p>
<details> <summary>βΉοΈ <i>More Information</i></summary>
NVLM (NVIDIA Vision Language Model) introduces a family of models with three primary architectures: NVLM-D (Decoder-only), NVLM-X (Cross-attention-based), and NVLM-H (Hybrid). All models share a common vision pathway, employing a frozen InternViT-6B-448px-V1-5 vision encoder with dynamic high-resolution (DHR) processing. DHR involves dividing input images into tiles (up to 6, with varying aspect ratios) and a downscaled global "thumbnail" tile. These tiles are processed by the vision encoder, and the resulting 1024 tokens per tile are downsampled to 256 via pixel shuffling. NVLM-D (Decoder-only): Connects the vision encoder to the LLM (Qwen2-72B-Instruct or Nous-Hermes-2-Yi-34B) via a 2-layer MLP projector. It introduces a novel 1-D tile-tagging design for handling high-resolution images. Text-based tile tags (e.g., <tile_1>) are inserted before the flattened image tokens of each tile to provide positional information to the LLM. Training involves pretraining (frozen LLM and vision encoder, training only the MLP) and supervised fine-tuning (SFT) (unfrozen LLM and MLP). Crucially, a high-quality text-only SFT dataset is included to maintain/improve text-only performance. NVLM-X (Cross-attention-based): Uses gated cross-attention layers to process image tokens, similar to Flamingo, but without a Perceiver resampler. Image features are projected to the LLM's hidden dimension with a one-layer MLP. Gated X-attention layers are interleaved with LLM self-attention layers. Training also has pretraining and SFT stages. The LLM backbone is unfrozen during SFT, and a high-quality text-only dataset is used. 1-D tile tags are also used, but within the X-attention layers. NVLM-H (Hybrid): Combines aspects of NVLM-D and NVLM-X. The thumbnail image tokens are processed by the LLM's self-attention layers (like NVLM-D), while the regular tile tokens are processed by gated cross-attention (like NVLM-X). This aims to balance multimodal reasoning with computational efficiency. 
It also uses 1-D tile tags in cross-attention. The 1-D tile-tagging design significantly improves performance, especially on OCR-related tasks, compared to simply concatenating image tokens or using 2D grid/bounding box tags. The authors emphasize that dataset quality and task diversity are more important than sheer scale, even during pretraining. NVLM models achieve strong performance on both vision-language and text-only tasks. This is achieved by including a high-quality text-only dataset during SFT and incorporating multimodal math and reasoning data. Decoder vs. cross-attention: cross-attention-based models are more efficient with high-resolution images, while decoder-only models provide unified multimodal reasoning and higher accuracy on OCR-related tasks. Pretraining data is curated from open-source datasets, including captioning (COCO, CC3M, SBU, LAION-115M), VQA (VQAv2, Visual Genome, DVQA), document understanding (Docmatix), OCR/scene-text (various datasets), and math (CLEVR-Math), with an emphasis on quality over quantity. SFT uses a diverse collection of task-oriented datasets, including captioning, VQA, chart/diagram understanding, document understanding, OCR, math, and science datasets. High-quality text-only data from various sources (ShareGPT, SlimOrca, EvolInstruct, etc.) and categories (general, math, coding) is crucial for maintaining/improving text-only performance, refined using GPT-4o and GPT-4o-mini. NVLM models are evaluated on a wide range of vision-language benchmarks (MMMU, MathVista, OCRBench, AI2D, ChartQA, DocVQA, TextVQA, RealWorldQA, VQAv2) and text-only benchmarks (MMLU, GSM8K, MATH, HumanEval).
</details>
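Two details of the shared vision pathway lend themselves to a sketch: the pixel-shuffle downsampling of each tile's 1024 patch tokens to 256 (merging 2x2 neighborhoods into wider tokens), and the 1-D tile tags inserted before each tile's tokens. NumPy sketch with invented names; tags and tokens are shown as plain strings:

```python
import numpy as np

def pixel_shuffle_downsample(tokens, grid=32, factor=2):
    """Merge each factor x factor neighborhood of patch tokens into one token
    by channel concatenation: 32x32 = 1024 tokens -> 16x16 = 256 tokens."""
    d = tokens.shape[-1]
    x = tokens.reshape(grid // factor, factor, grid // factor, factor, d)
    x = x.transpose(0, 2, 1, 3, 4)                  # group 2x2 neighborhoods
    return x.reshape((grid // factor) ** 2, factor * factor * d)

def tag_tiles(num_tiles, tokens_per_tile):
    """1-D tile tagging: a text tag like <tile_1> precedes each tile's
    flattened image tokens (token placeholders shown as strings)."""
    seq = []
    for k in range(1, num_tiles + 1):
        seq.append(f"<tile_{k}>")
        seq.extend(f"tok{k}_{i}" for i in range(tokens_per_tile))
    return seq

tile = np.arange(1024 * 4, dtype=float).reshape(1024, 4)
merged = pixel_shuffle_downsample(tile)             # (256, 16)
seq = tag_tiles(num_tiles=3, tokens_per_tile=256)   # tags give tiles positional identity
```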
OmniVLM: A Token-Compressed, Sub-Billion-Parameter Vision-Language Model for Efficient On-Device Inference
OmniVLM is a sub-billion-parameter vision-language model designed for efficient on-device inference, featuring a token compression mechanism that reduces the visual token sequence length from 729 to 81, drastically cutting computational overhead while maintaining visual-semantic fidelity. It is built on the Qwen2.5-0.5B-Instruct language model and Google's SigLIP-400M vision encoder.
Wei Chen, Zhiyuan Li, Shuo Xin
<p align="center">
<img src="https://github.com/user-attachments/assets/da2a140a-5efe-4499-addc-8ccbb3e9792a" width="600"/>
</p>
<details> <summary>βΉοΈ <i>More Information</i></summary>
OmniVLM addresses the challenges of deploying vision-language models (VLMs) on resource-constrained edge devices. It achieves this through a novel token compression mechanism and a multi-stage training pipeline. The core innovation is the image token compression, which transforms the embedding dimensions from [batch_size, 729, hidden_size] to [batch_size, 81, hidden_size] within the projection layer. This 9x reduction in token count is achieved through reshaping, chosen after empirical comparison against convolution-based methods. The model architecture (Figure 1) builds upon the LLaVA framework, employing Google's SigLIP-400M as the vision encoder, Qwen2.5-0.5B-Instruct as the base language model, and a Multi-Layer Perceptron (MLP) as the projection layer. The training pipeline consists of three stages: (1) Pretraining on large-scale image-caption pairs (primarily from the LLaVA pretraining dataset) to learn visual-linguistic alignments, training only the projection layer; (2) Supervised Fine-Tuning (SFT) on a mix of datasets (LLaVA, UnimmChat, and internal data) to improve contextual understanding and conversational coherence, training the projector and LLM while freezing the vision encoder; and (3) Minimal-Edit Direct Preference Optimization (DPO), using a teacher model to create minimally edited corrections to the base model's outputs, forming chosen-rejected pairs for preference learning, again freezing the vision encoder and training the projector and LLM. The DPO process leverages GPT-4V to generate synthetic training pairs. Extensive experiments show that the 81-token configuration provides the optimal balance between computational efficiency and model performance. OmniVLM outperforms nanoLLAVA on benchmarks like ScienceQA, POPE, and MMMU, demonstrating improved reasoning, multimodal comprehension, and generalization. 
Crucially, it achieves significantly faster inference speeds (9.1x faster time-to-first-token and 1.5x higher decoding speed compared to nanoLLAVA on a laptop, and 8x faster TTFT on a mobile device), making it suitable for deployment on edge devices like smartphones and laptops. </details>
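The reshaping-based compression can be sketched directly: every nine consecutive visual tokens are merged into one wider token, taking [batch, 729, hidden] to [batch, 81, 9*hidden] before the MLP projector maps the result to the LLM dimension. An illustrative sketch; the exact grouping in the released model may differ.

```python
import numpy as np

def compress_tokens(x, group=9):
    """Merge every `group` consecutive visual tokens into one wider token:
    [B, 729, H] -> [B, 81, group*H]. The projection MLP (not shown) then
    maps the widened tokens to the LLM's hidden size."""
    b, n, h = x.shape
    assert n % group == 0
    return x.reshape(b, n // group, group * h)

x = np.zeros((2, 729, 32))
y = compress_tokens(x)   # 9x fewer tokens for the LLM to attend over
```

Since LLM prefill cost scales with sequence length, the 9x token reduction is where most of the reported time-to-first-token speedup comes from.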
Pixtral 12B: A Cutting-Edge Open Multimodal Language Model
Pixtral 12B is a 12-billion-parameter multimodal language model developed by Mistral AI, designed to excel at understanding both images and text, achieving leading performance on various multimodal benchmarks. The core of the VLM is built upon the transformer architecture. A distinctive aspect of Pixtral 12B is that its vision encoder is trained from scratch to natively support variable image sizes and aspect ratios.
Pravesh Agrawal, Szymon Antoniak, Emma Bou Hanna, Baptiste Bout, Devendra Chaplot, Jessica Chudnovsky, et al. (Mistral AI Science Team)
<p align="center">
<img src="https://github.com/user-attachments/assets/5187d3c0-e284-40eb-bb94-53105c8cbe11" width="600"/>
</p>
<details> <summary>βΉοΈ <i>More Information</i></summary>
Pixtral 12B has two main components: a vision encoder (Pixtral-ViT), which tokenizes images, and a multimodal decoder, which predicts the next token given a sequence of text and images. Pixtral can take an arbitrary number of images as input, provided they fit within its 128K context window. The vision encoder (Pixtral-ViT) is trained from scratch with a novel ROPE-2D implementation, allowing it to process images at their native resolution and aspect ratio. The model can flexibly process images at low resolution in latency-constrained settings, while processing images at high resolution when fine-grained reasoning is required. To distinguish between images with the same number of patches but different aspect ratios, [IMAGE BREAK] tokens are inserted between image rows, and an [IMAGE END] token is appended at the end of the image sequence. The model employs a gated FFN architecture, implementing gating in the hidden layer in place of the standard feed-forward layer in each transformer block. To process multiple images within a single batch, the model flattens images along the sequence dimension and concatenates them; a block-diagonal mask prevents attention leakage between patches of different images. Traditional learned absolute position embeddings are replaced by ROPE-2D, which allows handling variable image sizes. The multimodal decoder of Pixtral is built on top of Mistral Nemo 12B, a 12-billion-parameter decoder-only language model, and uses causal self-attention. The vision encoder is connected to the multimodal decoder by a two-layer fully connected network. The paper describes Pixtral as an instruction-tuned model, pre-trained on large-scale interleaved image and text documents. The paper also contributes an open-source benchmark, MM-MT-Bench, for evaluating vision-language models; Pixtral excels at multimodal instruction following, surpassing comparable open-source models on this benchmark. </details>
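The row-break layout can be sketched as follows: patches are emitted row by row, with a break token between rows and an end token closing the image, so a 2x3 grid and a 3x2 grid produce different sequences despite both having six patches. Token names here are illustrative placeholders, not Pixtral's exact vocabulary.

```python
def image_token_sequence(rows, cols):
    """Flatten a rows x cols patch grid into a 1-D token sequence with
    [IMAGE BREAK] between rows and [IMAGE END] at the end (placeholder names)."""
    seq = []
    for r in range(rows):
        seq.extend(f"patch_{r}_{c}" for c in range(cols))
        if r < rows - 1:
            seq.append("[IMAGE BREAK]")
    seq.append("[IMAGE END]")
    return seq

wide = image_token_sequence(2, 3)   # 6 patches + 1 break + 1 end
tall = image_token_sequence(3, 2)   # 6 patches + 2 breaks + 1 end
```

The differing break counts are what let the decoder recover the aspect ratio from a flat token stream.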
Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos
Sa2VA is a unified model for dense grounded understanding of both images and videos, integrating the SAM-2 video segmentation model with the LLaVA vision-language model. It supports a wide array of image and video tasks, like referring segmentation and conversation, by treating all inputs (text, images, videos) as tokens in a shared LLM space, generating instruction tokens that guide SAM-2 for precise mask production.
Haobo Yuan, Xiangtai Li, Tao Zhang, Zilong Huang, Shilin Xu, Shunping Ji, Yunhai Tong, Lu Qi, Jiashi Feng, Ming-Hsuan Yang
<p align="center"> <img src="https://github.com/user-attachments/assets/7527a503-4987-4401-961b-f52532788b1f" width="600"/> </p>
<details> <summary>βΉοΈ <i>More Information</i></summary>
Sa2VA leverages a pre-trained LLaVA-like model (containing a visual encoder, visual projection layer, and LLM) and appends SAM-2 alongside it. Crucially, it uses a decoupled design, where SAM-2's decoder and memory module are frozen. This preserves SAM-2's perception and tracking capabilities and allows Sa2VA to be a plug-and-play module, updatable with newer MLLMs. The connection between the LLM and SAM-2 is a special "[SEG]" token. The LLM generates this token, and its hidden states act as a spatial-temporal prompt for SAM-2's decoder, which produces segmentation masks. The model is trained end-to-end, demonstrating scalability. The training uses a unified instruction-tuning format for various tasks: referring segmentation, visual question answering (VQA), and grounded conversation generation (GCG) for both images and videos. It treats all images, videos and prompts as visual tokens. A key aspect is the co-training with multiple datasets, including image and video data. The authors introduce Ref-SAV, an auto-labeled dataset with over 72,000 object expressions in complex video scenes, and manually validate 2,000 video objects in Ref-SAV for benchmarking referring video object segmentation. A simple mask tracking method re-utilizes SAM-2's knowledge. The model formulates all tasks as a single instruction-tuning process. Datasets used for co-training are: LLAVA 1.5 (665K), RefCOCO (17K), RefCOCO+ (17K), RefCOCOg (22K), Grand-f (214K), ChatUniVi (100K). Ref-YTVOS (3.5K), MeVIS (0.6K), ReVOS (1.7K) and Ref-SAV (37K). </details>
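The "[SEG]" mechanism can be sketched in a few lines: locate the [SEG] token in the generated sequence, take the LLM hidden state at that position, and project it into the prompt space of the frozen SAM-2 decoder. Names and dimensions below are invented for illustration.

```python
import numpy as np

def seg_prompt(hidden_states, token_ids, seg_id, W_prompt):
    """Return the SAM-2 prompt vector derived from the LLM hidden state at
    the '[SEG]' token position; the frozen SAM-2 decoder (not shown) turns
    this spatial-temporal prompt into a segmentation mask."""
    pos = int(np.where(token_ids == seg_id)[0][0])
    return hidden_states[pos] @ W_prompt

rng = np.random.default_rng(0)
hidden = rng.normal(size=(10, 16))                # LLM hidden states, 10 tokens
ids = np.array([5, 8, 3, 1, 9, 2, 99, 4, 7, 6])   # 99 = hypothetical [SEG] id
W = rng.normal(size=(16, 24))                     # projection into prompt space
prompt = seg_prompt(hidden, ids, seg_id=99, W_prompt=W)
```

Keeping SAM-2's decoder frozen means only this interface is learned, which is what makes the design plug-and-play with newer MLLMs.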
Tarsier2: Advancing Large Vision-Language Models from Detailed Video Description to Comprehensive Video Understanding
Tarsier2 is a state-of-the-art large vision-language model (LVLM) that excels in generating detailed and accurate video descriptions and demonstrates superior general video understanding capabilities. It scales pre-training data, performs fine-grained temporal alignment during supervised fine-tuning, and uses model-based sampling with Direct Preference Optimization (DPO) to improve performance, outperforming models like GPT-4o and Gemini 1.5 Pro.
Liping Yuan, Jiawei Wang, Haomiao Sun, Yuchen Zhang, Yuan Lin
<p align="center"> <img src="https://github.com/user-attachments/assets/e6626842-69ac-4547-8c4b-cb260dd349ca" width="600"/> </p>
<details> <summary>βΉοΈ <i>More Information</i></summary>
Tarsier2 utilizes a straightforward architecture consisting of a vision encoder, a vision adaptor, and a large language model (LLM), specifically building upon Qwen2-VL. The model undergoes a three-stage training process: pre-training, supervised fine-tuning (SFT), and reinforcement learning (RL) using Direct Preference Optimization (DPO). A key improvement over its predecessor, Tarsier, is the significant expansion of the pre-training dataset from 11 million to 40 million video-text pairs. This expansion includes the meticulous collection and filtering of 11 million commentary videos (explanations and analyses of movies and TV shows), providing rich contextual information. During the SFT stage, Tarsier2 is trained on a dataset containing 150K instances, each with a detailed video description and specific frame annotations corresponding to each described event. This fine-grained temporal alignment provides supervision that improves accuracy and reduces hallucinations compared to traditional video-caption alignment. The SFT phase is conducted in two steps: the first performs frame-to-event alignment, and the second adjusts the model's output toward a more human-like style. The final training stage employs DPO with automatically generated preference data. Negative samples are created by corrupting videos (clip-switching, clip-reversing, clip-cropping, and down-sampling), and a preference data filtering method (using AutoDQ) ensures high-quality pairs. Tarsier2 achieves state-of-the-art results on 15 public benchmarks, demonstrating its versatility across tasks such as video question-answering, video grounding, hallucination tests, and embodied question-answering. A recaptioning dataset, Tarsier2-Recap-585K, is also released. </details>
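The DPO negative-sample construction can be sketched by corrupting a frame sequence at the clip level: split the video into clips, then switch two clips or reverse their order; a description of the corrupted video then serves as the rejected sample. An illustrative sketch with invented function names:

```python
import random

def corrupt_clips(frames, num_clips=4, mode="switch", seed=0):
    """Build a corrupted video for a DPO rejected sample via clip-switching
    or clip-reversing (clip-cropping and down-sampling are analogous)."""
    rng = random.Random(seed)
    k = len(frames) // num_clips
    clips = [frames[i * k:(i + 1) * k] for i in range(num_clips)]
    if mode == "switch":
        i, j = rng.sample(range(num_clips), 2)  # swap two random clips
        clips[i], clips[j] = clips[j], clips[i]
    elif mode == "reverse":
        clips = clips[::-1]                     # reverse temporal order
    return [f for clip in clips for f in clip]

reversed_video = corrupt_clips(list(range(8)), mode="reverse")
```

Descriptions of the intact video (chosen) and the corrupted one (rejected) form the preference pair used for DPO.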
UI-TARS: Pioneering Automated GUI Interaction with Native Agents
UI-TARS is a native GUI agent model that operates solely on screenshots, performing human-like interactions (keyboard and mouse operations). Unlike frameworks relying on wrapped commercial models (e.g., GPT-4o), UI-TARS is an end-to-end model achieving state-of-the-art (SOTA) performance on 10+ GUI agent benchmarks in perception, grounding, and task execution, significantly outperforming sophisticated frameworks.
Yujia Qin, Yining Ye, Junjie Fang, Haoming Wang, Shihao Liang, Shizuo Tian, Junda Zhang, Jiahao Li, Yunxin Li, Shijue Huang, Wanjun Zhong, Kuanye Li, Jiale Yang, Yu Miao, Woyu Lin, Longxiang Liu, Xu Jiang, Qianli Ma, Jingyu Li, Xiaojun Xiao, Kai Cai, Chuang Li, Yaowei Zheng, Chaolin Jin, Chen Li, Xiao Zhou, Minchao Wang, Haoli Chen, Zhaojian Li, Haihua Yang, Haifeng Liu, Feng Lin, Tao Peng, Xin Liu, Guang Shi
<p align="center"> <img src="https://github.com/user-attachments/assets/9dccbdf3-a0ab-4ae4-930b-09a974f14a03" width="600"/> </p>
<details> <summary>βΉοΈ <i>More Information</i></summary>
UI-TARS leverages several key innovations: (1) Enhanced Perception, utilizing a large-scale GUI screenshot dataset for context-aware understanding and precise captioning of UI elements; (2) Unified Action Modeling, standardizing actions into a unified space across platforms and achieving precise grounding through large-scale action traces; (3) System-2 Reasoning, incorporating deliberate reasoning for multi-step decision-making, including task decomposition, reflection, and milestone recognition; and (4) Iterative Training with Reflective Online Traces, addressing the data bottleneck by automatically collecting, filtering, and refining interaction traces on hundreds of virtual machines. The model is trained iteratively and tuned via reflection, continuously learning from mistakes and adapting to new situations with minimal human intervention. The architecture takes screenshots as input and uses a Vision-Language Model (VLM), specifically Qwen-2-VL 7B and 72B, to process visual information and generate actions. The action space is unified across platforms (mobile, desktop, web) and includes actions like click, type, scroll, and drag. Reasoning is infused by generating explicit "thoughts" before each action, inspired by the ReAct framework. These thoughts are generated through a combination of curated GUI tutorials and augmented action traces, incorporating patterns like task decomposition, long-term consistency, milestone recognition, trial and error, and reflection. The training process involves multiple stages, starting with perception enhancement using a curated dataset of GUI screenshots and associated metadata. This dataset supports tasks like element description, dense captioning, state transition captioning, question answering, and set-of-mark prompting. Action modeling is improved by creating a large-scale dataset of action traces and using grounding data to pair element descriptions with spatial coordinates. 
The model is trained using a combination of supervised fine-tuning (SFT) and Direct Preference Optimization (DPO) with reflection tuning to learn from errors. </details>
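The unified action space described above can be sketched as a tiny schema. The action names mirror those listed (click, type, scroll, drag), but the dataclass and the ReAct-style `step` helper are hypothetical, not UI-TARS's real serialization format:

```python
from dataclasses import dataclass, field

@dataclass
class Action:
    """A platform-agnostic GUI action; fields are illustrative, not UI-TARS's schema."""
    name: str
    args: dict = field(default_factory=dict)

    def to_call(self):
        params = ", ".join(f"{k}={v!r}" for k, v in self.args.items())
        return f"{self.name}({params})"

# The same primitives apply across mobile, desktop, and web
def click(x, y): return Action("click", {"x": x, "y": y})
def type_text(text): return Action("type", {"text": text})
def scroll(direction): return Action("scroll", {"direction": direction})
def drag(x1, y1, x2, y2): return Action("drag", {"x1": x1, "y1": y1, "x2": x2, "y2": y2})

def step(thought, action):
    """An explicit 'thought' precedes each action, ReAct-style."""
    return f"Thought: {thought}\nAction: {action.to_call()}"
```

Keeping actions in one shared space is what lets a single model transfer grounding and traces across platforms instead of learning per-platform command sets.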
VideoChat-Flash: Hierarchical Compression for Long-Context Video Modeling
VideoChat-Flash is a system designed for handling long-form video content in multimodal large language models (MLLMs). It introduces a Hierarchical visual token Compression (HiCo) method to reduce computational load while preserving essential details, and uses a multi-stage learning approach with a new long-video dataset (LongVid) to achieve state-of-the-art performance on both long and short video benchmarks.
Xinhao Li, Yi Wang, Jiashuo Yu, Xiangyu Zeng, Yuhan Zhu, Haian Huang, Jianfei Gao, Kunchang Li, Yinan He, Chenting Wang, Yu Qiao, Yali Wang, Limin Wang
<p align="center"> <img src="https://github.com/user-attachments/assets/49048795-6a76-41e7-b441-1313d0813630" width="600"/> </p>
<details> <summary>βΉοΈ <i>More Information</i></summary>
VideoChat-Flash combines several components and training strategies:

- **Hierarchical visual token Compression (HiCo)**: the core innovation, compressing video information at two levels. *Clip-level compression*: the video is divided into clips; a visual encoder (UMT-L) processes each clip, and a compressor (token merging with an MLP) reduces the number of visual tokens, exploiting inter-frame redundancy. *Video-level compression*: during interaction with the LLM (Qwen2-7B), visual tokens are further reduced via a progressive visual dropout strategy, leveraging the observation that the LLM attends to the entire context at shallow layers and to specific details at deeper layers; it combines uniform dropout (shallow layers) with text-guided selection (deep layers).
- **Visual Encoder**: UMT-L@224, a video encoder shown to be more efficient than image encoders such as SigLIP.
- **Visual-Language Connector**: a compressor (token merging) followed by an MLP projection.
- **Large Language Model**: Qwen2-7B.
- **Multi-stage Short-to-Long Learning**: a four-stage training strategy. Stage 1 (Video-Language Alignment) trains the compressor and MLP with image-text and short video-text pairs (0.5M each). Stage 2 (Short Video Pre-training) enhances visual understanding with more images (3.5M) and short videos (2.5M). Stage 3 (Joint Short & Long Video Instruction Tuning) fine-tunes on a mix of images (1.1M), short videos (1.7M), and long videos (0.7M) with instruction-following data. Stage 4 (Efficient High-Resolution Post-finetuning) adapts the model to higher resolutions (224 to 448) by fine-tuning the video encoder on a 25% subset of the Stage 3 data.
- **Dynamic Video Sampling**: a dual strategy using dense sampling (15 fps) for short videos (capturing motion) and sparse sampling (1 fps) for long videos (capturing events).
- **Timestamp-aware Prompt**: a simple text prompt supplies timestamp information to the model: "The video lasts for N seconds, and T frames are uniformly sampled from it."
- **LongVid**: a new large-scale long-video instruction-tuning dataset introduced in the paper, containing 114,228 long videos and 3,444,849 question-answer pairs across five task types. It builds on existing datasets (Ego4D, HowTo100M, HD-Vila, MiraData) and generates dense event labels.
- **Mixed Training Data**: a combination of short and long videos is used during training.
- **NIAH (Needle In A video Haystack)**: a newly created benchmark for testing models' ability to understand long contexts.
</details>
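The clip-level token merging at the heart of HiCo can be illustrated with a minimal sketch. The group size, the list-based "tensors", and the simplified uniform-dropout helper are assumptions for illustration; the real model merges learned features and adds text-guided token selection at deep LLM layers:

```python
def merge_tokens(tokens, group=4):
    """Clip-level compression sketch: average each group of `group` visual tokens
    into one. `tokens` is a list of feature vectors (lists of floats)."""
    merged = []
    for i in range(0, len(tokens), group):
        chunk = tokens[i:i + group]
        dim = len(chunk[0])
        merged.append([sum(v[d] for v in chunk) / len(chunk) for d in range(dim)])
    return merged

def progressive_dropout(tokens, layer, shallow_layers=16, keep_ratio=0.5):
    """Video-level compression sketch: uniform dropout at shallow layers.
    (The text-guided selection applied at deeper layers is omitted here.)"""
    if layer < shallow_layers:
        stride = max(1, round(1 / keep_ratio))
        return tokens[::stride]
    return tokens
```

Chaining the two stages is what lets token count shrink both before the LLM (per clip) and inside it (per layer), which is where the long-video savings come from.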
VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding
VideoLLaMA3 is a vision-centric multimodal foundation model designed for both image and video understanding, emphasizing a training paradigm and framework that prioritize high-quality image-text data, alongside an adaptable vision encoder and video token compression, to achieve state-of-the-art performance.
Boqiang Zhang, Kehan Li, Zesen Cheng, Zhiqiang Hu, Yuqian Yuan, Guanzheng Chen, Sicong Leng, Yuming Jiang, Hang Zhang, Xin Li, Peng Jin, Wenqi Zhang, Fan Wang, Lidong Bing, Deli Zhao
<p align="center"> <img src="https://github.com/user-attachments/assets/350a1228-c14e-45ed-b59f-e99608ee9a7d" width=600/> </p>
<details> <summary>βΉοΈ <i>More Information</i></summary>
VideoLLaMA3 introduces a vision-centric approach in both its training paradigm and framework design, focusing on enhancing image and video understanding capabilities. The core architecture incorporates a pre-trained vision encoder (SigLIP), a video compressor (DiffFP), a projector, and a large language model (LLM - Qwen2.5). The model employs a four-stage training process: 1) Vision Encoder Adaptation, where the vision encoder is adapted to accept images of variable resolutions using scene images, document data, and scene text images; 2) Vision-Language Alignment, which jointly tunes the vision encoder, projector, and LLM using large-scale image-text data (including detailed captions, documents, and charts) and a small amount of text-only data; 3) Multi-task Fine-tuning, incorporating image-text data for downstream tasks and general video caption data; and 4) Video-centric Fine-tuning, using general videos, streaming videos, temporally grounded videos, image-only, and text-only data. A key innovation is Any-resolution Vision Tokenization (AVT), which allows the vision encoder to process images and videos of any resolution by replacing fixed positional embeddings with Rotary Position Embedding (RoPE). This enables handling images with variable shapes and minimal information loss. For video inputs, Differential Frame Pruner (DiffFP) acts as a video compressor, reducing the number of vision tokens by comparing the 1-norm distance between temporally consecutive patches in pixel space and pruning redundant patches. This makes video representations more compact and precise. The training data mixture is carefully curated for each stage, emphasizing high-quality image-text data. The Vision Encoder Adaptation stage uses datasets like VL3-Syn7M-short, LLaVA-Pretrain-558k, and document datasets. The Vision-Language Alignment stage expands on this with detailed captions, OCR data, and fine-grained data with bounding boxes. 
The Multi-task Fine-tuning stage adds question-answering data and general video caption data. Finally, the Video-centric Fine-tuning stage includes general videos, streaming videos, and temporal grounding data. This "vision-centric" approach, prioritizing image understanding as a foundation for video understanding, along with AVT and DiffFP, allows VideoLLaMA3 to achieve strong performance on both image and video benchmarks. </details>
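DiffFP's pruning of temporally redundant patches by 1-norm distance can be sketched directly from the description above. The threshold value and the list-based patch representation are illustrative assumptions:

```python
def diff_frame_prune(frames, threshold=0.1):
    """Sketch of Differential Frame Pruning (DiffFP): keep a patch only if its mean
    1-norm distance to the same patch in the previous frame exceeds `threshold`.
    `frames` is a list of frames; each frame is a list of patches; each patch is a
    flat list of pixel values. Returns, per frame, the indices of patches kept."""
    kept = [list(range(len(frames[0])))]  # keep every patch of the first frame
    for t in range(1, len(frames)):
        keep_t = []
        for p, (cur, prev) in enumerate(zip(frames[t], frames[t - 1])):
            dist = sum(abs(a - b) for a, b in zip(cur, prev)) / len(cur)
            if dist > threshold:  # patch changed enough to carry new information
                keep_t.append(p)
        kept.append(keep_t)
    return kept
```

Static backgrounds are dropped after the first frame while moving regions survive, which is why the resulting video representation is both more compact and more precise.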
Llama 3.2-Vision: Enhanced Multimodal Capabilities Built on Llama 3
Llama 3.2-Vision extends the Llama 3 text-only model with multimodal capabilities, allowing it to process both text and images. This model, available in 11B and 90B parameter sizes, leverages a vision adapter with cross-attention layers to integrate image representations from a separate vision encoder into the core Llama 3 LLM, achieving strong performance on visual recognition, image reasoning, captioning, and visual question answering.
Meta
<p align="center">
<img src="https://github.com/user-attachments/assets/f6428237-8607-4de1-975d-dfc4c683b7a3" width=600/>
</p>
<details> <summary>βΉοΈ <i>More Information</i></summary>
Llama 3.2-Vision builds upon the Llama 3 architecture, an auto-regressive language model using an optimized transformer. It adds a vision adapter, comprised of cross-attention layers, to incorporate visual information. This adapter receives input from a separate vision encoder (not part of the core Llama 3 model), allowing the model to process images without directly converting them into text tokens. The <|image|> tag within the prompt signifies the presence of an image and dictates where the visual information is integrated via cross-attention; this integration occurs after the image tag and influences subsequent text tokens. The model supports a context length of 128k tokens and utilizes Grouped-Query Attention (GQA). The model family was trained on 6B image-text pairs, with a pretraining data cutoff of December 2023. Supported languages are English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai; image-text tasks, however, are supported only in English. The model's training involves supervised fine-tuning (SFT) and reinforcement learning with human feedback (RLHF) for instruction-tuned versions. The base models are suitable for text completion, with or without an image, using specific prompt formats. Instruction-tuned models excel at tasks like Visual Question Answering (VQA), Document VQA (DocVQA), image captioning, and image-text retrieval. The training process includes stages of pretraining and annealing, leveraging a vast amount of data and significant computational resources (H100 GPUs). Key capabilities include handling both text and image inputs, answering questions about images, generating captions, and performing visual reasoning. The model does not support built-in tool calling (like brave_search or wolfram_alpha) when an image is present in the prompt; tool calling is only available for text-only inputs. 
The intended use cases cover a wide range of applications, but usage is restricted by the Llama 3.2 Community License and Acceptable Use Policy, particularly regarding languages and potential misuse. Meta emphasizes a responsible deployment approach, including providing tools like Llama Guard for safety and encouraging developers to implement appropriate safeguards. The model underwent extensive evaluations, including red teaming and assessments for critical risks such as CBRNE, child safety, and cyber attacks.
</details>
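A minimal sketch of where the `<|image|>` tag goes in a prompt. The chat-template tokens shown for the instruction-tuned variant follow the standard Llama 3 format, but consult the official model card for the exact template:

```python
def build_vision_prompt(question, has_image=True):
    """Base-model prompt sketch: the <|image|> tag marks where image features
    enter via cross-attention, so it must precede the text it should condition."""
    return ("<|image|>" if has_image else "") + question

def build_chat_prompt(question):
    """Instruction-tuned prompt sketch using Llama 3 chat-template tokens
    (illustrative; verify against the official model card)."""
    return ("<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"
            f"<|image|>{question}<|eot_id|>"
            "<|start_header_id|>assistant<|end_header_id|>\n\n")
```

Because cross-attention only influences tokens after the tag, placing `<|image|>` at the start of the user turn ensures the whole question is conditioned on the image.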
SmolVLM: A Small, Efficient, and Open-Source Vision-Language Model
SmolVLM is a 2B parameter vision-language model (VLM) that achieves state-of-the-art performance for its memory footprint, offering a small, fast, and memory-efficient solution for multimodal tasks. It is fully open-source, with all model checkpoints, datasets, training recipes, and tools released under the Apache 2.0 license, enabling local deployment, reduced inference costs, and user customization.
Andres Marafioti, Merve Noyan, Miquel FarrΓ©, Elie Bakouch, Pedro Cuenca
<p align="center"> <img src="https://github.com/user-attachments/assets/901ed802-5c1c-4733-ab2a-6b61514b9c71" width="600"/> </p>
<details> <summary>βΉοΈ <i>More Information</i></summary>
SmolVLM builds upon the architecture of Idefics3, leveraging a similar implementation in transformers but with key differences to enhance efficiency. It replaces the Llama 3.1 8B language backbone with the smaller SmolLM2 1.7B model. A more aggressive image compression strategy is employed, using a pixel shuffle strategy that reduces visual information by a factor of 9 (compared to 4x in Idefics3). This allows for 384x384 patches, and a shape-optimized SigLIP is used as the vision backbone with 14x14 inner patches. The model demonstrates superior memory usage compared to other VLMs in transformers, enabling efficient on-device inference. For instance, encoding a single image and prompt requires only 1.2k tokens, significantly less than models like Qwen2-VL. This efficiency translates to faster prefill and generation throughputs. SmolVLM achieves strong performance on benchmarks such as MMMU, MathVista, MMStar, DocVQA, and TextVQA. It also shows promising results in basic video analysis, leveraging its long context capabilities. Training involved extending the context window of SmolLM2 to 16k tokens using techniques like RoPE base value adjustment and fine-tuning on a mixture of long- and short-context datasets. A curated training dataset, largely based on The Cauldron and Docmatix, was used for the VLM training. Checkpoint selection was based on a weighted metric across multiple vision-language benchmarks. The model is integrated with VLMEvalKit for easy evaluation, and it can be readily used and fine-tuned with the transformers library. TRL integration allows for applying Direct Preference Optimization (DPO). A notebook is provided for fine-tuning on VQAv2, with options for LoRA, QLoRA, or full fine-tuning, even within the constraints of consumer GPUs. </details>
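The 9x pixel-shuffle compression can be sketched as a space-to-depth fold over the token grid. The list-of-lists grid below is an illustrative stand-in for the real feature tensors:

```python
def pixel_shuffle(grid, r=3):
    """Space-to-depth "pixel shuffle" sketch: fold each r x r block of visual
    tokens into one token with r*r times the channels, cutting the token count
    by r**2 (r=3 gives the 9x reduction described above; Idefics3 uses r=2 for
    4x). `grid` is an H x W grid of per-token feature lists."""
    H, W = len(grid), len(grid[0])
    assert H % r == 0 and W % r == 0
    out = []
    for i in range(0, H, r):
        row = []
        for j in range(0, W, r):
            merged = []
            for di in range(r):
                for dj in range(r):
                    merged.extend(grid[i + di][j + dj])  # concatenate channels
            row.append(merged)
        out.append(row)
    return out
```

No information is discarded at this step; it is repacked into wider tokens, and the cost of compressing it further is pushed onto the projector and LLM.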
Idefics2
IDEFICS2, an 8B parameter open-source vision-language model, efficiently processes interleaved image and text sequences by combining a SigLIP vision encoder, a Mistral-7B LLM, and a Perceiver pooling layer with MLP projection for robust text encoding, excelling in tasks like OCR and document understanding.
Hugo LaurenΓ§on, LΓ©o Tronchon, Matthieu Cord, Victor Sanh
<p align="center">
<img src="https://github.com/gokayfem/awesome-vlm-architectures/assets/88277926/c197c8c5-8da2-4d96-8999-8e05e81f1506" />
</p>
<details>
<summary>βΉοΈ <i>More Information</i></summary>
IDEFICS2 is an 8B parameter open-source vision-language model adept at handling interleaved image and text sequences. IDEFICS2 utilizes a vision-language architecture designed for efficient processing of image and text sequences. It employs the SigLIP model as the vision encoder, extracting features from images in their native resolutions and aspect ratios. The Mistral-7B model serves as the LLM backbone, providing language understanding and generation capabilities. For text encoding, IDEFICS2 leverages a Perceiver pooling layer followed by an MLP projection to integrate visual features with the LLM's embedding space. This combination of vision encoder, LLM, and text encoder enables IDEFICS2 to handle various multimodal tasks, with a particular focus on OCR and document understanding. The model is trained on a diverse dataset encompassing OBELICS, LAION Coco, and PMD, with additional data for OCR tasks. Fine-tuning is performed on instruction datasets like The Cauldron and OpenHermes-2.5. </details>
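Perceiver pooling can be illustrated with a minimal single-head cross-attention sketch: a fixed set of learned latent queries attends to a variable-length sequence of image features, so the output length is constant regardless of image resolution. The latent values and the absence of learned projections and the MLP are simplifying assumptions, not IDEFICS2's actual parameters:

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def perceiver_pool(image_feats, latents):
    """Each latent query attends over all image features (scaled dot-product)
    and outputs a weighted average; output length == len(latents)."""
    pooled = []
    for q in latents:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q))
                  for k in image_feats]
        w = softmax(scores)
        dim = len(image_feats[0])
        pooled.append([sum(w[i] * image_feats[i][d]
                           for i in range(len(image_feats)))
                       for d in range(dim)])
    return pooled
```

This fixed-length output is what lets native-resolution SigLIP features of any length be projected into a constant token budget for the Mistral-7B backbone.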
Idefics3-8B: Building and Better Understanding Vision-Language Models
Idefics3-8B is a powerful open-source vision-language model (VLM) that significantly outperforms its predecessor, Idefics2-8B, while being trained efficiently and exclusively on open datasets. It leverages a straightforward pipeline and introduces Docmatix, a massive dataset for document understanding, to achieve state-of-the-art performance within its size category across various multimodal benchmarks.
truncated β full list on GitHub