LLaVA series | Release date | Paper / article title | Model architecture | Summary | Data used | Key experimental results | Notes |
--- | --- | --- | --- | --- | --- | --- | --- |
llava | 2023.4 | Visual Instruction Tuning | | First work to use GPT-4 (a language-only model whose input is text only) to generate a multimodal instruction-following dataset. Compared with GPT-4 … | | | |
llava-1.5 | 2023.10 | Improved Baselines with Visual Instruction Tuning | | | | | |
llava-next | 2024.1.30 | LLaVA-NeXT: Improved reasoning, OCR, and world knowledge | | 1. Increasing the input image resolution to 4x more pixels, which allows it to grasp more visual details. It supports three aspect ratios, up to 672x672, 336x1344, and 1344x336 resolution (the AnyRes tiling behind this is sketched after the table). 2. Better visual reasoning and OCR capability with an improved visual instruction tuning data mixture. 3. Better visual conversation for more scenarios, covering different applications. 4. Better world knowledge and logical reasoning. | | | |
llava-next (Video) | 2024.4.30 | LLaVA-NeXT: A Strong Zero-shot Video Understanding Model | | 1. Zero-shot video representation capabilities with AnyRes: this is the first time that LMMs show strong zero-shot modality transfer ability (this capability comes from the AnyRes design in llava-next). 2. Inference with length generalization improves on longer videos. 3. Strong video understanding ability. | | | |
llava-next | 2024.5.10 | LLaVA-NeXT: Stronger LLMs Supercharge Multimodal Capabilities in the Wild | | Experiments verify that, under otherwise identical conditions, a stronger LLM significantly boosts a multimodal model's capabilities. Scaling LLaVA-NeXT with recent, stronger LLMs yields two findings: 1. Increasing multimodal capabilities with stronger & larger language models, up to 3x model size. This allows LMMs to present better visual world knowledge and logical reasoning inherited from the LLM. It supports LLaMA3 (8B) and Qwen-1.5 (72B and 110B). 2. Better visual chat for more real-life scenarios, covering different applications. To evaluate the improved multimodal capabilities in the wild, we collect and develop new evaluation datasets, LLaVA-Bench (Wilder), which inherit the spirit of LLaVA-Bench (in-the-wild) to study daily-life visual chat and enlarge the data size for comprehensive evaluation. | | | |
llava-next | 2024.5.25 | LLaVA-NeXT: What Else Influences Visual Instruction Tuning Beyond Data? | | Beyond data, what else influences visual instruction tuning? 1. Architectures: The model size scaling of the LLM is more effective than that of the image encoder in yielding improved performance. The success of the latter is more related to its visual input configuration (resolution, #tokens) than to its model size. 2. Visual Representations: The representation of visual signals relates to both the resolution in the raw pixel space and the number of tokens in the feature space. Scaling both factors leads to improved performance, especially on tasks that require visual details. To strike a balance between performance and cost, we observe that scaling resolution is more effective than scaling the number of tokens, and recommend an AnyRes strategy with pooling (see the pooling sketch after the table). 3. Training Strategies: Complementary to prior LLaVA series that focus only on the visual instruction tuning stage, we explore the impact of training strategies earlier in LLaVA's model life cycle by varying training data amount, quality, and trainable modules. Our findings suggest the significance of incorporating a stage focused on learning from high-quality knowledge, as opposed to web-scale low-quality data. Specifically, this involves training the entire model using synthetic high-quality data re-captioned by LLaVA-NeXT-34B. | | | |
llava-next-Interleave | 2024.6.16 | LLaVA-NeXT: Tackling Multi-image, Video, and 3D in Large Multimodal Models | | Adds support for multi-image, video, and multi-view (3D) inputs, and organizes the M4 data mixture (multi-image, multi-frame, multi-view, multi-patch). Highlights: 1. Interleave data format unifies different tasks. We represent multi-image, video, 3D, and single-image data all in an interleaved training format, which unifies different tasks in a single LLM (see the interleaved-format sketch after the table). 2. New datasets: (1) Training data: M4-Instruct. We compile a high-quality dataset with 1177.6k samples, spanning 4 primary domains (multi-image, video, 3D, and single-image). (2) LLaVA-Interleave Bench. We curate a diverse set of tasks to evaluate multi-image capabilities in 3 scenarios, including 9 newly collected and 13 existing in-/out-of-domain benchmarks. 3. SoTA performance. (1) With a single model, LLaVA-NeXT-Interleave achieves leading results on different multi-image benchmarks compared to the previous SoTA. (2) With a proper data mix of different scenarios, performance on previous individual tasks can be improved or maintained; for example, we maintain the single-image performance of LLaVA-NeXT and improve performance on video tasks. 4. Emerging capabilities with cross-task transfer. By jointly training on a diverse set of visual data modalities, the model shows emerging capabilities to transfer tasks between different scenarios. | | | |
llava-onevision | 2024.8.5 | LLaVA-OneVision: Easy Visual Task Transfer | | | | | |
llava-video | 2024.10.4 | Video Instruction Tuning with Synthetic Data | | | | | |
llava-critic | 2024.10.4 | LLaVA-Critic: Learning to Evaluate Multimodal Models | | | | | |
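
The AnyRes scheme referenced in the llava-next (2024.1.30) and llava-next (Video) rows can be illustrated with a small sketch. The code below is a hypothetical, simplified reading of the idea, not the official LLaVA-NeXT implementation; the grid list, the 336px base size, and the function names are assumptions. It picks the tile grid whose aspect ratio best matches the input image, crops 336x336 tiles for the vision encoder, and prepends a low-resolution global view.

```python
from PIL import Image

# Hypothetical sketch of the AnyRes idea (not the released LLaVA-NeXT code):
# choose a tile grid matching the image aspect ratio, split the image into
# 336x336 tiles, and keep a resized global view of the whole image.

BASE = 336                        # input resolution of the underlying CLIP-ViT
GRIDS = [(2, 2), (1, 4), (4, 1)]  # -> 672x672, 336x1344, 1344x336 as in the row above

def pick_grid(width, height):
    """Pick the (cols, rows) grid whose aspect ratio is closest to the image's."""
    aspect = width / height
    return min(GRIDS, key=lambda g: abs((g[0] / g[1]) - aspect))

def anyres_tiles(image: Image.Image):
    cols, rows = pick_grid(*image.size)
    resized = image.resize((cols * BASE, rows * BASE))
    tiles = [
        resized.crop((c * BASE, r * BASE, (c + 1) * BASE, (r + 1) * BASE))
        for r in range(rows) for c in range(cols)
    ]
    global_view = image.resize((BASE, BASE))   # low-res view of the full image
    return [global_view] + tiles               # each view is encoded independently

if __name__ == "__main__":
    img = Image.new("RGB", (1200, 350))        # a wide dummy image
    views = anyres_tiles(img)
    print(len(views), [v.size for v in views]) # 5 views: 1 global + 4 tiles (4x1 grid)
```

The zero-shot video transfer in the llava-next (Video) row builds on the same representation: frames sampled from a video can occupy the per-view slots that AnyRes already reserves for image tiles, which is why the blog attributes the modality transfer to AnyRes.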
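
The "AnyRes with pooling" recommendation in the 2024.5.25 row trades token count against visual detail. A minimal sketch follows, assuming 576 visual tokens (a 24x24 grid) per 336px CLIP-ViT tile and a 2x2 average pool; the function name and shapes are illustrative, not the released code.

```python
import torch
import torch.nn.functional as F

# Hypothetical illustration of "AnyRes with pooling": keep high-resolution
# tiling, but average-pool each tile's 24x24 token grid down to 12x12 so the
# LLM sees 4x fewer visual tokens per tile.

def pool_visual_tokens(tile_tokens: torch.Tensor, stride: int = 2) -> torch.Tensor:
    """tile_tokens: (num_tiles, 576, hidden) from a 336px CLIP-ViT (24x24 patches)."""
    n, seq, dim = tile_tokens.shape
    side = int(seq ** 0.5)                                           # 24
    grid = tile_tokens.view(n, side, side, dim).permute(0, 3, 1, 2)  # (n, dim, 24, 24)
    pooled = F.avg_pool2d(grid, kernel_size=stride)                  # (n, dim, 12, 12)
    return pooled.flatten(2).permute(0, 2, 1)                        # (n, 144, dim)

tokens = torch.randn(5, 576, 1024)       # 1 global view + 4 tiles
print(pool_visual_tokens(tokens).shape)  # torch.Size([5, 144, 1024])
```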
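
The interleaved training format in the llava-next-Interleave row unifies single-image, multi-image, video, and 3D data under one conversation schema. The snippet below sketches what such samples could look like; the field names mirror common LLaVA-style conversation data but are assumptions, not the exact M4-Instruct schema.

```python
# Hypothetical examples of the unified interleaved format: every sample is a
# list of images plus a conversation in which an <image> placeholder marks
# where each image is inserted.

multi_image_sample = {
    "images": ["chart_2022.png", "chart_2023.png"],
    "conversations": [
        {"from": "human", "value": "<image>\n<image>\nWhat changed between the two charts?"},
        {"from": "gpt",   "value": "Revenue grew while costs stayed flat."},
    ],
}

video_sample = {  # a video is treated as a sequence of sampled frames
    "images": [f"clip_frame_{i:02d}.jpg" for i in range(8)],
    "conversations": [
        {"from": "human", "value": "<image>" * 8 + "\nDescribe what happens in the clip."},
        {"from": "gpt",   "value": "A person opens a door and walks outside."},
    ],
}

multi_view_sample = {  # 3D scenes enter as multiple camera views of one scene
    "images": [f"scene_view_{i}.jpg" for i in range(4)],
    "conversations": [
        {"from": "human", "value": "<image>" * 4 + "\nHow many chairs are in the room?"},
        {"from": "gpt",   "value": "There are three chairs around the table."},
    ],
}
```

Because all four data types reduce to the same structure, a single model can be trained on the mixture without task-specific heads, which is what enables the cross-task transfer described in the row.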