LLaVA series | Release date | Paper / article title | Model architecture | Summary | Data used | Key experimental results | Notes |
--- | --- | --- | --- | --- | --- | --- | --- |
llava | 2023.4 | Visual Instruction Tuning | | First work to use GPT-4 (a language-only model whose input is text only) to generate a multimodal instruction-following dataset. Compared with GPT-4 … | | | |
llava-1.5 | 2023.10 | Improved Baselines with Visual Instruction Tuning | | | | | |
llava-next | 2024.1.30 | LLaVA-NeXT: Improved reasoning, OCR, and world knowledge | | 1. Increasing the input image resolution to 4x more pixels, which allows it to grasp more visual details. It supports three aspect ratios, up to 672x672, 336x1344, and 1344x336 resolution (the AnyRes tiling behind this is sketched after the table). 2. Better visual reasoning and OCR capability with an improved visual instruction tuning data mixture. 3. Better visual conversation for more scenarios, covering different applications. 4. Better world knowledge and logical reasoning. | | | |
llava-next (Video) | 2024.4.30 | LLaVA-NeXT: A Strong Zero-shot Video Understanding Model | | 1. Zero-shot video representation capabilities with AnyRes: this is the first time that LMMs show strong zero-shot modality transfer ability (this capability comes from the AnyRes design in llava-next). 2. Inference with length generalization improves on longer videos. 3. Strong video understanding ability. | | | |
llava-next | 2024.5.10 | LLaVA-NeXT: Stronger LLMs Supercharge Multimodal Capabilities in the Wild | | Experiments verify that, under otherwise identical conditions, a stronger LLM significantly boosts a multimodal model's capabilities. Scaling LLaVA-NeXT with recent, stronger LLMs yields two findings: 1. Increasing multimodal capabilities with stronger & larger language models, up to 3x model size. This allows LMMs to present better visual world knowledge and logical reasoning inherited from the LLM. It supports LLaMA3 (8B) and Qwen-1.5 (72B and 110B). 2. Better visual chat for more real-life scenarios, covering different applications. To evaluate the improved multimodal capabilities in the wild, we collect and develop new evaluation datasets, LLaVA-Bench (Wilder), which inherit the spirit of LLaVA-Bench (in-the-wild) to study daily-life visual chat and enlarge the data size for comprehensive evaluation. | | | |
llava-next | 2024.5.25 | LLaVA-NeXT: What Else Influences Visual Instruction Tuning Beyond Data? | | Beyond data, what else influences visual instruction tuning? 1. Architectures: The model size scaling of the LLM is more effective than that of the image encoder in yielding improved performance. The success of the latter is more related to its visual input configuration (resolution, #tokens) than to its model size. 2. Visual Representations: The representation of visual signals relates to both the resolution in the raw pixel space and the number of tokens in the feature space. Scaling both factors leads to improved performance, especially on tasks that require visual details. To strike a balance between performance and cost, we observe that scaling resolution is more effective than scaling the number of tokens, and recommend an AnyRes strategy with pooling (see the pooling sketch after the table). 3. Training Strategies: Complementary to prior LLaVA series that focus only on the visual instruction tuning stage, we explore the impact of training strategies earlier in LLaVA's model life cycle by varying training data amount, quality, and trainable modules. Our findings suggest the significance of incorporating a stage focused on learning from high-quality knowledge, as opposed to web-scale low-quality data. Specifically, this involves training the entire model using synthetic high-quality data re-captioned by LLaVA-NeXT-34B. | | | |
llava-next-Interleave | 2024.6.16 | LLaVA-NeXT: Tackling Multi-image, Video, and 3D in Large Multimodal Models | | Adds support for multi-image, video, and multi-view (3D) inputs, and organizes the M4 data mixture (multi-image, multi-frame, multi-view, multi-patch). Highlights: 1. Interleave data format unifies different tasks. We represent multi-image, video, 3D, and single-image data all in an interleaved training format, which unifies different tasks in a single LLM (see the interleaved-format sketch after the table). 2. New datasets: (1) Training data: M4-Instruct. We compile a high-quality dataset with 1177.6k samples, spanning 4 primary domains (multi-image, video, 3D, and single-image). (2) LLaVA-Interleave Bench. We curate a diverse set of tasks to evaluate multi-image capabilities in 3 scenarios, including 9 newly collected and 13 existing in-/out-of-domain benchmarks. 3. SoTA performance. (1) With a single model, LLaVA-NeXT-Interleave achieves leading results on different multi-image benchmarks compared to the previous SoTA. (2) With a proper data mix of different scenarios, performance on previous individual tasks can be improved or maintained; for example, we maintain the single-image performance of LLaVA-NeXT and improve performance on video tasks. 4. Emerging capabilities with cross-task transfer. By jointly training on a diverse set of visual data modalities, the model shows emerging capabilities to transfer tasks between different scenarios. | | | |
llava-onevision | 2024.8.5 | LLaVA-OneVision: Easy Visual Task Transfer | | | | | |
llava-video | 2024.10.4 | Video Instruction Tuning with Synthetic Data | | | | | |
llava-critic | 2024.10.4 | LLaVA-Critic: Learning to Evaluate Multimodal Models | | | | | |
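
The AnyRes scheme referenced in the llava-next (2024.1.30) and llava-next (Video) rows can be illustrated with a small sketch. The code below is a hypothetical, simplified reading of the idea, not the official LLaVA-NeXT implementation; the grid list, the 336px base size, and the function names are assumptions. It picks the tile grid whose aspect ratio best matches the input image, crops 336x336 tiles for the vision encoder, and prepends a low-resolution global view.

```python
from PIL import Image

# Hypothetical sketch of the AnyRes idea (not the released LLaVA-NeXT code):
# choose a tile grid matching the image aspect ratio, split the image into
# 336x336 tiles, and keep a resized global view of the whole image.

BASE = 336                        # input resolution of the underlying CLIP-ViT
GRIDS = [(2, 2), (1, 4), (4, 1)]  # -> 672x672, 336x1344, 1344x336 as in the row above

def pick_grid(width, height):
    """Pick the (cols, rows) grid whose aspect ratio is closest to the image's."""
    aspect = width / height
    return min(GRIDS, key=lambda g: abs((g[0] / g[1]) - aspect))

def anyres_tiles(image: Image.Image):
    cols, rows = pick_grid(*image.size)
    resized = image.resize((cols * BASE, rows * BASE))
    tiles = [
        resized.crop((c * BASE, r * BASE, (c + 1) * BASE, (r + 1) * BASE))
        for r in range(rows) for c in range(cols)
    ]
    global_view = image.resize((BASE, BASE))   # low-res view of the full image
    return [global_view] + tiles               # each view is encoded independently

if __name__ == "__main__":
    img = Image.new("RGB", (1200, 350))        # a wide dummy image
    views = anyres_tiles(img)
    print(len(views), [v.size for v in views]) # 5 views: 1 global + 4 tiles (4x1 grid)
```

The zero-shot video transfer in the llava-next (Video) row builds on the same representation: frames sampled from a video can occupy the per-view slots that AnyRes already reserves for image tiles, which is why the blog attributes the modality transfer to AnyRes.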
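
The "AnyRes with pooling" recommendation in the 2024.5.25 row trades token count against visual detail. A minimal sketch follows, assuming 576 visual tokens (a 24x24 grid) per 336px CLIP-ViT tile and a 2x2 average pool; the function name and shapes are illustrative, not the released code.

```python
import torch
import torch.nn.functional as F

# Hypothetical illustration of "AnyRes with pooling": keep high-resolution
# tiling, but average-pool each tile's 24x24 token grid down to 12x12 so the
# LLM sees 4x fewer visual tokens per tile.

def pool_visual_tokens(tile_tokens: torch.Tensor, stride: int = 2) -> torch.Tensor:
    """tile_tokens: (num_tiles, 576, hidden) from a 336px CLIP-ViT (24x24 patches)."""
    n, seq, dim = tile_tokens.shape
    side = int(seq ** 0.5)                                           # 24
    grid = tile_tokens.view(n, side, side, dim).permute(0, 3, 1, 2)  # (n, dim, 24, 24)
    pooled = F.avg_pool2d(grid, kernel_size=stride)                  # (n, dim, 12, 12)
    return pooled.flatten(2).permute(0, 2, 1)                        # (n, 144, dim)

tokens = torch.randn(5, 576, 1024)       # 1 global view + 4 tiles
print(pool_visual_tokens(tokens).shape)  # torch.Size([5, 144, 1024])
```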
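
The interleaved training format in the llava-next-Interleave row unifies single-image, multi-image, video, and 3D data under one conversation schema. The snippet below sketches what such samples could look like; the field names mirror common LLaVA-style conversation data but are assumptions, not the exact M4-Instruct schema.

```python
# Hypothetical examples of the unified interleaved format: every sample is a
# list of images plus a conversation in which an <image> placeholder marks
# where each image is inserted.

multi_image_sample = {
    "images": ["chart_2022.png", "chart_2023.png"],
    "conversations": [
        {"from": "human", "value": "<image>\n<image>\nWhat changed between the two charts?"},
        {"from": "gpt",   "value": "Revenue grew while costs stayed flat."},
    ],
}

video_sample = {  # a video is treated as a sequence of sampled frames
    "images": [f"clip_frame_{i:02d}.jpg" for i in range(8)],
    "conversations": [
        {"from": "human", "value": "<image>" * 8 + "\nDescribe what happens in the clip."},
        {"from": "gpt",   "value": "A person opens a door and walks outside."},
    ],
}

multi_view_sample = {  # 3D scenes enter as multiple camera views of one scene
    "images": [f"scene_view_{i}.jpg" for i in range(4)],
    "conversations": [
        {"from": "human", "value": "<image>" * 4 + "\nHow many chairs are in the room?"},
        {"from": "gpt",   "value": "There are three chairs around the table."},
    ],
}
```

Because all four data types reduce to the same structure, a single model can be trained on the mixture without task-specific heads, which is what enables the cross-task transfer described in the row.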