[TOC]
- Title: Llama Adapter V2
- Author: Peng Gao et al.
- Publish Year: 28 Apr 2023
- Review Date: Mon, Aug 28, 2023
- url: https://arxiv.org/pdf/2304.15010.pdf
Summary of paper
The paper presents LLaMA-Adapter V2, an enhanced version of the original LLaMA-Adapter designed for multi-modal reasoning and instruction following. It addresses the limitations of the original adapter, which could not generalize well to open-ended visual instructions and lagged behind GPT-4 in performance.
Key Features of LLaMA-Adapter V2:
- More Learnable Parameters: The new version unlocks additional learnable parameters such as norms, biases, and scales, distributing the instruction-following ability across the entire LLaMA model rather than concentrating it in the adapters (a parameter-unlocking sketch follows this list).
- Early Fusion Strategy: Visual tokens are fed only into the early layers of the Large Language Model (LLM), which helps incorporate visual knowledge more effectively (see the early-fusion sketch after this list).
- Joint Training Paradigm: It introduces a joint training approach that optimizes disjoint groups of learnable parameters on image-text pairs and on instruction-following data, which minimizes interference between the two objectives and improves multi-modal reasoning (see the joint-training sketch after this list).
- Expert Models: During inference, additional expert models like captioning and OCR systems are incorporated to enhance the model's image understanding capabilities without incurring additional training costs.
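A minimal PyTorch sketch of the parameter-unlocking idea: freeze the backbone, attach a learnable per-output scale to each linear layer, and re-enable only normalization weights, biases, and the new scales. The names (`ScaledLinear`, `add_scales`, `unlock_bias_norm_scale`) and the toy block are illustrative assumptions, not the paper's code; real LLaMA blocks use RMSNorm rather than LayerNorm.

```python
import torch
import torch.nn as nn

class ScaledLinear(nn.Module):
    # Wraps an existing linear layer with a learnable per-output scale,
    # initialized to 1 so the model's behavior is unchanged at first.
    def __init__(self, linear: nn.Linear):
        super().__init__()
        self.linear = linear
        self.scale = nn.Parameter(torch.ones(linear.out_features))

    def forward(self, x):
        return self.scale * self.linear(x)

def add_scales(module: nn.Module):
    # Recursively wrap every nn.Linear with a ScaledLinear.
    for name, child in module.named_children():
        if isinstance(child, nn.Linear):
            setattr(module, name, ScaledLinear(child))
        else:
            add_scales(child)

def unlock_bias_norm_scale(model: nn.Module):
    # Freeze everything, then re-enable only norm layers, biases, and scales.
    for p in model.parameters():
        p.requires_grad = False
    for m in model.modules():
        if isinstance(m, nn.LayerNorm):  # real LLaMA uses RMSNorm here
            for p in m.parameters():
                p.requires_grad = True
    for name, p in model.named_parameters():
        if name.endswith("bias") or name.endswith("scale"):
            p.requires_grad = True

# Tiny stand-in for one transformer MLP block, just to show the ratio.
block = nn.Sequential(
    nn.LayerNorm(512),
    nn.Linear(512, 2048),
    nn.GELU(),
    nn.Linear(2048, 512),
)
add_scales(block)
unlock_bias_norm_scale(block)
trainable = sum(p.numel() for p in block.parameters() if p.requires_grad)
total = sum(p.numel() for p in block.parameters())
print(f"trainable: {trainable:,} / total: {total:,}")  # a small fraction
```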
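One way to picture the early-fusion strategy: project the visual features and let only the first few blocks attend over them, while later blocks process text states alone. This is a simplified sketch using plain concatenation into generic encoder layers, not the paper's exact injection mechanism; `EarlyFusionLM` and `fusion_layers` are our own names.

```python
import torch
from torch import nn

class EarlyFusionLM(nn.Module):
    # Toy stack: projected visual tokens are visible only to the first
    # `fusion_layers` blocks; later blocks see text states alone.
    def __init__(self, dim=64, n_layers=8, fusion_layers=2):
        super().__init__()
        self.fusion_layers = fusion_layers
        self.visual_proj = nn.Linear(dim, dim)
        self.blocks = nn.ModuleList(
            nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
            for _ in range(n_layers)
        )

    def forward(self, text_tokens, visual_feats):
        vis = self.visual_proj(visual_feats)      # (batch, n_vis, dim)
        h = text_tokens                           # (batch, n_text, dim)
        for i, block in enumerate(self.blocks):
            if i < self.fusion_layers:
                # Early layers attend over [visual ; text], then drop the
                # visual positions so the sequence length stays constant.
                h = block(torch.cat([vis, h], dim=1))[:, vis.size(1):]
            else:
                # Later layers process text states only.
                h = block(h)
        return h

model = EarlyFusionLM()
out = model(torch.randn(2, 10, 64), torch.randn(2, 4, 64))
print(out.shape)  # torch.Size([2, 10, 64])
```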
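And a rough sketch of the joint training paradigm: keep two disjoint groups of trainable parameters and let each data source update only its own group, so the captioning objective and the instruction objective never pull on the same weights. The toy model and the particular split below are our own illustration under that assumption, not the paper's actual partition.

```python
import torch
from torch import nn

class ToyMultiModalLM(nn.Module):
    # Tiny stand-in for the adapted LLM; module names are ours, chosen so
    # the parameter split below is easy to read.
    def __init__(self, dim=64, vocab=100):
        super().__init__()
        self.visual_proj = nn.Linear(dim, dim)     # projects image features
        self.backbone = nn.Linear(dim, dim)        # frozen, like the LLM body
        self.backbone.requires_grad_(False)
        self.instruct_bias = nn.Parameter(torch.zeros(dim))
        self.head = nn.Linear(dim, vocab)

    def forward(self, text_feats, image_feats=None):
        h = text_feats + self.instruct_bias
        if image_feats is not None:
            h = h + self.visual_proj(image_feats)
        return self.head(self.backbone(h))

model = ToyMultiModalLM()

# Disjoint groups: the visual projection learns from image-text pairs,
# the instruction bias (and head, here) learns from instruction data.
opt_visual = torch.optim.AdamW(model.visual_proj.parameters(), lr=1e-4)
opt_instruct = torch.optim.AdamW(
    [model.instruct_bias] + list(model.head.parameters()), lr=1e-4
)
loss_fn = nn.CrossEntropyLoss()

for step in range(2):
    # Image-text batch: only the visual group is stepped.
    text, image = torch.randn(8, 64), torch.randn(8, 64)
    target = torch.randint(0, 100, (8,))
    opt_visual.zero_grad()
    loss_fn(model(text, image), target).backward()
    opt_visual.step()

    # Instruction-only batch: only the instruction group is stepped,
    # so the two objectives do not interfere through shared parameters.
    text = torch.randn(8, 64)
    target = torch.randint(0, 100, (8,))
    opt_instruct.zero_grad()
    loss_fn(model(text), target).backward()
    opt_instruct.step()
```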
Advantages:
- Parameter Efficiency: Only 14 million additional parameters are introduced over the original LLaMA model.
- Improved Performance: The new version performs better on open-ended multi-modal instructions and even excels in chat interactions.
The paper suggests that LLaMA-Adapter V2 is a more parameter-efficient and capable model for handling visual instructions and multi-modal reasoning tasks.