[TOC]

  1. Title: Llama Adapter V2
  2. Author: Peng Gao et al.
  3. Publish Year: 28 Apr 2023
  4. Review Date: Mon, Aug 28, 2023
  5. url: https://arxiv.org/pdf/2304.15010.pdf

Summary of paper

The paper presents LLaMA-Adapter V2, an enhanced version of the original LLaMA-Adapter designed for multi-modal reasoning and instruction following. It addresses two limitations of the original adapter: it could not generalize well to open-ended visual instructions, and it lagged behind GPT-4 in performance.

Key Features of LLaMA-Adapter V2:

  1. More Learnable Parameters: The new version unlocks additional learnable parameters such as norms, biases, and scales, distributing the instruction-following ability across the entire LLaMA model rather than concentrating it in the adapters (a minimal sketch follows this list).
  2. Early Fusion Strategy: Visual tokens are fed only into the early layers of the large language model (LLM), which helps the model incorporate visual knowledge more effectively (see the early-fusion sketch below).
  3. Joint Training Paradigm: A joint training approach optimizes disjoint groups of learnable parameters on image-text pairs and on instruction-following data, which minimizes interference between the two tasks and improves multi-modal reasoning (see the joint-training sketch below).
  4. Expert Models: During inference, additional expert models such as captioning and OCR systems are incorporated to enhance the model's image-understanding capability without incurring additional training cost (see the prompt-assembly sketch below).
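
A minimal PyTorch sketch of the "more learnable parameters" idea: freeze the pretrained backbone, re-enable the norm layers, and wrap each frozen linear layer with a learnable per-channel scale and bias. Class and function names here are my own illustrations, not the released LLaMA-Adapter V2 code.

```python
import torch
import torch.nn as nn

class ScaledBiasLinear(nn.Module):
    """Frozen pretrained linear plus a learnable per-channel scale and bias."""
    def __init__(self, linear: nn.Linear):
        super().__init__()
        self.linear = linear
        for p in self.linear.parameters():
            p.requires_grad = False          # pretrained weights stay frozen
        self.scale = nn.Parameter(torch.ones(linear.out_features))
        self.bias = nn.Parameter(torch.zeros(linear.out_features))

    def forward(self, x):
        return self.scale * self.linear(x) + self.bias

def _wrap_linears(module: nn.Module):
    for name, child in module.named_children():
        if isinstance(child, nn.Linear):
            setattr(module, name, ScaledBiasLinear(child))
        else:
            _wrap_linears(child)

def unlock_extra_parameters(model: nn.Module) -> nn.Module:
    """Freeze the backbone, re-enable norms, add scale/bias to every linear."""
    for p in model.parameters():
        p.requires_grad = False
    for module in model.modules():
        # LLaMA uses RMSNorm; substitute the model's own norm class here
        if isinstance(module, nn.LayerNorm):
            for p in module.parameters():
                p.requires_grad = True       # norm parameters become trainable
    _wrap_linears(model)
    return model
```

A fine-tuning script would then pass only `[p for p in model.parameters() if p.requires_grad]` to the optimizer, so the bulk of LLaMA stays untouched.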
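
A rough sketch of the early-fusion idea under simplified assumptions: image features are projected into the token embedding space, and only the first few transformer blocks attend to them, while later blocks see text tokens only. The layer and argument names are illustrative, and causal masking is omitted for brevity.

```python
import torch
import torch.nn as nn

class EarlyFusionLM(nn.Module):
    def __init__(self, dim=512, n_layers=8, early_layers=2, n_heads=8, vocab=32000):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.visual_proj = nn.Linear(768, dim)   # e.g. CLIP features -> LLM width
        self.blocks = nn.ModuleList(
            nn.TransformerEncoderLayer(dim, n_heads, batch_first=True)
            for _ in range(n_layers)
        )
        self.early_layers = early_layers

    def forward(self, text_ids, visual_feats):
        h = self.embed(text_ids)                 # (B, T, dim) text tokens
        v = self.visual_proj(visual_feats)       # (B, V, dim) visual tokens
        n_vis = v.shape[1]
        for i, block in enumerate(self.blocks):
            if i < self.early_layers:
                # early layers: visual tokens are prepended, then dropped again
                h = block(torch.cat([v, h], dim=1))[:, n_vis:]
            else:
                # later layers: text-only, so instruction tuning is undisturbed
                h = block(h)
        return h

# usage: model = EarlyFusionLM()
# out = model(torch.randint(0, 32000, (2, 16)), torch.randn(2, 10, 768))
```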
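
A minimal sketch of the joint-training paradigm: two disjoint parameter groups, one updated on image-caption batches and the other on instruction-following batches, alternating between the two data sources. The group membership, loader names, and the HF-style `.loss` output are assumptions for illustration.

```python
import itertools
import torch

def joint_train(model, caption_loader, instruct_loader,
                visual_params, language_params, steps=1000, lr=1e-4):
    """Alternate batches and only step the optimizer that owns that task's params."""
    opt_visual = torch.optim.AdamW(visual_params, lr=lr)
    opt_language = torch.optim.AdamW(language_params, lr=lr)
    caption_iter = itertools.cycle(caption_loader)
    instruct_iter = itertools.cycle(instruct_loader)
    for step in range(steps):
        if step % 2 == 0:                        # image-text pair batch
            loss = model(**next(caption_iter)).loss
            opt_visual.zero_grad(); loss.backward(); opt_visual.step()
        else:                                    # instruction-following batch
            loss = model(**next(instruct_iter)).loss
            opt_language.zero_grad(); loss.backward(); opt_language.step()
```

Because each optimizer only sees its own group, gradients from captioning data never overwrite the instruction-following parameters, and vice versa.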
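
A small sketch of how expert models could be plugged in at inference time: an external captioner or OCR system produces text that is prepended to the visual instruction, giving the LLM extra image context with no extra training. `captioner`, `ocr`, and the prompt layout are placeholders, not the paper's exact pipeline.

```python
def build_prompt(instruction: str, image, captioner=None, ocr=None) -> str:
    """Assemble a prompt that folds expert outputs in as plain text context."""
    context = []
    if captioner is not None:
        context.append(f"Caption: {captioner(image)}")
    if ocr is not None:
        context.append(f"Detected text: {ocr(image)}")
    return "\n".join(context + [f"Instruction: {instruction}", "Response:"])
```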

Advantages:

The paper positions LLaMA-Adapter V2 as a parameter-efficient model that is nonetheless considerably more capable than the original adapter at handling open-ended visual instructions and multi-modal reasoning tasks.