[TOC]

  1. Title: Recent Language Model Techniques 2024
  2. Review Date: Thu, Apr 25, 2024
  3. url: https://www.youtube.com/watch?v=kzB23CoZG30
  4. url2: https://www.youtube.com/watch?v=iH-wmtxHunk
  5. url3: https://www.youtube.com/watch?v=o68RRGxAtDo

Llama 3

image-20240425125031837

  • key modification: grouped query attention (GQA)

  • key instruction-tuning process:

    • Their approach to post-training is a combination of supervised fine-tuning (SFT), rejection sampling, proximal policy optimization (PPO), and direct preference optimization (DPO).
    • The quality of the prompts that are used in SFT and the preference rankings that are used in PPO and DPO has an outsized influence on the performance of aligned models.
  • fine-tuning tool: torchtune

Llama architecture

  • RMSNorm pre-normalization: the input to each sub-layer is normalized with RMSNorm, which rescales activations by their root mean square (no mean subtraction) before the attention and feed-forward blocks
  • SwiGLU activation function: the feed-forward block uses SwiGLU, a gated linear unit whose gate passes through SiLU/Swish, in place of a plain ReLU MLP
  • Rotary Position Embedding (RoPE): instead of adding absolute position embeddings, Llama rotates pairs of query/key dimensions by position-dependent angles (a minimal sketch of all three components follows below)
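
A minimal PyTorch sketch of these three components. The dimension names, `eps`, the RoPE `base`, and the "rotate-half" pairing convention are illustrative assumptions, not Llama's exact configuration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """Root-mean-square normalization: rescale by 1/RMS(x); no mean subtraction."""
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x):
        rms = torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)
        return self.weight * x * rms

class SwiGLU(nn.Module):
    """Gated feed-forward block: SiLU(x W_gate) * (x W_up), projected back with W_down."""
    def __init__(self, dim, hidden_dim):
        super().__init__()
        self.w_gate = nn.Linear(dim, hidden_dim, bias=False)
        self.w_up = nn.Linear(dim, hidden_dim, bias=False)
        self.w_down = nn.Linear(hidden_dim, dim, bias=False)

    def forward(self, x):
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))

def rope(x, base=10000.0):
    """Rotary position embedding: rotate (x1, x2) channel pairs by position-dependent angles."""
    seq_len, dim = x.shape[-2], x.shape[-1]
    half = dim // 2
    freqs = base ** (-torch.arange(half, dtype=x.dtype) / half)            # (half,)
    angles = torch.arange(seq_len, dtype=x.dtype)[:, None] * freqs[None]   # (seq, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
```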

Extra: Sampling and Rejection Sampling

ref: https://www.youtube.com/watch?v=9ixzzPQWuAY (Inverse Transform Sampling)

ref: https://www.youtube.com/watch?v=OXDqjdVVePY (Accept-Reject Sampling)
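
A toy accept-reject sampler to go with the refs above. The triangular target density and the bound `M` are made-up illustrations; note this is the classical Monte Carlo technique, whereas "rejection sampling" in Llama-style post-training refers to keeping the best of several sampled responses according to a reward model:

```python
import numpy as np

def rejection_sample(target_pdf, proposal_sample, proposal_pdf, M, n):
    """Accept-reject: draw x ~ proposal, accept with probability target(x) / (M * proposal(x))."""
    samples = []
    while len(samples) < n:
        x = proposal_sample()
        u = np.random.uniform()
        if u < target_pdf(x) / (M * proposal_pdf(x)):
            samples.append(x)
    return np.array(samples)

# example: sample from the triangular density p(x) = 2x on [0, 1] with a uniform proposal
target = lambda x: 2 * x          # target pdf
proposal_pdf = lambda x: 1.0      # Uniform(0, 1) pdf
draws = rejection_sample(target, np.random.uniform, proposal_pdf, M=2.0, n=1000)
```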

Grouped Query Attention

  • Purpose: reduce computation and, especially, the size and memory bandwidth of the KV cache at inference time, while staying close to full multi-head attention in quality
  • Predecessor architecture: Multi-Query Attention (MQA), which shares a single key/value head across all query heads; GQA interpolates between MHA and MQA by sharing each key/value head within a group of query heads (see the sketch after the figures below)
image-20240425154324824

image-20240425154556029

image-20240425154646875
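
A minimal sketch of grouped-query attention (tensor layout and names are assumptions): `n_heads` query heads share `n_kv_heads` key/value heads, and each K/V head is broadcast to a group of `n_heads / n_kv_heads` query heads, so the KV cache shrinks by that factor.

```python
import torch

def grouped_query_attention(q, k, v, n_heads, n_kv_heads):
    # q: (batch, seq, n_heads, head_dim); k, v: (batch, seq, n_kv_heads, head_dim)
    # requires n_heads % n_kv_heads == 0
    group = n_heads // n_kv_heads
    # each key/value head is shared by `group` query heads
    k = k.repeat_interleave(group, dim=2)
    v = v.repeat_interleave(group, dim=2)
    q, k, v = (t.transpose(1, 2) for t in (q, k, v))       # (batch, heads, seq, head_dim)
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5   # scaled dot-product
    attn = scores.softmax(dim=-1)
    return (attn @ v).transpose(1, 2)                       # (batch, seq, n_heads, head_dim)

# usage
b, s, h, kvh, d = 2, 16, 8, 2, 64
out = grouped_query_attention(torch.randn(b, s, h, d),
                              torch.randn(b, s, kvh, d),
                              torch.randn(b, s, kvh, d),
                              n_heads=h, n_kv_heads=kvh)    # (2, 16, 8, 64)
```

With `n_kv_heads == 1` this reduces to multi-query attention; with `n_kv_heads == n_heads` it is standard multi-head attention.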

Result

image-20240425154750368

HorNet: Efficient High-Order Spatial Interactions with Recursive Gated Convolutions

ref: https://arxiv.org/abs/2207.14284

image-20240425223058722

image-20240425223112089

Check definition of depthwise convolution: https://www.youtube.com/watch?v=vVaRhZXovbw
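
A short PyTorch illustration of a depthwise convolution, plus a crude first-order gated interaction in the spirit of gnConv. The gating part is a deliberate simplification, not HorNet's full order-n recursion, and the channel counts are arbitrary:

```python
import torch
import torch.nn as nn

channels = 64
# depthwise convolution: groups == in_channels, so each channel gets its own 3x3 filter
depthwise = nn.Conv2d(channels, channels, kernel_size=3, padding=1, groups=channels)
# usually followed by a 1x1 pointwise conv to mix channels (depthwise-separable convolution)
pointwise = nn.Conv2d(channels, channels, kernel_size=1)

x = torch.randn(1, channels, 32, 32)
y = pointwise(depthwise(x))

# simplified first-order gated interaction: elementwise product of a 1x1-projected branch
# and a depthwise-filtered branch (gnConv stacks such gated products recursively)
gate = nn.Conv2d(channels, channels, kernel_size=1)
z = gate(x) * depthwise(x)
```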

image-20240425223355380

Capability Forgetting (from GLM-4)

In post-training, after the SFT stage, the authors also observed unexpected behaviour in the policy after the RLHF stage: the model shows a reduced capability in handling specific scenarios.

  • reason
    • this behaviour could be attributed to a mismatch between the data distributions of the different training stages, or to the reward model failing to capture such nuanced details.
  • solution
    • to overcome this, the authors incorporate an extra supervised next-token prediction loss as an additional regularization term, besides the KL divergence, in PPO (see the sketch after this list)
    • this is intended to preserve the pre-existing abilities of the SFT model: the model's outputs are aligned with human preferences through RLHF, while next-token prediction captures more granular, token-level signals
    • the next-token prediction loss is computed on a small amount of human-annotated (prompt, response) pairs $\mathcal{D}_S$, which are specific to particular tasks and serve as supervised regularization (trying to shift the data distribution back)
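
A hypothetical sketch of the combined objective. The coefficient values and tensor names are assumptions; only the structure, PPO loss plus KL penalty plus a supervised next-token loss on $\mathcal{D}_S$, follows the description above:

```python
import torch.nn.functional as F

def rlhf_step_loss(ppo_loss, kl_div, logits_s, labels_s, kl_coef=0.1, sft_coef=0.1):
    """Combined objective: PPO loss + KL regularization + supervised next-token
    prediction on a small human-annotated set D_S (coefficients are illustrative)."""
    # cross-entropy next-token loss on the supervised (prompt, response) pairs,
    # shifting labels by one position; -100 masks out prompt/padding tokens
    nll_s = F.cross_entropy(
        logits_s[:, :-1].reshape(-1, logits_s.size(-1)),
        labels_s[:, 1:].reshape(-1),
        ignore_index=-100,
    )
    return ppo_loss + kl_coef * kl_div + sft_coef * nll_s
```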

Direct Preference Optimization

Motivating question: can we just use a cross-entropy-style loss instead of PPO? DPO does exactly that: it rewrites the RLHF objective as a binary classification loss over preference pairs, with the policy itself acting as an implicit reward model (see the sketch at the end of this section).

image-20240426231336360

image-20240426232104891
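
A minimal sketch of the DPO loss. The inputs are the summed log-probabilities of each full response under the policy and under the frozen reference (SFT) model; β = 0.1 is just an illustrative value:

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """-log sigmoid(beta * (log-ratio of chosen - log-ratio of rejected)),
    where each log-ratio is policy log-prob minus reference log-prob of the response."""
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()
```

Minimizing this pushes the policy to put relatively more probability on the chosen response than on the rejected one, without training a separate reward model or running PPO rollouts.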