[TOC]
- Title: Recent Language Model Techniques 2024
- Review Date: Thu, Apr 25, 2024
- url: https://www.youtube.com/watch?v=kzB23CoZG30
- url2: https://www.youtube.com/watch?v=iH-wmtxHunk
- url3: https://www.youtube.com/watch?v=o68RRGxAtDo
Llama 3
- key modification: grouped query attention (GQA)
- key instruction-tuning process:
  - Their approach to post-training is a combination of supervised fine-tuning (SFT), rejection sampling, proximal policy optimization (PPO), and direct preference optimization (DPO).
  - The quality of the prompts used in SFT and of the preference rankings used in PPO and DPO has an outsized influence on the performance of aligned models.
- fine-tuning tool: torchtune
Llama architecture
- RMSNorm pre-normalization: each sub-layer input is normalized by its root mean square only (no mean-centering or bias), a cheaper and similarly stable alternative to LayerNorm
- SwiGLU activation function: the feed-forward block is a SiLU-gated linear unit instead of a plain ReLU/GELU MLP
- Rotary Position Embedding (RoPE): positions are encoded by rotating query/key channel pairs by position-dependent angles, so attention scores depend on relative positions (used since the original Llama and kept in Llama 2/3); see the sketch right after this list
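A minimal PyTorch sketch of these three components for my own reference; module and parameter names are illustrative rather than copied from the official Llama code, and the RoPE below uses the half-split channel pairing (GPT-NeoX style) rather than the interleaved pairing in the reference implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """Normalize by the root mean square only (no mean subtraction), then scale."""
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        rms = torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)
        return x * rms * self.weight

class SwiGLU(nn.Module):
    """Gated feed-forward block: down(SiLU(gate(x)) * up(x))."""
    def __init__(self, dim, hidden_dim):
        super().__init__()
        self.gate = nn.Linear(dim, hidden_dim, bias=False)
        self.up = nn.Linear(dim, hidden_dim, bias=False)
        self.down = nn.Linear(hidden_dim, dim, bias=False)

    def forward(self, x):
        return self.down(F.silu(self.gate(x)) * self.up(x))

def apply_rope(x, base=10000.0):
    """Rotate each (first-half, second-half) channel pair of q or k by a
    position-dependent angle; x has shape (batch, seq, n_heads, head_dim)."""
    _, seq, _, d = x.shape
    half = d // 2
    freqs = base ** (-torch.arange(half, dtype=torch.float32) / half)   # (half,)
    angles = torch.arange(seq, dtype=torch.float32)[:, None] * freqs    # (seq, half)
    cos = angles.cos()[None, :, None, :]
    sin = angles.sin()[None, :, None, :]
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
```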
Extra: Sampling and Rejection Sampling
ref: https://www.youtube.com/watch?v=9ixzzPQWuAY (Inverse Transform Sampling)
ref: https://www.youtube.com/watch?v=OXDqjdVVePY (Accept-Reject Sampling)
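A small Python sketch of classical accept-reject sampling from the second video (function names and the normal-distribution example are my own); in LLM post-training, "rejection sampling" is the analogous idea of generating many candidate responses and keeping only the best-scored ones for further fine-tuning.

```python
import numpy as np
from scipy.stats import norm

def accept_reject_sample(target_pdf, proposal_rvs, proposal_pdf, M, n):
    """Draw x ~ q (proposal), accept it with probability p(x) / (M * q(x)),
    where M is chosen so that p(x) <= M * q(x) for all x."""
    samples = []
    while len(samples) < n:
        x = proposal_rvs()
        u = np.random.uniform()
        if u < target_pdf(x) / (M * proposal_pdf(x)):
            samples.append(x)
    return np.array(samples)

# Example: sample N(0, 1) using a wider N(0, 2) proposal; the density ratio
# p(x)/q(x) = 2 * exp(-3x^2/8) is bounded by 2, so M = 2.5 is a valid bound.
target = norm(0, 1)
proposal = norm(0, 2)
xs = accept_reject_sample(target.pdf, proposal.rvs, proposal.pdf, M=2.5, n=1000)
print(xs.mean(), xs.std())  # roughly 0 and 1
```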
Grouped Query Attention
- Purpose: save compute and, more importantly, shrink the KV cache at inference by letting a group of query heads share one key/value head (a middle ground between multi-head and multi-query attention); see the sketch below
- Predecessor architecture: Multi-Query Attention (MQA), where all query heads share a single key/value head
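A minimal PyTorch sketch of the grouped sharing (shapes and names are illustrative, not the Llama 3 implementation): K and V are produced with fewer heads, and each K/V head is broadcast to its group of query heads, reducing KV-cache size by a factor of `n_q_heads / n_kv_heads`.

```python
import torch
import torch.nn.functional as F

def grouped_query_attention(q, k, v):
    """q: (batch, n_q_heads, seq, head_dim); k, v: (batch, n_kv_heads, seq, head_dim),
    with n_kv_heads dividing n_q_heads. n_kv_heads == 1 recovers multi-query
    attention; n_kv_heads == n_q_heads recovers standard multi-head attention."""
    n_q_heads, n_kv_heads, d = q.shape[1], k.shape[1], q.shape[-1]
    group_size = n_q_heads // n_kv_heads
    # Repeat each K/V head so it serves every query head in its group.
    k = k.repeat_interleave(group_size, dim=1)
    v = v.repeat_interleave(group_size, dim=1)
    scores = q @ k.transpose(-2, -1) / d ** 0.5   # (batch, n_q_heads, seq, seq)
    return F.softmax(scores, dim=-1) @ v          # (batch, n_q_heads, seq, head_dim)

# e.g. 32 query heads sharing 8 K/V heads (groups of 4), as in Llama 3 8B
q = torch.randn(2, 32, 16, 128)
k = torch.randn(2, 8, 16, 128)
v = torch.randn(2, 8, 16, 128)
out = grouped_query_attention(q, k, v)
```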
Result
HorNet: Efficient High-Order Spatial Interactions with Recursive Gated Convolutions
ref: https://arxiv.org/abs/2207.14284
Check definition of depthwise convolution: https://www.youtube.com/watch?v=vVaRhZXovbw
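Quick PyTorch illustration of a depthwise convolution (set `groups=in_channels` so each channel gets its own spatial filter), typically followed by a 1x1 pointwise convolution for channel mixing; HorNet's recursive gated convolutions build on large-kernel depthwise convolutions of this kind.

```python
import torch
import torch.nn as nn

channels = 64
# Depthwise: one 7x7 filter per channel, no mixing across channels.
depthwise = nn.Conv2d(channels, channels, kernel_size=7, padding=3, groups=channels)
# Pointwise: 1x1 convolution that mixes channels.
pointwise = nn.Conv2d(channels, channels, kernel_size=1)

x = torch.randn(1, channels, 32, 32)
y = pointwise(depthwise(x))
print(y.shape)  # torch.Size([1, 64, 32, 32])
```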
Capability Forgetting (from GLM-4)
In RLHF, run as a post-training step after SFT, the authors also observed unexpected behavior in the policy: the model shows a reduced capability in handling specific scenarios, i.e. abilities of the SFT model are partly forgotten.
- reason
  - this behavior could be attributed to a mismatch between the training data distributions, or to the reward model failing to capture such nuanced details
- solution
  - to overcome this, the authors incorporate an extra supervised next-token prediction loss as an additional regularization term besides the KL divergence penalty in PPO; a sketch follows this list
  - this is intended to preserve the pre-existing abilities of the SFT model: RLHF keeps pushing the model's outputs toward human preferences, while the next-token prediction term captures more granular, token-level signals
  - the next-token prediction loss is computed on a small amount of human-annotated (prompt, response) pairs $\mathcal{D}_S$, which are specific to particular tasks and serve as supervised regularization (trying to shift the data distribution back toward the SFT data)
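A rough sketch of what the combined objective could look like; this is my reading of the description above, not the GLM-4 code, and `alpha`, the batch format, and the HF-style `.logits` access are assumptions.

```python
import torch
import torch.nn.functional as F

def regularized_ppo_loss(policy, ppo_policy_loss, kl_penalty, sft_batch, alpha=1.0):
    """total = PPO policy loss + KL-to-reference penalty
               + alpha * next-token-prediction loss on D_S (hypothetical weighting)."""
    input_ids = sft_batch["input_ids"]   # tokenized (prompt, response) pairs from D_S
    labels = sft_batch["labels"]         # prompt tokens masked with -100
    logits = policy(input_ids).logits    # (batch, seq, vocab)

    # Standard causal-LM shift: position t predicts token t+1.
    sft_loss = F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),
        labels[:, 1:].reshape(-1),
        ignore_index=-100,
    )
    return ppo_policy_loss + kl_penalty + alpha * sft_loss
```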
Direct Preference Optimization
can we just use cross-entropy instead of PPO? That is essentially DPO's idea: skip the explicit reward model and the RL loop, and train the policy directly on preference pairs with a logistic (binary cross-entropy style) loss on log-probability ratios against a frozen reference model.
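A minimal sketch of the DPO objective (variable names are mine): the inputs are per-example summed log-probabilities of the chosen and rejected responses under the trained policy and a frozen reference (usually the SFT) model, so no reward model or PPO rollouts are needed.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """-log sigmoid( beta * [ (log pi(y_w|x) - log pi_ref(y_w|x))
                              - (log pi(y_l|x) - log pi_ref(y_l|x)) ] )
    where y_w is the preferred response, y_l the rejected one, and beta
    controls how far the policy may drift from the reference."""
    chosen_margin = policy_logp_chosen - ref_logp_chosen
    rejected_margin = policy_logp_rejected - ref_logp_rejected
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()

# toy usage with precomputed log-probabilities for a batch of 4 preference pairs
loss = dpo_loss(torch.randn(4), torch.randn(4), torch.randn(4), torch.randn(4))
```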