Saturday, October 19, 2024

LLM, learning rates are different in pre-training and fine-tuning

 

https://www.youtube.com/watch?v=9vM4p9NN0Ts&t=3198s

In LLMs, learning rates differ between pre-training and fine-tuning, even though the loss function can be the same.
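As a rough sketch (the numbers below are common ballpark values I am assuming, not taken from the video), pre-training typically uses a larger peak learning rate with a long warmup-and-decay schedule, while fine-tuning the same model uses a much smaller learning rate for far fewer steps:

import torch

model = torch.nn.Linear(768, 768)  # stand-in for an LLM

# Pre-training: larger peak LR over a long schedule (illustrative values only).
pretrain_optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)

# Fine-tuning: much smaller LR so the pre-trained weights are only gently adjusted.
finetune_optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5, weight_decay=0.01)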

The slide covers Reinforcement Learning from Human Feedback (RLHF) and its implementation with Proximal Policy Optimization (PPO), focusing on how the reward is defined in this context. Here is a breakdown:


1. Idea: use reinforcement learning

The slide presents the concept of applying reinforcement learning to improve the model based on human feedback or preferences.

2. What is the reward?

In reinforcement learning, a “reward” is a signal used to guide the model to make better predictions. The slide discusses two options for determining the reward:

Option 1: Binary reward

This option checks whether the model's output is preferred over some baseline, producing a 0/1 reward. However, the slide notes that a binary reward carries little information, which limits how useful the feedback is.

Option 2: Train a reward model R

Instead of using a simple binary comparison, a reward model R is trained with a logistic-regression-style loss that classifies preferences between outputs i and j. This is more nuanced, as it yields a continuous measure of preference rather than a binary yes/no. The formula provided is:


p(i > j) = \frac{\exp(R(x, \hat{y}_i))}{\exp(R(x, \hat{y}_i)) + \exp(R(x, \hat{y}_j))}


This formula, attributed to Bradley and Terry (1952), gives the probability that one output is preferred over another in terms of their respective rewards R. This provides richer, more "information-heavy" feedback than a binary signal.
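Since the Bradley-Terry probability reduces to a sigmoid of the reward difference, the reward model can be trained by minimizing -log sigmoid(R_i - R_j) over human-labeled preference pairs. A minimal PyTorch sketch (reward_model, y_chosen, y_rejected are hypothetical placeholders, not names from the video):

import torch
import torch.nn.functional as F

def bradley_terry_loss(reward_chosen: torch.Tensor,
                       reward_rejected: torch.Tensor) -> torch.Tensor:
    # p(i > j) = exp(R_i) / (exp(R_i) + exp(R_j)) = sigmoid(R_i - R_j),
    # so maximizing log p(i > j) means minimizing -log sigmoid(R_i - R_j).
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Hypothetical usage: the reward model maps (prompt, response) to one scalar each.
# r_chosen   = reward_model(x, y_chosen)    # shape: (batch,)
# r_rejected = reward_model(x, y_rejected)  # shape: (batch,)
# loss = bradley_terry_loss(r_chosen, r_rejected)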

3. Optimization Process

The policy (the language model) is then optimized against this reward model using PPO (Proximal Policy Optimization), a reinforcement learning algorithm. The objective function is:


\mathbb{E}_{\hat{y} \sim p_\theta(\hat{y}|x)} \left[ R(x, \hat{y}) - \beta \log \frac{p_\theta(\hat{y}|x)}{p_{\text{ref}}(\hat{y}|x)} \right]


This objective balances maximizing the reward R(x, \hat{y}) against keeping the policy from deviating too far from a reference policy (hence the log-ratio, i.e. KL, term), with \beta acting as a regularization coefficient that prevents over-optimization of the reward model.
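A minimal sketch of this KL-regularized objective as a Monte Carlo estimate over sampled completions (logp_policy and logp_ref are assumed per-sequence log-probabilities; a real PPO implementation also works per token and adds clipping, value baselines, etc.):

import torch

def rlhf_objective(reward: torch.Tensor,
                   logp_policy: torch.Tensor,
                   logp_ref: torch.Tensor,
                   beta: float = 0.1) -> torch.Tensor:
    # R(x, y_hat) - beta * log( p_theta(y_hat|x) / p_ref(y_hat|x) ),
    # averaged over sampled y_hat ~ p_theta(.|x) to estimate the expectation.
    kl_term = logp_policy - logp_ref
    return (reward - beta * kl_term).mean()

# PPO maximizes this objective (equivalently, minimizes its negative)
# with respect to the policy parameters theta.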


In summary, this slide describes how RLHF can be implemented by training a reward model with a logistic-regression-style (Bradley-Terry) loss on preferences between outputs, yielding a continuous reward signal rather than a binary one, and then optimizing the policy against that reward with PPO, where a KL regularization term keeps the policy close to the reference model and avoids over-optimization.


