GRPO
Main idea
The key point is to understand the pictures below.
Iteration steps
- for each input, generate $G$ outputs
- for each output, compute the token log-probabilities (logits_prob) under the current, old, and reference models
- calculate the objective value and use it as the loss (a minimal sketch of one update step follows this list)
- update the old model at each step
- update the reference model at each epoch
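The sketch below illustrates one such update step for a single input. It is not the huggingface-trl implementation; the tensor names, shapes, and constant values are made up for illustration, and the per-token log-probabilities are random stand-ins for what the three models would actually produce.

```python
# Minimal sketch of one GRPO update step for a single input, assuming we already
# have per-token log-probabilities of the G outputs under the current, old, and
# reference models. All names and values are illustrative.
import torch

G, T = 4, 8                      # G outputs per input, T tokens per output (padded)
epsilon, beta = 0.2, 0.04        # clip range and KL weight

# Dummy inputs: per-token log-probs (G x T) and one scalar reward per output.
logp_current = torch.randn(G, T, requires_grad=True)
logp_old = logp_current.detach() + 0.01 * torch.randn(G, T)
logp_ref = logp_current.detach() + 0.05 * torch.randn(G, T)
rewards = torch.randn(G)

# Group-relative advantage: normalize rewards within the group,
# then broadcast the same advantage to every token of that output.
advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
advantages = advantages.unsqueeze(1).expand(G, T)

# Clipped surrogate term (PPO-style ratio between current and old model).
ratio = torch.exp(logp_current - logp_old)
clipped = torch.clamp(ratio, 1 - epsilon, 1 + epsilon)
surrogate = torch.min(ratio * advantages, clipped * advantages)

# Per-token KL penalty against the reference model (k3 estimator).
kl = torch.exp(logp_ref - logp_current) - (logp_ref - logp_current) - 1

# The objective is maximized, so the loss is its negative.
loss = -(surrogate - beta * kl).mean()
loss.backward()                  # gradients flow into the current model only
print(loss.item())
```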
Objective function
- $G$ is the number of outputs in each group for each input
- $O_i$ is the i-th output in the current group
- $t$ is the index of a token in $O_i$
- $q$ is the input
- $O_{i,t}$ is the t-th token of the i-th output
- $\pi_\theta$ is the model (policy) with parameters $\theta$
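Putting these symbols together, and assuming the picture shows the standard GRPO objective from the DeepSeekMath paper, it can be written as:

$$
\mathcal{J}_{\text{GRPO}}(\theta) = \mathbb{E}\!\left[\frac{1}{G}\sum_{i=1}^{G}\frac{1}{|O_i|}\sum_{t=1}^{|O_i|}\left(\min\!\left(r_{i,t}\,\hat{A}_{i,t},\ \operatorname{clip}\!\left(r_{i,t},\,1-\epsilon,\,1+\epsilon\right)\hat{A}_{i,t}\right)-\beta\, D_{KL}\!\left[\pi_\theta \,\|\, \pi_{\text{ref}}\right]\right)\right],
\qquad r_{i,t} = \frac{\pi_\theta(O_{i,t}\mid q, O_{i,<t})}{\pi_{\theta_{\text{old}}}(O_{i,t}\mid q, O_{i,<t})}
$$

where $\hat{A}_{i,t}$ is the group-relative advantage, $\epsilon$ the clip range, and $\beta$ the weight of the KL term (both appear again in the hyper-parameter table below).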
KL value
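The KL term in GRPO is usually estimated per token with the unbiased (k3) estimator; assuming that is what this picture shows, it reads:

$$
D_{KL}\!\left[\pi_\theta \,\|\, \pi_{\text{ref}}\right] = \frac{\pi_{\text{ref}}(O_{i,t}\mid q, O_{i,<t})}{\pi_\theta(O_{i,t}\mid q, O_{i,<t})} - \log\frac{\pi_{\text{ref}}(O_{i,t}\mid q, O_{i,<t})}{\pi_\theta(O_{i,t}\mid q, O_{i,<t})} - 1
$$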
Hyper parameters
| Name in huggingface-trl | Description |
| --- | --- |
| beta | Weight of the KL term between the current model and the reference model; increase it to reduce over-fitting. |
| num_iterations | Number of GRPO iterations per batch, i.e. the inner-loop count in the Algorithm 1 picture. |
| epsilon | Clip value used for both the lower bound and the upper bound. |
| epsilon_high | Replaces epsilon for the clip upper bound when it is set. |
| sync_ref_model | bool; whether to synchronize the reference model with the active model every ref_model_sync_steps steps, using the ref_model_mixup_alpha parameter. |
| ref_model_mixup_alpha | float, default 0.6; the reference model is updated as π_ref = α * π_θ + (1 - α) * π_ref_prev. |
| ref_model_sync_steps | int, default 512; to use this parameter, you must set sync_ref_model=True. |
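As a rough sketch of how these map onto huggingface-trl, the snippet below sets them on a GRPOConfig. The values are just examples, output_dir is a standard TrainingArguments field, and argument availability (e.g. epsilon_high) depends on your trl version.

```python
# Hedged sketch: setting the hyper-parameters from the table above in trl's GRPOConfig.
from trl import GRPOConfig

config = GRPOConfig(
    output_dir="grpo-demo",       # standard TrainingArguments field
    beta=0.04,                    # weight of the KL term vs. the reference model
    num_iterations=1,             # GRPO iterations per batch
    epsilon=0.2,                  # clip lower/upper bound
    epsilon_high=0.28,            # overrides epsilon for the upper bound
    sync_ref_model=True,          # periodically refresh the reference model
    ref_model_mixup_alpha=0.6,    # pi_ref = alpha * pi_theta + (1 - alpha) * pi_ref_prev
    ref_model_sync_steps=512,     # refresh interval in steps
)
```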
FAQ
Q: How to cold start?
A: In the first step we already know the advantage of each output, and that is enough to push the parameter update toward making the objective value as large as possible.
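For context (this is the standard GRPO formulation from the DeepSeekMath paper, not something spelled out in this post), the advantage of each output is computed group-relatively from the rewards:

$$
\hat{A}_{i,t} = \frac{r_i - \operatorname{mean}\!\left(\{r_1, \dots, r_G\}\right)}{\operatorname{std}\!\left(\{r_1, \dots, r_G\}\right)}
$$

so even at cold start, outputs with above-average reward within the group get a positive advantage and below-average ones a negative advantage.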
Q: How to simplify the zoom up/down in the objective function?