Figuring out and posting about how RLHF and other methods ([online] decision transformer, IDA, rejection sampling, etc.) modify the nature of simulators is very high priority. There’s an ongoing research project at Conjecture specifically about this, which is the main reason I didn’t emphasize it as a future topic in this sequence. Hopefully we’ll put out a post about our preliminary theoretical and empirical findings soon.
Some interesting threads:
RL with KL penalties better seen as Bayesian inference shows that the optimal policy when you hit a GPT with RL with a KL penalty weighted by 1 is actually equivalent to conditioning the policy on a criterion estimated by the reward model, which is compatible with the simulator formalism.
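To make the equivalence explicit, here is a minimal sketch of that result, writing the reward as the log-likelihood of a criterion c (an illustrative assumption, not how the reward model is literally trained):

```latex
% KL-regularized RL objective over full sequences x, with base model \pi_0 and KL weight \beta
\begin{align}
  J(\pi) &= \mathbb{E}_{x \sim \pi}\big[r(x)\big] - \beta\,\mathrm{KL}\big(\pi \,\|\, \pi_0\big) \\
  % its maximizer is the exponentially tilted base distribution
  \pi^{*}(x) &= \frac{1}{Z}\,\pi_0(x)\,\exp\!\big(r(x)/\beta\big) \\
  % with \beta = 1 and r(x) = \log p(c \mid x), this is exactly Bayesian
  % conditioning of the base model on the criterion c:
  \pi^{*}(x) &\propto \pi_0(x)\,p(c \mid x) \propto \pi_0(x \mid c)
\end{align}
```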
However, this doesn’t happen in current practice, because:
1. Both OAI and Anthropic use very small KL penalties (e.g. weighted by 0.001 in Anthropic’s paper, which in the Bayesian inference framework means updating on the “evidence” 1000 times), or maybe none at all (see the toy sketch after this list).
2. Early stopping: the RL training does not converge to anything near optimality. Path dependence, distribution shift, and inductive biases during RL training seem likely to play a major role in the shape of the posterior policy.
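A toy numerical sketch of point 1 (purely illustrative: the base distribution and rewards below are made up, not taken from any real model):

```python
# Toy illustration of how the KL weight beta shapes the optimal tilted policy
#   pi*(x) ∝ pi_0(x) * exp(r(x) / beta).
# A small beta acts like raising the "likelihood" exp(r(x)) to the power 1/beta,
# i.e. updating on the same evidence 1/beta times.
import numpy as np

rng = np.random.default_rng(0)
pi_0 = rng.dirichlet(np.ones(10))         # toy base policy over 10 "sequences"
r = np.log(rng.uniform(0.01, 1.0, 10))    # toy reward = log-likelihood of a criterion

def tilted_policy(pi_0, r, beta):
    """Maximizer of E[r] - beta * KL(pi || pi_0): pi*(x) ∝ pi_0(x) exp(r(x)/beta)."""
    logits = np.log(pi_0) + r / beta
    logits -= logits.max()                # numerical stability
    p = np.exp(logits)
    return p / p.sum()

for beta in [1.0, 0.1, 0.001]:
    p = tilted_policy(pi_0, r, beta)
    entropy = -(p * np.log(p + 1e-300)).sum()
    print(f"beta={beta:7.3f}  max prob={p.max():.3f}  entropy={entropy:.3f} nats")

# beta = 1 keeps pi* close to pi_0 conditioned on the criterion, while
# beta = 0.001 collapses nearly all probability mass onto the single
# highest-scoring sequence, i.e. an almost deterministic policy.
```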
We see empirically that RLHF models (like OAI’s instruct-tuned models) do not behave like the original policy conditioned on a natural criterion (e.g. they often become almost deterministic).
Maybe there is a way to do RLHF while preserving the simulator nature of the policy, but the way OAI/Anthropic are doing it now does not, imo.