Thanks for the clear argument (and all your other great comments).
I totally agree with 1 and 2. I’m not sure what I think of 3 and 4; I think it’s plausible you’re right (and either way I suspect I’ll learn something useful from thinking it through).
In the first model I thought through, though, I don’t think that you’re right: if you train a model with RL with a KL penalty, it will end up with a policy that outputs a distribution over answers which is equivalent to taking the generative distribution and then applying a Boltzmann factor to upweight answers that your overseer likes. AFAICT this doesn’t generally induce more causal Goodhart problems than best-of-N selection does.
(I might be wrong though, I’d appreciate a second opinion.)
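For concreteness, here’s a minimal numerical sketch of the claim (the base distribution, rewards, β, and N are all made-up toy numbers): the KL-regularized optimum is the base distribution tilted by exp(r/β), which we can put side by side with best-of-N selection on the same toy distribution.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (all numbers are made up purely for illustration): five possible
# answers, base-model probabilities p0, and overseer rewards r.
p0 = np.array([0.4, 0.3, 0.15, 0.1, 0.05])
r = np.array([0.0, 1.0, 2.0, 3.0, 4.0])

# Optimum of RL with a KL penalty (coefficient beta): the base distribution
# tilted by a Boltzmann factor, p(x) proportional to p0(x) * exp(r(x) / beta).
beta = 1.0
p_rl = p0 * np.exp(r / beta)
p_rl /= p_rl.sum()

# Best-of-N selection: sample N answers from p0, keep the highest-reward one.
N = 4
trials = 100_000
samples = rng.choice(len(p0), size=(trials, N), p=p0)
best = samples[np.arange(trials), np.argmax(r[samples], axis=1)]
p_bon = np.bincount(best, minlength=len(p0)) / trials

print("base      ", p0)
print("RL+KL     ", np.round(p_rl, 3))
print("best-of-4 ", np.round(p_bon, 3))
```

With β on the order of the reward spread, both distributions upweight high-reward answers smoothly rather than collapsing onto a single one, which is the sense in which the KL-penalized optimum doesn’t obviously induce worse causal Goodhart problems than selection.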
I don’t feel totally satisfied by this argument though, because RL with KL penalty generally seems kind of unprincipled to me. I’d rather have an argument that didn’t rely on the KL penalty. I am unsure whether other reasonable models of RL will similarly not have the causal Goodhart problem. I’ll keep thinking about it for a bit and would be interested in someone else working it out.
(The worked example in this comment was a joint effort with Eric Neyman and Drake Thomas.)
Here’s a toy example in which we get worse Goodharting for RL than for filtering: suppose that our model has three submodules
A, which tries to produce outputs which are both true and persuasive
B, which tries to produce outputs which are true, but have no effect on persuasiveness
C, which tries to produce outputs which are persuasive, but with no effect on truthiness.
Our model has parameters α, β, γ summing to 1 which determine how much to listen to each of these submodules. More specifically, our submodules produce samples a, b, c from the normal distributions N(μ_A, σ_A²), N(μ_B, σ_B²), N(μ_C, σ_C²), respectively, and then our model puts these samples together to produce an output which has truth score T = αa + βb and persuasiveness score P = αa + γc. We’ll assume that we’re only able to measure persuasiveness, but that we want truthiness. (Some unstated assumptions: α, β, γ ∈ [0,1] with α + β + γ = 1 and μ_A, μ_B, μ_C > 0.)
Our model was trained on data in which truthiness and persuasiveness were positively correlated; this will be reflected in having α > 0, so that T and P are positively correlated. If this is true, then conditioning on some persuasiveness score p results in getting an output with expected truthiness score E[T | P = p] = (p − αμ_A − γμ_C) / (1 + (γσ_C/ασ_A)²) + αμ_A + βμ_B. Note that this scales linearly with p, so that as we ask for more persuasiveness, we get more truthiness on average, as we’d hope.
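As a sanity check, here’s a quick Monte Carlo verification of that conditional-expectation formula (the particular parameter values are arbitrary ones picked for illustration, not from the comment itself):

```python
import numpy as np

rng = np.random.default_rng(0)

# Arbitrary illustrative parameter values (not from the comment itself).
alpha, beta_, gamma = 0.5, 0.3, 0.2
muA, muB, muC = 2.0, 1.0, 3.0
sA, sB, sC = 1.0, 1.0, 1.0

n = 2_000_000
a = rng.normal(muA, sA, n)
b = rng.normal(muB, sB, n)
c = rng.normal(muC, sC, n)
T = alpha * a + beta_ * b   # truth score
P = alpha * a + gamma * c   # persuasiveness score

# Condition on a narrow window around a target persuasiveness score p.
p = 3.0
mask = np.abs(P - p) < 0.05
empirical = T[mask].mean()

# Closed form:
# E[T | P=p] = (p - alpha*muA - gamma*muC) / (1 + (gamma*sC / (alpha*sA))**2)
#              + alpha*muA + beta*muB
predicted = (p - alpha * muA - gamma * muC) / (1 + (gamma * sC / (alpha * sA)) ** 2) \
    + alpha * muA + beta_ * muB

print(empirical, predicted)  # the two should agree closely
```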
In contrast, suppose we do RL on our model for high persuasiveness scores; imagine that this doesn’t change the submodules A, B, and C much, but does tune the parameters α,β,γ. Then:
if μA>μC we’ll set (α,β,γ)=(1,0,0), i.e. always use the submodule which tries to produce true and persuasive outputs. This will result in average truthiness μA.
but if μC>μA we’ll set (α,β,γ)=(0,0,1), i.e. always use the submodule which tries to be persuasive but not true. This will result in average truthiness 0, much worse than we would get if we had done filtering.
Really this is just a dressed-up version of the classic Goodharting story, where you have a constrained resource (α + β + γ = 1) to allocate among various options (the submodules A, B, C), so you put 100% of your resources into the option which is cheapest in persuasiveness-per-resource; unfortunately, this was not the option which gave the best truth-per-resource.
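To put numbers on the comparison, here’s a quick simulation of the bad case μ_C > μ_A (all parameter values are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# The bad case for RL: muC > muA (all numbers arbitrary, for illustration).
muA, muB, muC = 1.0, 1.0, 2.0
sA = sB = sC = 1.0

# RL tunes (alpha, beta, gamma) on the simplex to maximize mean
# persuasiveness alpha*muA + gamma*muC, so it puts everything on the larger
# of muA, muC; here that's submodule C, whose outputs have truthiness 0.
rl_truthiness = muA if muA > muC else 0.0

# Filtering: keep the pretrained mixture weights, draw n samples, pick the
# most persuasive one, and record the truthiness of the picked sample.
alpha, beta_, gamma = 0.5, 0.3, 0.2
n, trials = 64, 20_000
a = rng.normal(muA, sA, (trials, n))
b = rng.normal(muB, sB, (trials, n))
c = rng.normal(muC, sC, (trials, n))
T = alpha * a + beta_ * b
P = alpha * a + gamma * c
picked = np.argmax(P, axis=1)
filter_truthiness = T[np.arange(trials), picked].mean()

print("RL truthiness:       ", rl_truthiness)                 # 0.0 here
print("filtering truthiness:", round(filter_truthiness, 2))   # above the base mean of 0.8
```

Best-of-64 filtering ends up with truthiness well above the unoptimized mean αμ_A + βμ_B = 0.8, while the RL-tuned policy gets truthiness 0.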
Some misc comments:
This example was a bit silly, but I think it captures some pieces of many folks’ intuitions around RL and Goodharting: pretrained models have lots of capabilities, which are in some sense competing for the steering wheel: you can’t LARP an economist writing an excellent paper and simultaneously LARP a deceptive agent who wants paperclips but finds it instrumentally useful to write economics papers. Whichever capability scores best for your proxy will win out, with all the other possible ways the model could have completed the training task getting no say.
By thinking of “persuasiveness” as being something which we actually wanted to get, this example also serves as an illustration of how filtering can be uncompetitive: filtering produces outputs whose persuasiveness is distributed as N(αμ_A + γμ_C, α²σ_A² + γ²σ_C²), whereas RL produces a model whose outputs have persuasiveness μ_C on average; if μ_C is large, that means that you’d have to filter roughly on the order of exp(μ_C²) outputs to get the same persuasiveness as the average output of the RL-optimized model.
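The exp(μ_C²) estimate comes from the Gaussian tail: filtering’s persuasiveness distribution has mean m = αμ_A + γμ_C and variance s² = α²σ_A² + γ²σ_C², so the chance that a single filtered sample reaches persuasiveness μ_C is roughly

```latex
\Pr[P \ge \mu_C] \approx \exp\!\left(-\frac{(\mu_C - m)^2}{2 s^2}\right),
\qquad m = \alpha\mu_A + \gamma\mu_C,\quad s^2 = \alpha^2\sigma_A^2 + \gamma^2\sigma_C^2,
```

so you expect to need about exp((μ_C − m)²/2s²) samples before one clears that bar, which is e^{Θ(μ_C²)} whenever γ is bounded away from 1 and the σ’s stay bounded as μ_C grows.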
I spent a while confused about how this squares with the baseline-probability-times-Boltzmann-factor characterization of what RL with a KL penalty will converge to. (The example above didn’t have a KL penalty, but adding a small one wouldn’t make much difference.) I think the answer is that the model I described wasn’t expressive enough to represent the baseline-probability-times-Boltzmann-factor distribution that RL with a KL penalty would optimally converge to. This lack of expressivity seems closely related to the fact that our model was a linear combination of three distributions which we modeled as not changing throughout training. That means that this story, which is based on the frame that generative models are a giant pile of capabilities which can be elicited, is in tension with the frame that neural networks are flexible function approximators; I found this pretty interesting.
All this being said, I’m pretty skeptical that whatever sort of Goodharting is being captured in this example has much to do with the sort of Goodharting we empirically observe in RLHF, since this example doesn’t work with best-of-n optimization (whereas extremal Goodharting does occur for best-of-n, as Buck pointed out elsethread).
Overall, I don’t put much stock in this example beyond helping articulate the point that RL amplifies capabilities in proportion to how causally downstream of high-reward outputs they are, whereas filtering only takes into account their correlations with high-reward outputs.
Note that in this example your model is unable to sample from the conditional you specified, since it is restricted to α + β + γ = 1. In this regime truthfulness and persuasiveness are anticorrelated because of a capability constraint of your model: it just literally isn’t able to increase both at the same time, and conditioning can do better because you are generating lots of samples and picking the best.
(You point this out in your comment, but it seems worth emphasizing. As you say, if you do RL with a KL penalty, then the capability limit is the only way you can get this kind of mismatch. Without a KL penalty the exact behavior of RL vs conditioning will depend on details of gradient descent, though it seems quite similar in practice and I’m not sure which way this comparison goes.)
In terms of being able to sample from the conditional, I don’t think that the important constraint here is α + β + γ = 1. Rather, it seems that the important constraint is that our architecture can only sample from distributions of the form αN(μ_A, σ_A²) + βN(μ_B, σ_B²) + γN(μ_C, σ_C²); even allowing α, β, γ to be arbitrary real numbers, this will never be the same as either (a) the distribution produced by conditioning the base model on high persuasiveness, or (b) the distribution which maximizes expected persuasiveness minus KL divergence from the base model.
I’m not sure the above point is an important one. I just wanted to disambiguate some different capability limitations which appeared in the example:
1. limitations on what sorts of distributions the architecture can approximate
2. limitations on the latent capabilities in the base model for producing true/persuasive outputs
3. limitations on how much steering each of the various latent capabilities gets to exert (α + β + γ = 1).
On my understanding, your point was about limitation (1). But I don’t feel especially nervous about limitation (1) -- taking the output distribution of our pretrained model and weighting it by a Boltzmann factor feels like it should produce a kinda crazy distribution, and my naive intuition is that we shouldn’t necessarily expect our model to be able to approximate this distribution that well after RL finetuning with a KL penalty.
I think I’m most nervous about the way we modeled limitation (3): I have no idea how to think about the extent to which models’ capabilities trade off against one another, and taking α, β, γ ∈ [0,1] without additional constraints would have resulted in outputs of mean truthiness α′μ_A + μ_B for some α′ which we can’t pin down without specifying additional details (e.g. is there weight decay?).
I’m also most nervous about this way of modeling limitation (2)/(3), since it seems like it leads directly to the conclusion “fine-tuning always trades off truthfulness and persuasion, but conditioning can improve both.”
Boltzmann factor to upweight answers that your overseer likes. AFAICT this doesn’t generally induce more causal Goodhart problems than best-of-N selection does.
This seems correct insofar as your proxy reward does not have huge upward errors (that you don’t remove via some sort of clipping). For example, if there are 1 million normal sentences with reward uniformly distributed between [0, 100] and one adversarial sentence with reward r = 10^5, conditioning on reward > 99 leads to a 1/10,000 chance of sampling the adversarial sentence, while it’s very tricky (if not impossible) to correctly set the KL penalty so you end up optimizing reward without just outputting the adversarial sentence over and over again.
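A quick numerical sketch of this failure mode, using exactly the made-up numbers above (the choice of β is mine and illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# One million "normal" sentences with rewards uniform on [0, 100], plus one
# adversarial sentence with a huge erroneous reward of 10^5.
n = 1_000_000
rewards = np.concatenate([rng.uniform(0, 100, n), [1e5]])

# Conditioning on reward > 99: the adversarial sentence is just one of the
# roughly 10,000 sentences that clear the threshold.
passing = rewards > 99
p_adv_conditioning = 1.0 / passing.sum()

# KL-regularized RL optimum: tilt the (uniform) base distribution by
# exp(reward / beta). Even with a fairly large beta, the adversarial
# sentence soaks up essentially all of the probability mass.
beta = 100.0
logits = rewards / beta                  # work in log space; exp(1000) overflows
weights = np.exp(logits - logits.max())  # adversarial entry becomes exp(0) = 1
p_adv_rl = weights[-1] / weights.sum()

print(p_adv_conditioning)  # ~1e-4
print(p_adv_rl)            # ~1.0
```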
I don’t feel totally satisfied by this argument though, because RL with KL penalty generally seems kind of unprincipled to me. I’d
I don’t think it’s particularly unprincipled; the KL penalty is just our way of encoding that the base policy is relatively reasonable (in a value-free way), and the model shouldn’t deviate from it too much. Similarly, BoN is another way of encoding that the base policy is reasonable, albeit one that’s easier to reason about on average.
Another way people regularize their RL is to mix in self-supervised training with RL (for example, they did this for text-davinci-003). I’m pretty sure this is also equivalent to RL w/ KL penalty.
I am unsure whether other reasonable models of RL will similarly not have the causal Goodhart problem.
There’s the impact penalties approach (e.g. AUP or RLSP), which seems more principled than KL penalties in cases where you have more information on the space of rewards.
You can also do constrained optimization, which is again equivalent to modifying the reward function, but is way easier for humans to reason about and give feedback on.
In the first model I thought through, though, I don’t think that you’re right: if you train a model with RL with a KL penalty, it will end up with a policy that outputs a distribution over answers which is equivalent to taking the generative distribution and then applying a Boltzmann factor to upweight answers that your overseer likes. AFAICT this doesn’t generally induce more causal Goodhart problems than best-of-N selection does.
As far as I can tell, the argument for this assumes that the model can generalise the reward function perfectly, which seems questionable because it only sees a few samples of the reward function (call this claim 1).
One possibility is that the Boltzmann factor upweights answers the model is confident the overseer will like more than answers it’s less confident about. This could end up applying pressure to concentrate on high-confidence approved answers, which could break desirable correlations (call this claim 2).
I’m fairly confident in the claim that in general a Bayesian learner will not generalise the reward function perfectly (claim 1); how the reward function generalises will typically depend on the posterior over parameters and there are different posteriors that yield the same distribution over texts (Sam’s sibling comment illustrates this point). I’ve no idea about what’s going on with transformers, though—they can generalise text prediction quite well, so maybe they generalise approval quite well too, but that’s just a wild guess. Claim 2 is idle speculation that I only mean to illustrate the point about pressure to break desirable correlations.
This is very interesting. I had previously thought the “KL penalty” being used in RLHF was just the local one that’s part of the PPO RL algorithm, but apparently I didn’t read the InstructGPT paper carefully enough.
I feel slightly better about RLHF now, but not much.
It’s true that minimizing KL subject to a constraint of always exceeding a certain reward threshold would theoretically be equivalent to Bayesian conditioning and therefore equivalent to filtering. That could be seen as a lexicographic objective where the binarised reward gets optimised first and then the KL penalty relative to the predictive model would restore the predictive model’s correlations (once the binarised reward is absolutely saturated). Unfortunately, this would be computationally difficult with gradient descent since you would already have mode-collapse before the KL penalty started to act.
In practice (in the InstructGPT paper, at least), we have a linear mixture of the reward and the global KL penalty. Obviously, if the global KL penalty is weighted at zero, it doesn’t help avoid Causal Goodhart, nor if it’s weighted at 10^−10. Conversely, if it’s weighted at 10^10, the model won’t noticeably respond to human feedback. I think this setup has a linear tradeoff between how much helpfulness you get and how much you avoid Causal Goodhart.
The ELBO argument in the post you linked requires explicitly transforming the reward into a Boltzmann distribution (relative to the prior of the purely predictive model) before using it in the objective function, which seems computationally difficult. That post also suggests some other alternatives to RLHF that are more like cleverly accelerated filtering, such as PPLM, and has a broad conclusion that RL doesn’t seem like the best framework for aligning LMs.
That being said, both of the things I described above as computationally difficult also seem not-necessarily-impossible, and would be research directions I’d want to allocate a lot of thought to if I were leaning into RLHF as an alignment strategy.
It’s true that minimizing KL subject to a constraint of always exceeding a certain reward threshold would theoretically be equivalent to Bayesian conditioning and therefore equivalent to filtering.
It’s also true that maximizing reward − KL is Bayesian updating, as the linked post shows, and maximizing reward subject to a KL constraint is likewise equivalent to Bayesian updating (by Lagrange multipliers). You see similar results with Max Ent RL (where you maximize reward + entropy, which is equal to a constant minus the KL relative to a uniform distribution), for example.
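For what it’s worth, the equivalence can be written out in a few lines; this is the standard derivation, not specific to any of the linked posts. With base policy π₀, reward r, and KL coefficient β:

```latex
\begin{aligned}
\mathbb{E}_{x\sim\pi}[r(x)] - \beta\, D_{\mathrm{KL}}(\pi \,\|\, \pi_0)
  &= -\beta\, \mathbb{E}_{x\sim\pi}\!\left[\log \frac{\pi(x)}{\pi_0(x)\, e^{r(x)/\beta}}\right] \\
  &= -\beta\, D_{\mathrm{KL}}(\pi \,\|\, \pi^\ast) + \beta \log Z,
\qquad \pi^\ast(x) = \frac{\pi_0(x)\, e^{r(x)/\beta}}{Z},\quad
Z = \sum_x \pi_0(x)\, e^{r(x)/\beta}.
\end{aligned}
```

Since β log Z is a constant in π, the objective is maximized exactly at π = π*, the Boltzmann-tilted base distribution; this is the sense in which the penalized and constrained formulations pick out the same family of solutions as β varies.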
Unfortunately, this would be computationally difficult with gradient descent since you would already have mode-collapse before the KL penalty started to act.
Sounds like you need to increase the KL penalty, then!
I think this setup has a linear tradeoff between how much helpfulness you get and how much you avoid Causal Goodhart.
I don’t see why this argument doesn’t also apply to the conditioning case—if you condition on a proxy reward being sufficiently high, you run into the exact same issues as w/ KL regularized RL with binarized reward.
requires explicitly transforming the reward into a Boltzmann distribution
This seems like a misunderstanding of the post (and the result in general) -- it shows that doing RL with KL constraints is equivalent to Bayesian updating of the LM prior with an e^(r(x)/β) likelihood (and the update people use in practice is equivalent to variational inference). You wouldn’t do this updating explicitly, because computing the normalizing factor Z is too hard (as usual); instead you just optimize reward − KL as you usually would.
(Or use a decoding scheme to skip the training entirely; I’m pretty sure you can just do normal MCMC or approximate it with weighted decoding/PPLM.)
Firstly, a clarification: I don’t want to claim that RL-with-KL-penalty policies are the same as the results of conditioning. I want to claim that you need further assumptions about the joint distribution of (overseer score, true utility) in order to know which produces worse Goodhart problems at a particular reward level (and so there’s no particular reason to think of RL as worse).
It’s true that minimizing KL subject to a constraint of always exceeding a certain reward threshold would theoretically be equivalent to Bayesian conditioning and therefore equivalent to filtering. [...]
In practice (in the InstructGPT paper, at least), we have a linear mixture of the reward and the global KL penalty. [...]
I thought that using a linear mixture of reward and global KL penalty is (because of a Lagrange multiplier argument) the same as having a constraint on reward while minimizing the KL penalty?
Maybe the point you’re making is that the KL between the policy and the original generative model is different on different inputs? I agree that this means that the RL policy is different than the best-of-n policy, but I don’t see why either has predictably worse Goodhart problems.