> Strongly disagree that this reward signal is necessary or sufficient to solve ELK
I’m not sure where our disagreement is, but I think I may have gotten too loose with my phrasing in that quote, so let me rephrase into something I’m happier to defend:
1. One hypothetical strategy to solve ELK is "RL train the agent where it gets +1 reward for each accurate report of its beliefs".
2. If you try this, you will have a lot of difficulty when you have to evaluate the reward function. In particular, you'll need to determine if the AI accurately reported its belief—which is to say you'd need to have solved ELK.
3. Because of (2), (1) is infeasible as a solution to ELK.
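The circularity in (2) can be made concrete with a sketch (all names here are hypothetical, for illustration only; this is not anyone's proposed implementation):

```python
def reward(question, answer, ai_internal_state):
    """Scheme (1): +1 for each accurate report of the AI's beliefs."""
    believed_answer = extract_latent_belief(ai_internal_state, question)
    return 1.0 if answer == believed_answer else 0.0

def extract_latent_belief(ai_internal_state, question):
    # To grade "did the AI accurately report its belief?", we have to read
    # the belief out of the AI's internals. A reliable implementation of
    # this function is precisely what a solution to ELK would provide.
    raise NotImplementedError("this is the open ELK problem")
```

The reward function type-checks, but its key subroutine is the very extraction problem the strategy was supposed to solve.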
> Reward signals transduce into e.g. a value head in PPO
I’m not as familiar as I’d like to be with PPO, but that’s really cool! Could you link to a source where they show this about value heads? (I didn’t see anything about value heads or PPO in your linked texts.)
> It seems totally plausible to me that you e.g. just provide sparse positive reward on truthful answers, and the policy gradients accrete into a finetuned model which accurately reports its beliefs.
Are you saying that instead of (1) above, you could do a (1′), which is "RL train the agent where it gets +100 reward for each accurate report of its beliefs 1% of the time, and +0 reward the other 99% of the time"? I agree (1′) will learn the same policies as (1), but I think the objections in (2) and (3) will still apply. Two further caveats:

- You'd need to make sure the set of true statements that give reward is a representative sample. If you only check easy-to-verify things, the AI might learn "give correct answers to simple factual questions, but hallucinate answers to everything else".
- As I understand it, sparse reward slows down training, e.g. "It results in a very sparse reward signal which will make learning slow". (But on the third hand, 1/100 is probably not that sparse to such an AI.)
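The sense in which (1′) matches (1) is just expected value; a toy check with the numbers from the scheme above:

```python
# Scheme (1):  +1 for every accurate report.
# Scheme (1'): +100 for an accurate report 1% of the time, +0 otherwise.
# Per accurate report, both give the same reward in expectation, so an
# agent optimizing expected return is pushed toward the same policies.

dense_reward = 1.0
sparse_reward, sparse_prob = 100.0, 0.01

expected_dense = dense_reward                   # E[r] = 1
expected_sparse = sparse_reward * sparse_prob   # E[r] = 100 * 0.01 = 1

assert abs(expected_dense - expected_sparse) < 1e-9
```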
Thanks for the additional effort and rephrasing!

> 3) Because of (2), (1) is infeasible as a solution to ELK.

Disagree, same as before.

> I'm not as familiar as I'd like to be with PPO, but that's really cool! Could you link to a source where they show this about value heads? (I didn't see anything about value heads or PPO in your linked texts.)
This is actually a consequence of the PPO update equation itself; see eq. 12 in the original paper. Basically, the advantage of policy π taking action a in state s to end up in new state s′ is the on-policy TD error A^π(s, a) := (R(s) + γ V^π(s′)) − V^π(s). The PPO update is proportional to the advantage, with additional terms for policy clipping so that the updated policy doesn't go too far astray from the current policy.
So in a sparse reward regime (where R(s) is usually 0), most of the advantages are computed only as a function of the value estimator V^π. The value estimator is itself usually a linear head on top of the base network, and it's trained via RL.
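These two points can be sketched concretely (a toy example with made-up numbers, assuming the one-step TD advantage above and PPO's standard clipped surrogate):

```python
import numpy as np

GAMMA, EPS = 0.99, 0.2

def td_advantages(rewards, values, gamma=GAMMA):
    """One-step TD-error advantages: A_t = R(s_t) + gamma*V(s_{t+1}) - V(s_t).

    `values` holds the value head's estimates and has one extra entry
    for the terminal state.
    """
    rewards = np.asarray(rewards, dtype=float)
    values = np.asarray(values, dtype=float)
    return rewards + gamma * values[1:] - values[:-1]

def clipped_surrogate(ratio, advantage, eps=EPS):
    """PPO's clipped objective: min(r*A, clip(r, 1-eps, 1+eps)*A)."""
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
    return np.minimum(ratio * advantage, clipped * advantage)

# Sparse-reward episode: reward only at the final step.
rewards = [0.0, 0.0, 0.0, 1.0]
values  = [0.2, 0.4, 0.6, 0.9, 0.0]   # value-head estimates; V(terminal) = 0

adv = td_advantages(rewards, values)
# At every zero-reward step the advantage is gamma*V(s') - V(s):
# it comes entirely from the value head, not from the reward function.

ratios = np.array([1.1, 0.9, 1.4, 1.0])  # pi_new(a|s) / pi_old(a|s) per step
objective = clipped_surrogate(ratios, adv)
# Clipping caps how far a single large ratio can push the policy.
```

Only the last timestep's advantage touches R directly; every other gradient contribution is mediated by the learned value estimates.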
The point of all this is that in the sparse reward regime using a common modern algorithm like PPO (as often used in RLHF), almost none of the policy gradients come directly from the reward function. Instead, we have to consider how reward events will train a value head, which concurrently trains a policy.
So "the policy is optimized as a (mostly direct) function of how much reward it gets" is a highly non-trivial claim, and it might just be wrong. The way a policy is optimized to output certain actions is not so simple as "if the reward function doesn't grade all the events properly, then the policy will be selected to exploit it". Rather, reward events reinforce certain computational circuits in the value head/network, which in turn reinforce and chisel certain circuits into the policy portion of the network. That's what's really happening, mechanistically.
It seems like you want to argue “PPO only chisels circuits which implement a direct translator/honest reporter, if the reinforcement schedule can perfectly judge AI honesty in any possible situation.” This claim sounds highly suspicious to me. How does our existing knowledge rule out “you provide reward events for being honest, and these events are usually correct, and the AI learns a circuit from its world-model to its outputs”?
I think the usual answer is “we want an ELK solution to work in the worst-case.” But then it’s still unclear that the “only… if” is true. I don’t think that the “if” is sufficient or necessary to get an ELK solution, and I don’t know how I could be confident about even sufficiency (whereas I do believe that it’s not necessary). “Ensure the reward function is ‘correct’ across all possible training situations” seems like a red herring to me.