I didn’t read this whole post, but I thought it would be worth noting that I do actually think trying to align AIs to be reward seekers might improve the situation in some intermediate/bootstrap regimes: it might reduce the chance of scheming for long-run objectives, and we could maybe more easily manage the safety issues that reward seekers do pose. (The exact way in which the AI is a reward seeker will affect the safety profile: multiple things might be consistent with “wanting” to perform well on the training metric, e.g. wanting to be the kind of AI that gets selected for.)
(The way I’m typically thinking about it looks somewhat different from the way you describe reward button alignment. E.g., I’m often imagining we’re still trying to make the AI myopic within RL episodes if we go for reward seeking as an alignment strategy. This could help reduce the risk of the AI seizing control over the reward process.)
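To make “myopic within RL episodes” a bit more concrete, here is a minimal sketch (my own illustration, not something from the post or the comment above) of the distinction under one plausible reading: the training signal credited to the policy comes only from rewards inside the current episode, with no term that depends on later episodes.

```python
from typing import List

def within_episode_return(rewards: List[float], discount: float = 1.0) -> float:
    """Myopic case: the policy is graded only on this episode's rewards,
    with no cross-episode bootstrapping or carried-over value."""
    return sum((discount ** t) * r for t, r in enumerate(rewards))

def cross_episode_return(episodes: List[List[float]], discount: float = 1.0) -> float:
    """Non-myopic contrast case: the objective also counts rewards from
    future episodes, i.e. the kind of longer horizon that makes seizing
    control of the reward process more attractive."""
    total, t = 0.0, 0
    for episode in episodes:
        for r in episode:
            total += (discount ** t) * r
            t += 1
    return total

# Example: a within-episode-myopic reward seeker is only credited for the
# current episode, even if later episodes would add more reward.
print(within_episode_return([1.0, 0.0, 2.0]))          # 3.0
print(cross_episode_return([[1.0, 0.0, 2.0], [5.0]]))  # 8.0
```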