(Huh, I never saw this—maybe my weekly batched updates are glitched? I only saw this because I was on your profile for some other reason.)
I really appreciate these thoughts!
But you then propose an RL scheme. It still seems to me a useful form of critique to say: here are the upward errors in the proposed rewards, and here is the policy that would exploit them.
I would say “that isn’t how on-policy RL works; it doesn’t just intelligently find increasingly high-reinforcement policies; which reinforcement events get ‘exploited’ depends on the exploration policy.” (You seem to guess that this is my response in the next sub-bullets.)
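To make that point concrete, here is a toy sketch of my own (not anything from the original proposal): a 3-arm bandit where one arm's reward has an upward specification error. Whether that error ever gets reinforced depends entirely on whether the exploration policy samples the arm; on-policy value learning doesn't "find" the error from outside its own trajectory distribution.

```python
import random

# Toy illustration (hypothetical setup, not from the discussion above):
# arm 2's reward is an upward error in the reward specification.
true_reward = [0.5, 0.6, 5.0]

def run(explorable_arms, steps=1000, eps=0.1, seed=0):
    """Epsilon-greedy value learning restricted to `explorable_arms`."""
    rng = random.Random(seed)
    q = [0.0, 0.0, 0.0]  # learned value estimates
    n = [0, 0, 0]        # visit counts
    for _ in range(steps):
        if rng.random() < eps:
            arm = rng.choice(explorable_arms)              # exploration policy
        else:
            arm = max(explorable_arms, key=lambda a: q[a])  # greedy choice
        n[arm] += 1
        q[arm] += (true_reward[arm] - q[arm]) / n[arm]  # incremental mean update
    return q

# Exploration covers all arms: the erroneous reward is sampled,
# reinforced, and then exploited by the greedy policy.
q_full = run([0, 1, 2])

# Exploration never samples arm 2: the error produces no reinforcement
# events, so the learned values never reflect it.
q_restricted = run([0, 1])
```

With full exploration, `q_full[2]` converges to the erroneous 5.0 and arm 2 dominates; with the restricted exploration policy, `q_restricted[2]` stays at its initial 0.0 and the error is simply never exploited. The point isn't that restricted exploration is a safety scheme, just that "which reinforcement events get exploited" is a fact about the exploration distribution, not about the reward function alone.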
While I find the particular examples intuitive, the overall claim seems too good to be true: effectively, that the path-dependencies which differentiate GD learning from ideal Bayesian learning are exactly the tool we need for alignment.
Shrug, "too good to be true" isn't a causal reason for it not to work, of course, and I don't see anything suspicious in the correlations. Effective learning algorithms may indeed have nice properties we want, especially if some humans have those same nice properties due to their own effective learning algorithms!