Oh, we’re using terminology quite differently then. I would not call (a) reward hacking, as I view the reward model as being the reward (to the RL process), whereas humans are not providing reward at all (but rather some data that gets fed into the reward model’s learning process). I don’t especially care about which definitions we use here, but I do wonder if this means we’re speaking past each other in other areas as well.