ryan_greenblatt comments on AI Safety in a World of Vulnerable Machine Learning Systems

ryan_greenblatt 19 Jun 2023 4:40 UTC

Collecting fresh human data for that is prohibitive, so we rely on a reward model—unfortunately that gets hacked.

Are you assuming that we can’t collect human data online as the policy optimizes against the reward model? (People currently do collect data online to avoid getting hacked like this.) The case with no online data seems probably hopeless to me without very strong regularization (I think you agree that it is mostly hopeless), but it also seems easy to avoid by just collecting human data online.
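To make the offline/online distinction concrete, here is a minimal toy sketch, not a description of any real RLHF pipeline. Everything in it is an assumption made for illustration: the one-dimensional integer "policy" `x`, the hypothetical `human_reward` function standing in for human labelers, and the crude `reward_model` that fits a line through the two nearest labeled points. The only difference between the two runs is whether the policy's current output gets a fresh human label before each optimization step.

```python
def human_reward(x: int) -> int:
    """Hypothetical stand-in for human judgment; peaks at x = 5."""
    return -(x - 5) ** 2


def reward_model(x: int, labels: dict[int, int]) -> float:
    """Toy reward model: a line through the two labeled points nearest x."""
    a, b = sorted(labels, key=lambda p: (abs(p - x), p))[:2]
    slope = (labels[b] - labels[a]) / (b - a)
    return labels[a] + slope * (x - a)


def optimize(online: bool, steps: int = 30, max_x: int = 20) -> int:
    """Greedily optimize x against the reward model; return the final x."""
    # Initial human data only covers the region near the starting policy.
    labels = {0: human_reward(0), 1: human_reward(1)}
    x = 0
    for _ in range(steps):
        if online:
            # Online collection: label the policy's current output.
            labels[x] = human_reward(x)
        # Policy step: move to whichever neighbor the reward model prefers
        # (ties are broken in favor of staying put).
        candidates = [c for c in (x, x - 1, x + 1) if 0 <= c <= max_x]
        x = max(candidates, key=lambda c: reward_model(c, labels))
    return x


offline_x = optimize(online=False)
online_x = optimize(online=True)
print(offline_x, human_reward(offline_x))
print(online_x, human_reward(online_x))
```

In this toy, the offline run keeps extrapolating the initial upward slope and ends at the boundary `x = 20`, where the true reward is -225. The online run briefly overshoots the optimum, gets that overshoot labeled, and settles at `x = 5` with true reward 0. This says nothing about how expensive online labels are in practice, which is the real point of disagreement.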