Adding some thoughts that came out of a conversation with Thomas Kwa:
Gradient hacking seems difficult. Humans have pretty weak introspective access to their goals: I have a hard time determining whether my goals have changed or whether I have simply gained information about what they are. There isn't a good reason to believe that the AIs we build will be different.
Doesn't this post assume we have the transparency capabilities to verify that the AI has human-value-preserving goals, and that the AI can use those capabilities? The strategy seems relevant if these tools verifiably generalize to smarter-than-human AIs and it's easy to build aligned human-level AIs.
I'm guessing that you are referring to this:
Another strategy is to use intermittent oversight – i.e. get an amplified version of the current aligned model to (somehow) determine whether the upgraded model has the same objective before proceeding.
The intermittent oversight strategy does depend on some level of transparency. This is only one of the ideas I mentioned, though (and it is not original). The post in general does not assume anything about our transparency capabilities.