evhub comments on Thoughts on implementing corrigible robust alignment

evhub 26 Nov 2019 23:07 UTC
LW: 7 AF: 4
I really enjoyed this post; thanks for writing this! Some comments:

the AGI uses its understanding of humans to try to figure out what a human would do in a hypothetical scenario.

I think that supervised amplification can also be thought of as falling into this category, in that you often want your model to be internally modeling what an HCH would do in a hypothetical scenario. Of course, training a model with supervised amplification does not guarantee that you get a model which is in fact just trying to guess what an HCH would do; you might instead get one that is doing something more strategic and/or deceptive. In many cases, though, the goal at least is to get something that’s just trying to approximate HCH.

So that suggests an approach of pre-loading this template database with a hardcoded model of a human, complete with moods, beliefs, and so on.

This is actually quite similar to an approach that Nevan Witchers at Google is working on, which is to hardcode a differentiable model of the reward function as a component in your network when doing RL. The idea there is very similar: prevent the model from learning a proxy by giving it direct access to the actual structure of the reward function, rather than having it learn only from the rewards that were observed during training. The two major difficulties I see with this style of approach, however, are that 1) it requires you to have an explicit differentiable model of the reward function and 2) it still requires the model to learn the policy and value functions (the value function being how much future discounted reward the model expects to get using its current policy starting from some state), and learning those could still allow for the introduction of misaligned proxies.
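To make the "direct access to the structure of the reward function" idea concrete, here is a minimal toy sketch. It is my own illustration and not the actual approach being worked on at Google. Everything in it is an assumption: the quadratic `reward`, its hand-written derivative `reward_grad_action`, the one-parameter linear policy, and the one-step setting. The point is only that the policy is updated by differentiating through the known reward function instead of fitting to observed reward values.

```python
import random


def reward(state, action):
    """Hypothetical known reward: maximal (zero) when action == 2 * state."""
    return -(action - 2.0 * state) ** 2


def reward_grad_action(state, action):
    """Analytic derivative of the hypothetical reward with respect to the action."""
    return -2.0 * (action - 2.0 * state)


def train_policy(steps=500, lr=0.05, seed=0):
    """Gradient ascent on a linear policy, action = w * state.

    The update differentiates through the reward function itself, so the
    policy never has to infer the reward's structure from sampled returns.
    """
    rng = random.Random(seed)
    w = 0.0
    for _ in range(steps):
        state = rng.uniform(-1.0, 1.0)
        action = w * state
        # Chain rule: d reward / d w = (d reward / d action) * (d action / d w)
        grad_w = reward_grad_action(state, action) * state
        w += lr * grad_w
    return w


learned_w = train_policy()
print(f"learned policy weight: {learned_w:.4f}")
```

Because this toy is one-step, there is no value function to learn, so it sidesteps difficulty 2) entirely. In a multi-step setting the policy and value functions would still be learned from experience, which is where a misaligned proxy could still creep in.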