There’s some possible world in which the following approach to interpretability works:
1. Put an AGI in a bunch of situations where it's sometimes incentivised to lie and sometimes incentivised to tell the truth.
2. Train a lie detector which is given all of the AGI's neural weights as input (a rough toy sketch of this step is below).
3. Then ask the AGI lots of questions about its plans.
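As a very rough sketch of what step 2 might look like in practice, here's a toy linear probe in PyTorch. Two big simplifying assumptions: it probes a vector of activations rather than the full set of weights, and it presupposes we already have episodes reliably labelled as incentivised-to-lie vs incentivised-to-tell-the-truth. All the names and data below are hypothetical placeholders, not a claim about how this would actually be built.

```python
# Toy sketch of a "lie detector" probe. Assumes (1) we probe activations rather than
# the full weights, and (2) we have labelled truthful/deceptive episodes already.
import torch
import torch.nn as nn

class LieDetectorProbe(nn.Module):
    """Linear probe: activation vector -> logit for 'the model is lying'."""
    def __init__(self, d_model: int):
        super().__init__()
        self.linear = nn.Linear(d_model, 1)

    def forward(self, activations: torch.Tensor) -> torch.Tensor:
        return self.linear(activations).squeeze(-1)  # logits, shape (N,)

def train_probe(acts: torch.Tensor, labels: torch.Tensor, epochs: int = 100) -> LieDetectorProbe:
    """acts: (N, d_model) activations from labelled episodes; labels: (N,) 1 = lying, 0 = truthful."""
    probe = LieDetectorProbe(acts.shape[-1])
    opt = torch.optim.Adam(probe.parameters(), lr=1e-3)
    loss_fn = nn.BCEWithLogitsLoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(probe(acts), labels.float())
        loss.backward()
        opt.step()
    return probe

if __name__ == "__main__":
    # Random placeholder data standing in for real activations and labels.
    d_model, n = 512, 1000
    acts = torch.randn(n, d_model)
    labels = torch.randint(0, 2, (n,))
    probe = train_probe(acts, labels)
    # At question-answering time (step 3), run the probe on the activations behind each answer:
    p_lying = torch.sigmoid(probe(torch.randn(4, d_model)))
    print(p_lying)
```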
One problem that this approach would face if we were using it to interpret a human is that the human might not be consciously aware of what their motivations are. For example, they may believe they are doing something for altruistic reasons, when in fact their unconscious motivations are primarily to look good. And the motivations we are least conscious of are exactly the ones which it's most disadvantageous for others to know about.
So would using such an interpretability technique on an AGI work? I guess one important question is something like: by default, would the AGI be systematically biased when talking about its plans, like humans are? Or is this something which only arises when there are selection pressures during training for hiding information?
One way we could avoid this problem: instead of a "lie detector", you could train a "plan identifier", which takes an AGI brain and tells you what that AGI is going to do in English. I'm a little less optimistic about this, since I think that gathering training data will be the big bottleneck either way, and getting enough data to train a plan identifier that's smart enough to generalise to a wide range of plans seems pretty tricky. (By contrast, the lie detector might not need to know very much about the *content* of the lies.) A toy sketch of this contrast is below.
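For contrast, here's an equally toy sketch of what a "plan identifier" might look like, again probing activations and using hypothetical stand-in components throughout (the vocabulary, decoder, and data are all placeholders). The point is just the shape of the problem: the target is now an open-ended English description rather than a single lying/truthful bit, which is one way of seeing why the data requirements look so much steeper.

```python
# Toy sketch of a "plan identifier": decode an English plan description from an activation vector.
import torch
import torch.nn as nn

class PlanIdentifier(nn.Module):
    def __init__(self, d_model: int, vocab_size: int, d_hidden: int = 256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_hidden)
        self.init_state = nn.Linear(d_model, d_hidden)  # condition the decoder on the activations
        self.rnn = nn.GRU(d_hidden, d_hidden, batch_first=True)
        self.out = nn.Linear(d_hidden, vocab_size)

    def forward(self, activations: torch.Tensor, target_tokens: torch.Tensor) -> torch.Tensor:
        # Teacher forcing: predict each token of the plan description from the previous ones,
        # with the decoder's initial state derived from the AGI's activations.
        h0 = torch.tanh(self.init_state(activations)).unsqueeze(0)  # (1, B, d_hidden)
        emb = self.embed(target_tokens[:, :-1])                     # (B, T-1, d_hidden)
        hidden, _ = self.rnn(emb, h0)
        return self.out(hidden)                                     # logits over next tokens

if __name__ == "__main__":
    # Training would minimise cross-entropy against human-written plan descriptions;
    # the labels here are random placeholder "sentences", not single bits.
    d_model, vocab_size, B, T = 512, 1000, 8, 12
    model = PlanIdentifier(d_model, vocab_size)
    acts = torch.randn(B, d_model)
    plans = torch.randint(0, vocab_size, (B, T))  # tokenised plan descriptions (toy data)
    logits = model(acts, plans)
    loss = nn.CrossEntropyLoss()(logits.reshape(-1, vocab_size), plans[:, 1:].reshape(-1))
    print(loss.item())
```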