Steven, thanks for writing this!
Isn’t it just the case that the human brain’s ‘interpretability technique’ is just really robust? The technique (in this case, having an accurate model of what others feel) is USEFUL for many other aspects of life. Maybe this is a crux? To my knowledge, we haven’t tried that hard to make interpretability techniques and probes robust, at least not in a way where ‘activations being easily monitorable’ is closely correlated with ‘playing training games well’, for lack of better wording.
A result that comes to mind, but isn’t directly relevant, is this tweet from the real-time hallucination detection paper. Co-training a LoRA adapter and a downstream probe for hallucination detection made the LLM more epistemically cautious with no other supervised training signals. Maybe we can use this probe to RL against hallucinations while also training the probe at each step?
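To make the “RL against the probe while retraining the probe at each step” idea concrete, here is a minimal sketch of the two pieces involved. It is not the paper’s method: the logistic-regression probe, the function names (`probe_score`, `probe_sgd_step`, `shaped_reward`), the `penalty` weight, and the assumption that a hallucination label is available for each sample are all hypothetical choices of mine.

```python
import math
from typing import List, Sequence, Tuple


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def probe_score(w: Sequence[float], b: float, act: Sequence[float]) -> float:
    """Probability the (hypothetical) linear probe assigns to 'hallucination'."""
    return sigmoid(sum(wi * ai for wi, ai in zip(w, act)) + b)


def probe_sgd_step(
    w: Sequence[float], b: float, act: Sequence[float], label: float, lr: float = 0.1
) -> Tuple[List[float], float]:
    """One logistic-regression SGD step on a single (activation, label) pair.

    Assumes a label (1.0 = hallucination, 0.0 = not) comes from some external source.
    """
    err = probe_score(w, b, act) - label  # gradient of the BCE loss w.r.t. the logit
    new_w = [wi - lr * err * ai for wi, ai in zip(w, act)]
    return new_w, b - lr * err


def shaped_reward(
    task_reward: float, w: Sequence[float], b: float, act: Sequence[float], penalty: float = 1.0
) -> float:
    """Task reward minus a penalty proportional to the probe's hallucination score."""
    return task_reward - penalty * probe_score(w, b, act)
```

In a full loop, each RL step would score the policy with `shaped_reward` and then refresh the probe with `probe_sgd_step` on the newest activations. Whether the probe stays accurate under that optimization pressure is exactly the open question.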
I have yet to read your previous posts linked here. I imagine some of my questions will be answered once I find time to look through them haha.
I don’t think it’s that robust even in humans, despite the mitigation described in this post. (Without that mitigation, I think it would be hopeless.)
If we’re worried about a failure mode of the form “the interpretability technique has been routed around”, then that’s unrelated to “The technique (in this case, having an accurate model of what others feel) is USEFUL for many other aspects of life”. For the failure mode that Yudkowsky & Zvi were complaining about, if that failure mode actually happened, there would still be an accurate model. It would just be an accurate model that is invisible to the interpretability technique.
I.e. the beliefs box would still be working fine, but the connection to desires would be weak or absent.
And I do think that happens plenty in the human world.
Maybe the best example (at least from my own perspective) is the social behavior of (many) smart autistic adults. [Copying from here:] The starting point is that innate social reactions (e.g. the physiological arousal triggered by eye contact) are so strong that they’re often overwhelming. People respond to that by (I think) relating to other people in a way that generally avoids triggering certain innate social reactions. This includes (famously) avoiding eye contact, but I think also includes various hard-to-describe unconscious attention-control strategies. So at the end of the day, neurotypical people will have an unconscious innate snap reaction to (e.g.) learning that someone is angry at them, whereas autistic people won’t have that snap reaction, because they have an unconscious coping strategy to avoid triggering it, that they’ve used since early childhood, because the reaction is so unpleasant. Of course, they’ll still understand intellectually perfectly well that the person is angry. As one consequence of that, autistic people (naturally) have trouble modeling how neurotypical people will react to different social situations, and conversely, neurotypical people will misunderstand and misinterpret the social behaviors of autistic people.
Still, socially-attentive smart autistic adults sometimes become good (indeed, sometimes better than average) at predicting the behavior of neurotypical people, if they put enough work into it.
(People can form predictive models of other people just using our general ability to figure things out, just like we can build predictive models of car engines or whatever.)
That’s just one example. I discuss other (maybe less controversial) examples in my Sympathy Reward post §4.1 and Approval Reward post §6.
Ah, this makes sense, thank you. Then I guess the crux is figuring out how to isolate the beliefs and desires boxes in AI systems so we can have this open loop. Gradient routing has potential here, as cloud commented.
Another possible method that just occurred to me (no idea whether this is any good, inviting feedback):
- Use interpretability technique to flag bad behaviors during RL for some task X.
- When bad behavior is flagged, train the model on a corpus that effectively represents the ‘opposite’ of this bad behavior (for example, if caught lying, train on corpus that induces honesty).
The intuition is that we’d want this corpus to activate whatever parts of the network represent a specific desire (we accept that we don’t know where these are), and that it is possible to come up with training documents that effectively update ‘desires’ via SFT or other algorithms. I think methods/ideas from influence functions/token-level attribution may help with constructing such corpora, or with finding more direct ways of updating the desire parts of the network.
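As a rough illustration of the control flow I have in mind (a toy sketch, not a real training setup): everything here is hypothetical, including `ToyModel`, which stands in for an LLM by storing one scalar ‘desire’ per behavior, and `probe_flags`, which reads that scalar directly. A real probe would have no such guaranteed access, which is the hard part.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ToyModel:
    # Hypothetical stand-in for an LLM: one scalar "desire" strength per behavior.
    desires: Dict[str, float] = field(default_factory=lambda: {"lying": 0.8})
    log: List[str] = field(default_factory=list)


def probe_flags(model: ToyModel, behavior: str, threshold: float = 0.5) -> bool:
    """Toy 'interpretability technique': flags a behavior whose desire exceeds the threshold."""
    return model.desires[behavior] > threshold


def rl_step(model: ToyModel) -> None:
    """Placeholder for one RL update on task X (only records that it happened)."""
    model.log.append("rl")


def sft_on_counter_corpus(model: ToyModel, behavior: str, lr: float = 0.5) -> None:
    """Placeholder for SFT on the 'opposite' corpus: assumed to shrink the flagged desire."""
    model.desires[behavior] *= 1.0 - lr
    model.log.append(f"sft:{behavior}")


def train(model: ToyModel, steps: int, counter_corpora: Dict[str, List[str]]) -> ToyModel:
    for _ in range(steps):
        rl_step(model)
        for behavior in counter_corpora:
            if probe_flags(model, behavior):
                # The flag triggers a separate SFT update on the counter-corpus.
                sft_on_counter_corpus(model, behavior)
    return model


model = train(ToyModel(), steps=3, counter_corpora={"lying": ["<honesty documents>"]})
```

Note that the flag never enters the RL reward here. It only triggers a separate SFT update, which is the property I’d hope reduces the pressure to route around the probe.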