I talk about aligning to human preferences that change over time, and Gillian Hadfield has pointed out that human preferences are specific to the current societal equilibrium. (Here by “human preferences” we’re talking about things that humans will say/endorse right now, as opposed to true/normative values.)
I’ve also had discussions at CHAI about whether we should expect humans to have adversarial examples (usually vision, not values). Also at CHAI, we often expect humans to be wrong about what they want; that’s what motivates a lot of work on figuring out how to interpret human input pragmatically rather than literally.
My previous lower-key attempts to discuss this with people who work in ML-based AI safety were mostly met with silence, which is also confusing.
I can’t speak to all of the cases, but in the example you point out, you’re asking about a paper whose main author is not on this forum (or at least, I’ve never seen a contribution from him, though he could have a username I don’t know). People are busy, it’s hard to read everything, let alone respond to everything.
Is this not an obvious observation?
I personally think that the analogy is weaker than it seems. Humans have system 2 / explicit reasoning / knowing-what-they-know ability that should be able to detect when they are facing an “adversarial example”. I certainly expect people’s quick, intuitive value judgments to be horribly wrong anywhere outside of current environments; it’s not obvious to me that’s true for probabilities given by explicit reasoning.
Are safety researchers too focused on narrow technical problems to consider the bigger picture?
I think that was true for me, but also I was (and still am) new to AI safety, so I doubt this generalizes to researchers who have been in the area for longer.
Badly designed institutions are exerting pressures to not look in this direction?
It certainly feels harder to me to publish papers on this topic.
Something else I’m not thinking of?
I think that to the extent you are hoping to scale up safety in parallel with capabilities, which feels a lot more tractable than solving the full problem in one go, this is not a problem you have to deal with yet, and you can outsource it to the future. This seems to be the general approach of “ML-based AI safety”. I think the people who are trying to aim at the full problem without going through current techniques (e.g. MIRI, you, Paul) do think about these sorts of problems.
I’ve also had discussions at CHAI about whether we should expect humans to have adversarial examples (usually vision, not values).
I noticed that in your recent FLI interview, you discussed applying the idea of distributional shifts to human values. Did you write about that or talk to anyone about it before I made my posts?
I can’t speak to all of the cases, but in the example you point out, you’re asking about a paper whose main author is not on this forum (or at least, I’ve never seen a contribution from him, though he could have a username I don’t know). People are busy, it’s hard to read everything, let alone respond to everything.
Oh, I didn’t realize that, and thought the paper was more of a team effort. However, as far as I can tell, there hasn’t been a lot of discussion about the paper online, and the comments I wrote under the AF post might be the only substantial public online comments on the paper, so “let alone respond to everything” doesn’t seem to make much sense here.
I certainly expect people’s quick, intuitive value judgments to be horribly wrong anywhere outside of current environments; it’s not obvious to me that’s true for probabilities given by explicit reasoning.
It seems plausible that, given enough time and opportunities to discuss with other friendly humans, explicit reasoning can eventually converge on correct judgments. But explicit reasoning can certainly be wrong very often in the short or even medium run, and even that eventual convergence might happen only for the small fraction of humans who are especially good at explicit reasoning. I think there are also likely analogies to adversarial examples for explicit human reasoning, in the form of arguments that are extremely persuasive but wrong.
That said, humans do have a safety mechanism of sorts: system 1 and system 2 can cross-check each other and make us feel confused when they disagree. But that doesn’t always work, since the two systems can be wrong in the same direction (motivated cognition happens pretty often), or one system can be so confident that it overrides the other. (Also, it’s not clear what the safe thing to do is, decision-making-wise, when we are confused about our values.)
This safety mechanism may work well enough to prevent other humans from exploiting our adversarial examples much of the time, but it seems unlikely to hold up under heavier optimization power. (You could perhaps consider things like Nazism, conspiracy theories, and cults to be examples of successful exploitation by other humans.)
(I wonder if people have tried to apply the idea of heterogeneous systems cross-checking each other to adversarial examples in ML. Have you seen any literature on this?)
I think that to the extent you are hoping to scale up safety in parallel with capabilities, which feels a lot more tractable than solving the full problem in one go, this is not a problem you have to deal with yet, and you can outsource it to the future.
I guess that makes sense, but even then there should at least be an acknowledgement that the problem exists and needs to be solved in the future?
I noticed that in your recent FLI interview, you discussed applying the idea of distributional shifts to human values. Did you write about that or talk to anyone about it before I made my posts?
I had not written about it, but I had talked about it before your posts. If I remember correctly, I started finding the concept of distributional shifts very useful and applying it to everything around May of this year. Of course, I had been thinking about it recently because of your posts so I was more primed to bring it up during the podcast.
so “let alone respond to everything” doesn’t seem to make much sense here.
Fair point, this was more me noticing how much time I need to set aside for discussion here (which I think is valuable! But it is a time sink).
I think there are also likely analogies to adversarial examples for explicit human reasoning, in the form of arguments that are extremely persuasive but wrong.
Yeah, I think on balance I agree. But explicit human reasoning has generalized well to environments far outside of the ancestral environment, so this case is not as clear/obvious. (Adversarial examples in current ML models can be thought of as failures of generalization very close to the training distribution.)
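To make that last parenthetical concrete, here is a toy sketch of my own (not from any particular paper): even a simple linear classifier has inputs that sit an imperceptibly small step away from a correctly classified point yet flip its prediction, which is exactly a generalization failure right next to the training distribution.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 100

# Toy linear classifier: predict +1 iff w @ x > 0.
w = rng.choice([-1.0, 1.0], size=dim)

# A point classified +1 with margin 0.5 (project a random point
# onto the hyperplane w @ x = 0.5).
x = rng.normal(size=dim)
x = x - ((w @ x - 0.5) / (w @ w)) * w
assert w @ x > 0

# Step of size eps in each coordinate, against the gradient of the
# score (for a linear model that gradient is just w).
eps = 0.01
x_adv = x - eps * np.sign(w)

print(w @ x)            # margin before: 0.5
print(w @ x_adv)        # margin after: 0.5 - eps * dim = -0.5, prediction flips
print(np.max(np.abs(x_adv - x)))  # yet no coordinate moved by more than 0.01
```

The flip is the linear analogue of the fast-gradient-sign construction: a per-coordinate perturbation of 0.01 is tiny relative to the input, but summed across 100 dimensions it overwhelms a margin of 0.5.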
(I wonder if people have tried to apply the idea of heterogeneous systems cross-checking each other to adversarial examples in ML. Have you seen any literature on this?)
Not exactly, but some related things that I’ve picked up from talking to people but don’t have citations for:
Adversarial examples tend to transfer across different model architectures for deep learning, so ensembles mostly don’t help.
There’s been work on training a system that can detect adversarial perturbations, separately from the model that classifies images.
There has been work on using non-deep learning methods for image classification to avoid adversarial examples. I believe approaches that are similar-ish to nearest neighbors tend to be more robust. I don’t know if anyone has tried combining these with deep learning methods to get the best of both worlds.
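For what it’s worth, the nearest-neighbor robustness intuition can be made precise (a toy sketch under my own framing, not from any of the work alluded to above): a 1-nearest-neighbor classifier under the L2 metric comes with a free certified radius, namely half the gap between the distance to the nearest point of the predicted class and the distance to the nearest point of any other class, since no perturbation smaller than that can change which class is closest.

```python
import numpy as np

def one_nn_with_radius(train_x, train_y, x):
    """1-NN prediction plus a certified L2 robustness radius.

    A perturbation of norm r changes each distance by at most r, so the
    prediction cannot change while r < (d_other - d_same) / 2, where
    d_same is the distance to the nearest point of the predicted class
    and d_other the distance to the nearest point of any other class.
    """
    dists = np.linalg.norm(train_x - x, axis=1)
    pred = train_y[np.argmin(dists)]
    d_same = dists[train_y == pred].min()
    d_other = dists[train_y != pred].min()
    return pred, (d_other - d_same) / 2.0

# Tiny 2-D example with two well-separated clusters.
train_x = np.array([[0.0, 0.0], [0.2, 0.1], [3.0, 3.0], [3.1, 2.9]])
train_y = np.array([0, 0, 1, 1])

pred, radius = one_nn_with_radius(train_x, train_y, np.array([0.1, 0.0]))
print(pred, radius)  # class 0, with a provable robustness radius ~2
```

Deep models give no such certificate out of the box, which is presumably part of why the nearest-neighbor-flavored approaches seem more robust.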
I guess that makes sense, but even then there should at least be an acknowledgement that the problem exists and needs to be solved in the future?
Agreed, but I think people are focusing on their own research agendas in public writing, since public writing is expensive. I wouldn’t engage as much as I do if I weren’t writing the newsletter. By default, if you agree with a post, you say nothing, and if you disagree, you leave a comment. So generally I take silence as a weak signal that people agree with the post.
This is primarily about public writing; if you talked to researchers in private, I would guess that most of them would explicitly agree that this is a problem worth thinking about. (I’m not sure they’d agree that the problem exists, more that it’s sufficiently plausible that we should think about it.)
Thanks, this is helpful to me.