We know that ML systems tend to have problems with adversarial examples and distributional shifts in general. There seems to be no reason not to expect that human value functions have similar problems.
To help me better model AI safety researchers (in order to better understand my comparative advantage as well as the current strategic situation with regard to AI risk), I’m interested to know why I seem to be the first person to point out or at least publicize this seemingly obvious parallel. (Humans can be seen as a form of machine intelligence made up at least in part of a bunch of ML-like modules and “designed” with little foresight. Why wouldn’t we have ML-like safety problems?) My previous lower-key attempts to discuss this with people who work in ML-based AI safety were mostly met with silence, which is also confusing. (See one example here.)
Is this not an obvious observation? Are safety researchers too focused on narrow technical problems to consider the bigger picture? Do they fear to speak due to PR concerns? Are they too optimistic about solving AI safety and don’t want to consider evidence that the problem may be harder than they think? Are badly designed institutions exerting pressures not to look in this direction? Something else I’m not thinking of?
I think this is under-discussed, but also that I have seen many discussions in this area. E.g. I have seen it come up, and have brought it up myself, in the context of Paul’s research agenda, where success relies on humans being able to play their part safely in the amplification system. Many people say they are more worried about misuse than accident on the basis of the corruption issues (and much discussion about CEV and idealization, superstimuli, etc. addresses the kind of path-dependence and adversarial search you mention).
However, those varied problems mostly aren’t formulated as ‘ML safety problems in humans’ (I have seen robustness and distributional shift discussion for Paul’s amplification, and daemons/wireheading/safe-self-modification for humans and human organizations), yet that seems like a productive framing for systematic exploration: going through the known inventories and trying to see how they cross-apply.
I agree with all of this but I don’t think it addresses my central point/question. (I’m not sure if you were trying to, or just making a more tangential comment.) To rephrase, it seems to me that ‘ML safety problems in humans’ is a natural/obvious framing that makes clear that alignment to human users/operators is likely far from sufficient to ensure the safety of human-AI systems, that in some ways corrigibility is actually opposed to safety, and that there are likely technical angles of attack on these problems. It seems surprising that someone like me had to point out this framing to people who are intimately familiar with ML safety problems, and also surprising that they largely respond with silence.
in some ways corrigibility is actually opposed to safety
We can talk about “corrigible by X” for arbitrary X. I don’t think these considerations imply a tension between corrigibility and safety; they just suggest “humans in the real world” may not be the optimal X. You might prefer to use an appropriate idealization of humans / humans in some safe environment / etc.
To the extent that even idealized humans are not perfectly safe (e.g., perhaps a white-box metaphilosophical approach is even safer), and that corrigibility seems to conflict with greater transparency and hence cooperation between AIs, there still seems to be some tension between corrigibility and safety even when X = idealized humans.
ETA: Do you think IDA can be used to produce an AI that is corrigible by some kind of idealized human? That might be another approach that’s worth pursuing if it looks feasible.
ETA: Do you think IDA can be used to produce an AI that is corrigible by some kind of idealized human? That might be another approach that’s worth pursuing if it looks feasible.
Yes.
I talk about aligning to human preferences that change over time, and Gillian Hadfield has pointed out that human preferences are specific to the current societal equilibrium. (Here by “human preferences” we’re talking about things that humans will say/endorse right now, as opposed to true/normative values.)
I’ve also had discussions at CHAI about whether we should expect humans to have adversarial examples (usually vision, not values). Also at CHAI, we often expect humans to be wrong about what they want; that’s what motivates a lot of work on figuring out how to interpret human input pragmatically instead of literally.
My previous lower-key attempts to discuss this with people who work in ML-based AI safety were mostly met with silence, which is also confusing.
I can’t speak to all of the cases, but in the example you point out, you’re asking about a paper whose main author is not on this forum (or at least, I’ve never seen a contribution from him, though he could have a username I don’t know). People are busy, it’s hard to read everything, let alone respond to everything.
Is this not an obvious observation?
I personally think that the analogy is weaker than it seems. Humans have system 2 / explicit reasoning / knowing-what-they-know ability that should be able to detect when they are facing an “adversarial example”. I certainly expect people’s quick, intuitive value judgments to be horribly wrong anywhere outside of current environments; it’s not obvious to me that’s true for probabilities given by explicit reasoning.
Are safety researchers too focused on narrow technical problems to consider the bigger picture?
I think that was true for me, but also I was (and still am) new to AI safety, so I doubt this generalizes to researchers who have been in the area for longer.
Are badly designed institutions exerting pressures not to look in this direction?
It certainly feels harder to me to publish papers on this topic.
Something else I’m not thinking of?
I think that to the extent you are hoping to scale up safety in parallel with capabilities, which feels a lot more tractable than solving the full problem in one go, this is not a problem you have to deal with yet, and you can outsource it to the future. This seems to be the general approach of “ML-based AI safety”. I think the people that are trying to aim at the full problem without going through current techniques (e.g. MIRI, you, Paul) do think about these sorts of problems.
Thanks, this is helpful to me.
I’ve also had discussions at CHAI about whether we should expect humans to have adversarial examples (usually vision, not values).
I noticed that in your recent FLI interview, you discussed applying the idea of distributional shifts to human values. Did you write about that or talk to anyone about it before I made my posts?
I can’t speak to all of the cases, but in the example you point out, you’re asking about a paper whose main author is not on this forum (or at least, I’ve never seen a contribution from him, though he could have a username I don’t know). People are busy, it’s hard to read everything, let alone respond to everything.
Oh, I didn’t realize that, and thought the paper was more of a team effort. However, as far as I can tell there hasn’t been a lot of discussion about the paper online, and the comments I wrote under the AF post might be the only substantial public online comments on the paper, so “let alone respond to everything” doesn’t seem to make much sense here.
I certainly expect people’s quick, intuitive value judgments to be horribly wrong anywhere outside of current environments; it’s not obvious to me that’s true for probabilities given by explicit reasoning.
It seems plausible that given enough time and opportunities to discuss with other friendly humans, explicit reasoning can eventually converge upon correct judgments, but explicit reasoning can certainly be wrong very often in the short or even medium run, and even eventual convergence might happen only for a small fraction of all humans who are especially good at explicit reasoning. I think there are also likely analogies to adversarial examples for explicit human reasoning, in the form of arguments that are extremely persuasive but wrong.
But I guess humans do have a safety mechanism, in that system 1 and system 2 can cross-check each other and make us feel confused when they don’t agree. That doesn’t always work, though, since these systems can be wrong in the same direction (motivated cognition happens pretty often), or one system can be so confident that it overrides the other. (Also, it’s not clear what the safe thing to do is, in terms of decision making, when we are confused about our values.)
This safety mechanism may work well enough to prevent exploitation of adversarial examples by other humans a lot of the time, but seems unlikely to hold up under heavier optimization power. (You could perhaps consider things like Nazism, conspiracy theories, and cults to be examples of successful exploitation by other humans.)
(I wonder if people have tried to apply the idea of heterogeneous systems cross-checking each other to adversarial examples in ML. Have you seen any literature on this?)
I think that to the extent you are hoping to scale up safety in parallel with capabilities, which feels a lot more tractable than solving the full problem in one go, this is not a problem you have to deal with yet, and you can outsource it to the future.
I guess that makes sense, but even then there should at least be an acknowledgement that the problem exists and needs to be solved in the future?
I noticed that in your recent FLI interview, you discussed applying the idea of distributional shifts to human values. Did you write about that or talk to anyone about it before I made my posts?
I had not written about it, but I had talked about it before your posts. If I remember correctly, I started finding the concept of distributional shifts very useful and applying it to everything around May of this year. Of course, I had been thinking about it recently because of your posts so I was more primed to bring it up during the podcast.
so “let alone respond to everything” doesn’t seem to make much sense here.
Fair point, this was more me noticing how much time I need to set aside for discussion here (which I think is valuable! But it is a time sink).
I think there are also likely analogies to adversarial examples for explicit human reasoning, in the form of arguments that are extremely persuasive but wrong.
Yeah, I think on balance I agree. But explicit human reasoning has generalized well to environments far outside of the ancestral environment, so this case is not as clear/obvious. (Adversarial examples in current ML models can be thought of as failures of generalization very close to the training distribution.)
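(To make “failures of generalization very close to the training distribution” a bit more concrete, here is a minimal toy sketch. It assumes scikit-learn and scipy, and uses a linear softmax classifier as a stand-in for a deep model: an FGSM-style perturbation of at most 1 per pixel, on a 0 to 16 pixel scale, is applied to each input and we simply measure how often the prediction changes. It is illustrative only, not a claim about any particular system.)

```python
# Toy sketch (illustrative, not a real attack on a deep model): apply an
# FGSM-style perturbation to a linear softmax classifier on the digits data
# and measure how often a small per-pixel change flips the prediction.
import numpy as np
from scipy.special import softmax
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression

X, y = load_digits(return_X_y=True)           # pixel values range from 0 to 16
clf = LogisticRegression(max_iter=1000).fit(X, y)
W, b = clf.coef_, clf.intercept_              # W has shape (n_classes, n_features)

# Closed-form input gradient of the cross-entropy loss for a softmax-linear
# model: dL/dx = (softmax(Wx + b) - onehot(y)) @ W.
probs = softmax(X @ W.T + b, axis=1)
grads = (probs - np.eye(W.shape[0])[y]) @ W

eps = 1.0                                     # max per-pixel change, small vs. the 0-16 scale
X_adv = np.clip(X + eps * np.sign(grads), 0, 16)
flip_rate = (clf.predict(X_adv) != clf.predict(X)).mean()
print(f"eps={eps}: prediction changed on {flip_rate:.0%} of examples")
```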
(I wonder if people have tried to apply the idea of heterogeneous systems cross-checking each other to adversarial examples in ML. Have you seen any literature on this?)
Not exactly, but some related things that I’ve picked up from talking to people but don’t have citations for:
Adversarial examples tend to transfer across different model architectures for deep learning, so ensembles mostly don’t help.
There’s been work on training a system that can detect adversarial perturbations, separately from the model that classifies images.
There has been work on using non-deep learning methods for image classification to avoid adversarial examples. I believe approaches that are similar-ish to nearest neighbors tend to be more robust. I don’t know if anyone has tried combining these with deep learning methods to get the best of both worlds.
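To make the cross-checking question from above a bit more concrete, here is a minimal toy sketch of the “heterogeneous systems checking each other” idea, assuming scikit-learn. A small MLP (a stand-in for a deep model) and a k-nearest-neighbors classifier are trained on the same data; a label is accepted only when both agree, and disagreement is treated as a signal to defer (e.g. to a human). Since adversarial examples often transfer across models, this is a sketch of the idea rather than a defense.

```python
# Toy sketch: accept a prediction only when two heterogeneous models agree,
# and defer (flag for review) when they disagree.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KNeighborsClassifier

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Heterogeneous pair: a small neural net and a nearest-neighbor classifier.
net = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500, random_state=0).fit(X_train, y_train)
knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)

net_pred = net.predict(X_test)
knn_pred = knn.predict(X_test)

# Accept a label only when both models agree; treat disagreement as a
# "possibly adversarial / out-of-distribution" flag.
agree = net_pred == knn_pred
accepted_acc = (net_pred[agree] == y_test[agree]).mean()
print(f"deferred {1 - agree.mean():.1%} of inputs; "
      f"accuracy on accepted inputs: {accepted_acc:.3f}")
```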
I guess that makes sense, but even then there should at least be an acknowledgement that the problem exists and needs to be solved in the future?
Agreed, but I think people are focusing on their own research agendas in public writing, since public writing is expensive. I wouldn’t engage as much as I do if I weren’t writing the newsletter. By default, if you agree with a post, you say nothing, and if you disagree, you leave a comment. So generally I take silence as a weak signal that people agree with the post.
This applies primarily to public writing. If you talked to researchers in person, in private, I would guess that most of them would explicitly agree that this is a problem worth thinking about. (I’m not sure they’d agree that the problem exists, more that it’s sufficiently plausible that we should think about it.)
I’m interested to know why I seem to be the first person to point out or at least publicize this seemingly obvious parallel. (Humans can be seen as a form of machine intelligence made up at least in part of a bunch of ML-like modules and “designed” with little foresight. Why wouldn’t we have ML-like safety problems?)
Beyond the fact that humans have inputs on which they behave “badly” (from the perspective of our endorsed idealizations), what is the content of the analogy? I don’t think there is too much disagreement about that basic claim (though there is disagreement about the importance/urgency of this problem relative to intent alignment); it’s something I’ve discussed and sometimes work on (mostly because it overlaps with my approach to intent alignment). But it seems like the menu of available solutions, and detailed nature of the problem, is quite different than in the case of ML security vulnerabilities. So for my part that’s why I haven’t emphasized this parallel.
Tangentially relevant restatement of my views: I agree that there exist inputs on which people behave badly, that deliberating “correctly” is hard (and much harder than manipulating values), that there may be technologies/insights/policies that would improve the chance that we deliberate correctly or ameliorate outside pressures that might corrupt our values / distort deliberation, etc. I think we do have a mild quantitative disagreement about the relative (importance)*(marginal tractability) of various problems. I remain supportive of work in this direction and will probably write about it in more detail at some point, but don’t think there is much ambiguity about what I should work on.
Beyond the fact that humans have inputs on which they behave “badly” (from the perspective of our endorsed idealizations), what is the content of the analogy?
The update I made was from “humans probably have the equivalent of software bugs, i.e., bad behavior when dealing with rare edge cases” to “humans probably only behave sensibly in a small, hard-to-define region in the space of inputs, with a lot of bad behavior all around that region”. In other words, the analogy seems to call for a much greater level of distrust in the safety of humans, and a higher estimate of how difficult it would be to solve or avoid this problem.
I don’t think there is too much disagreement about that basic claim
I haven’t seen any explicit disagreement, but have seen AI safety approaches that seem to implicitly assume that humans are safe, and silence when I point out this analogy/claim to the people behind those approaches. (Besides the public example I linked to, I think you saw a private discussion between me and another AI safety researcher where this happened. And to be clear, I’m definitely not including you personally in this group.)
I remain supportive of work in this direction and will probably write about it in more detail at some point, but don’t think there is much ambiguity about what I should work on.
I’m happy to see the first part of this statement, but the second part is a bit puzzling. Can you clarify what kind of people you think should work on this class of problems, and why you personally are not in that group? (Without that explanation, it seems to imply that people like you shouldn’t work on this class of problems, which might cover almost everyone who is potentially qualified to work on it. I would also be happy if you just stopped at the comma...)
Can you clarify what kind of people you think should work on this class of problems, and why you personally are not in that group? (Without that explanation, it seems to imply that people like you shouldn’t work on this class of problems, which might cover almost everyone who is potentially qualified to work on it. I would also be happy if you just stopped at the comma...)
I’m one of the main people pushing what I regard as the most plausible approach to intent alignment, and have done a lot of thinking about that approach / built up a lot of hard-to-transfer intuition and state. So it seems like I have a strong comparative advantage on that problem.
I certainly agree that humans might have critical failures of judgement in situations that are outside of some space of what is “comprehensible”. This is a special case of what I called “corrupt states” when talking about DRL, so I don’t feel like I have been ignoring the issue. Of course there is a lot more work to be done there (and I have some concrete research directions for how to understand this better).