Beyond the fact that humans have inputs on which they behave “badly” (from the perspective of our endorsed idealizations), what is the content of the analogy?
The update I made was from “humans probably have the equivalent of software bugs, i.e., bad behavior when dealing with rare edge cases” to “humans probably only behave sensibly in a small, hard-to-define region in the space of inputs, with a lot of bad behavior all around that region”. In other words, the analogy seems to call for a much greater level of distrust in the safety of humans, and a higher estimate of how difficult it would be to solve or avoid this problem.
I don’t think there is too much disagreement about that basic claim.
I haven’t seen any explicit disagreement, but have seen AI safety approaches that seem to implicitly assume that humans are safe, and silence when I point out this analogy/claim to the people behind those approaches. (Besides the public example I linked to, I think you saw a private discussion between me and another AI safety researcher where this happened. And to be clear, I’m definitely not including you personally in this group.)
I remain supportive of work in this direction and will probably write about it in more detail at some point, but I don’t think there is much ambiguity about what I should work on.
I’m happy to see the first part of this statement, but the second part is a bit puzzling. Can you clarify what kind of people you think should work on this class of problems, and why you personally are not in that group? (Without that explanation, it seems to imply that people like you shouldn’t work on this class of problems, which might cover almost everyone who is potentially qualified to work on it. I would also be happy if you just stopped at the comma...)
Can you clarify what kind of people you think should work on this class of problems, and why you personally are not in that group? (Without that explanation, it seems to imply that people like you shouldn’t work on this class of problems, which might cover almost everyone who is potentially qualified to work on it. I would also be happy if you just stopped at the comma...)
I’m one of the main people pushing what I regard as the most plausible approach to intent alignment, and have done a lot of thinking about that approach / built up a lot of hard-to-transfer intuition and state. So it seems like I have a strong comparative advantage on that problem.