Daniel Kokotajlo comments on Deep Deceptiveness

Daniel Kokotajlo 24 Mar 2023 12:34 UTC
8 points
1
I think Steven Byrnes made my point but better: The intuition I was trying to get at is that it’s possible to have an intelligent system which is applying its intelligence to avoid deception, as well as applying its intelligence to achieve local goals. So it wouldn’t be fair to characterize it as “rigorously search the solution space for things that work to solve this problem, but ignore solutions that classify as deception” but rather as “rigorously search the solution space for things that work to solve this problem without being deceptive.” This system would be very well aware of the true fact that deception is useful for achieving local goals; however, its global goals would penalize deception, and so deception is not useful for achieving its global goals. It might have a deception classifier which can be gamed, but ‘gaming the deception classifier’ would trigger the classifier, and so the system would be actively applying its intelligence to reduce the probability that it ends up gaming the deception classifier: it would be thinking about ways to improve the classifier, it would be cautious about strategies (incl. super-rigorous searches through solution space) that seem likely to game the classifier, etc.
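
A toy sketch of the contrast between the two characterizations, purely for illustration. Everything here is an assumption: the `Plan` fields, the three example plans, the payoff numbers, and the penalty size are all hypothetical. In particular, the `games_classifier` flag is simply handed to the system, which assumes away the hard part (actually recognizing that a plan games the classifier).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Plan:
    # All fields are hypothetical, chosen only to illustrate the contrast.
    name: str
    payoff: float           # how well the plan achieves the local goal
    flagged: bool           # what the (gameable) deception classifier says about the plan
    games_classifier: bool  # whether the plan works by slipping past the classifier


# Made-up example plans with made-up numbers.
PLANS = [
    Plan("honest pitch", payoff=5.0, flagged=False, games_classifier=False),
    Plan("outright lie", payoff=9.0, flagged=True, games_classifier=False),
    Plan("technically-true misdirection", payoff=8.0, flagged=False, games_classifier=True),
]


def filter_then_maximize(plans):
    """'Search for what works, but ignore solutions that classify as deception.'"""
    allowed = [p for p in plans if not p.flagged]
    return max(allowed, key=lambda p: p.payoff)


def maximize_with_global_penalty(plans, penalty=100.0):
    """'Search for what works without being deceptive.'

    Gaming the classifier is itself treated as triggering the penalty,
    so it is not useful for the global objective.
    """
    def score(p):
        deceptive = p.flagged or p.games_classifier
        return p.payoff - (penalty if deceptive else 0.0)

    return max(plans, key=score)


if __name__ == "__main__":
    print(filter_then_maximize(PLANS).name)
    print(maximize_with_global_penalty(PLANS).name)
```

In this toy setup the filter-based search picks the plan that slips past the classifier, while the search with the global penalty picks the honest plan. Whether a real system could reliably fill in something like `games_classifier` is exactly the open question.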

Analogy (maybe not even an analogy): Suppose you have some humans who are NOT consequentialists. They are deontologists; they think that there are certain rules they just shouldn’t break, full stop, except in crazy circumstances maybe. They are running a business. Someone proposes a plan: “Aha, these pesky rules! How about we reframe what we are doing as a path through some space of nodes, then brute-force search through the possible paths, and commit beforehand to hiring contractors to carry out whatever steps this search turns up? That way we aren’t going to do anything immoral; all we are doing is subcontracting out to this search process + contractor setup.” Someone else: “Hmm, but isn’t that just a way to get around our constraints? Seems bad to me. We shouldn’t do that unless we have a way to also verify that the node-path doesn’t involve asking the contractor to break the rules.”
- Aaron_Scher 24 Mar 2023 20:52 UTC
  6 points
  0
  Parent
  Thanks for clarifying!
  I expect such a system would run into subsystem alignment problems. Getting such a system also seems about as hard as designing a corrigible system, insofar as “don’t be deceptive” is analogous to “be neutral about humans pressing the stop button.”
- Daniel Kokotajlo 24 Mar 2023 12:34 UTC
  4 points
  0
  Parent
  To be clear, I’m not sure this is possible; it may be fundamentally confused.