Reading this, I come away with several distinct objections, which I feel make the point that AI control is hard while offering no practical short-term tools.
The first objection is that it seems impossible to determine, from the perspective of system 1, whether system 2 is working in a friendly way or not. In particular, it seems like you are suggesting that a friendly AI system is likely to deceive us for our own benefit. However, this makes it harder to distinguish “friendly” from “unfriendly” AI systems! The core problem with friendliness, I think, is that we do not actually know our own values. In order to design “friendly” systems, we need reliable signals of friendliness that are easier to understand and measure. If your point holds and is likely to be true of AI systems, then it takes away the tool of “honesty,” which is relatively easy to understand and verify.
The second objection is that in the evolutionary case, there is a necessary slowness to the iteration process. Changes in brain architecture must be very slow, and changes in culture can be faster, but not so fast that they happen many times a generation. This means there is a reasonable amount of time to test many use cases and to see successes and failures, even of low-probability events, before an adaptation is adopted. While technologies are also adopted gradually in the AI case, ugly edge cases commonly occur and have to be fixed post hoc, even when the systems they are edge cases of have been reliable for years or across millions of test cases in advance. The entire problem of friendliness is to identify these unknown unknowns, and the core mechanism solving that in the human case seems to be slow iteration speed, which is probably not viable for AI due to competitive pressure.
Third, system 1 had essentially no active role in shaping system 2. Humans did not reflectively sit down and decide to become intelligent. In particular, that means many of the details of this solution aren’t accessible to us. We don’t have a textbook written by monkeys explaining what makes human brains different and when humans and monkeys had good and bad times together. In fact, our knowledge of the human brain is extremely limited, to the point where we don’t have better ways of talking about the distinctions made in this post than saying “system 1” and “system 2” and hand-waving over the fact that these aren’t really distinct processes!
Overall, the impression I get is of a list of reasons that, even if things seem to be going poorly without a central plan, there is a possibility that this will work out in the end. I don’t think this is bad reasoning, or even particularly unlikely. However, I also don’t think it’s highly certain, nor do I think it’s surprising. The problem of AI safety is to have concrete tools that can raise our confidence to much higher levels about the behavior of designed systems before we see those systems work in practice, on data that may be significantly different from what we have access to. I’m not sure how this metaphor helps with those goals, and I don’t find myself adjusting my prior beliefs at all (since I’ve always thought there was some significant chance that things would work out okay on their own; just not a high enough chance).