The case for value learning

This post is mainly fumbling around trying to define a reasonable research direction for contributing to FAI research. I’ve found that laying out what success looks like in the greatest possible detail is a personal motivational necessity. Criticism is strongly encouraged.

The power and intelligence of machines have been gradually and consistently increasing over time, and it seems likely that at some point machine intelligence will surpass the power and intelligence of humans. Before that point occurs, it is important that humanity manages to direct these powerful optimizers towards a target that humans find desirable.

This is difficult because humans as a general rule have a fairly fuzzy conception of their own values, and it seems unlikely that the millennia of argument surrounding what precisely constitutes eudaimonia are going to be satisfactorily wrapped up before the machines get smart. The most obvious solution is to try to leverage some of the novel intelligence of the machines to help resolve the issue before it is too late.

Lots of people regard using a machine to help you understand human values as a chicken-and-egg problem. They think that a machine capable of helping us understand what humans value must also necessarily be smart enough to do AI programming, manipulate humans, and generally take over the world. I am not sure that I fully understand why people believe this.

Part of it seems to be inherent in the idea of AGI, or artificial general intelligence. There seems to be a belief that once an AI crosses a certain threshold of smarts, it will be capable of understanding literally everything. I have even heard people describe certain problems as “AI-complete”, making an explicit comparison to ideas like Turing-completeness. If a Turing machine is a universal computer, why wouldn’t there also be a universal intelligence?

To address the question of universality, we need to make a distinction between intelligence and problem-solving ability. Problem-solving ability is typically described as a function of both intelligence and resources, and just throwing resources at a problem seems to be capable of substituting for a lot of cleverness. But if problem-solving ability is tied to resources, then intelligent agents are in some respects very different from Turing machines, since Turing machines are all explicitly operating with an infinite amount of tape. Many of the existential risk scenarios revolve around the idea of the intelligence explosion, in which an AI starts to do things that increase its own intelligence so quickly that these resource restrictions become irrelevant. This is conceptually clean, in the same way that Turing machines are, but navigating these hard take-off scenarios well implies getting things absolutely right the first time, which seems like a less than ideal project requirement.

If an AI that knows a lot about AI results in an intelligence explosion, but we also want an AI that’s smart enough to understand human values, is it possible to create an AI that can understand human values, but not AI programming? In principle it seems like this should be possible. Resources useful for understanding human values don’t necessarily translate into resources useful for understanding AI programming. The history of AI development is full of tasks that were supposed to be solvable only by a machine smart enough to possess general intelligence, where significant progress was made in understanding and pre-digesting the task, allowing problems in the domain to be solved by much less intelligent AIs.

If this is possible, then the best route forward is focusing on value learning. The path to victory is working on building limited AI systems that are capable of learning and understanding human values, and then disseminating that information. This effectively softens the AI take-off curve in the most useful possible way, and allows us to practice building AIs with human values before handing them too much power. Even if AI research is easy compared to the complexity of human values, a specialist AI might find thinking about human values easier than reprogramming itself, in the same way that humans find complicated visual/verbal tasks much easier than computationally far simpler tasks like arithmetic. The human intelligence learning algorithm is trained on visual object recognition and verbal memory tasks, and it uses those tools to perform addition. A similarly specialized AI might be capable of rapidly understanding human values, but find AI programming as difficult as humans find determining whether 1007 is prime. As an additional incentive, value learning has an enormous potential for improving human rationality and the effectiveness of human institutions even without the creation of a superintelligence. A system that helped people better understand the mapping between values and actions would be a potent weapon in the struggle with Moloch.
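As a concrete footnote to the 1007 example, here is a minimal Python sketch of trial division, the most basic way a machine settles that kind of question. The function names are illustrative assumptions, not part of any existing library. The point is only that a check most humans find tedious takes a machine a handful of remainder operations.

```python
def smallest_nontrivial_factor(n: int):
    """Return the smallest divisor d of n with 1 < d < n, or None if there is none."""
    d = 2
    # Any composite n has a divisor no larger than its square root.
    while d * d <= n:
        if n % d == 0:
            return d
        d += 1
    return None


def is_prime(n: int) -> bool:
    """A number is prime if it is at least 2 and has no nontrivial factor."""
    return n >= 2 and smallest_nontrivial_factor(n) is None


factor = smallest_nontrivial_factor(1007)
print(factor, 1007 // factor)  # 19 53, so 1007 = 19 * 53 is not prime
```

Nothing here bears on how an AI would actually represent arithmetic or values; it only illustrates how lopsided the human difficulty ranking of tasks can be.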

Building a relatively unintelligent AI and giving it lots of human values resources to help it solve the human values problem seems like a reasonable course of action, if it’s possible. There are some difficulties with this approach. One of these difficulties is that after a certain point, no amount of additional resources compensates for a lack of intelligence. A simple reflex agent like a thermostat doesn’t learn from data, and throwing resources at it won’t improve its performance. To some extent you can make up for intelligence with data, but only to some extent. An AI capable of learning human values is going to be capable of learning lots of other things. It’s going to need to build models of the world, and it’s going to have to have internal feedback mechanisms to correct and refine those models.

If the plan is to create an AI and primarily feed it data on how to understand human values, and not feed it data on how to do AI programming and self-modify, that plan is complicated by the fact that inasmuch as the AI is capable of self-observation, it has access to sophisticated AI programming. I’m not clear on how much this access really means. My own introspection hasn’t allowed me anything like hardware-level access to my brain. While it seems possible to create an AI that can refactor its own code or create successors, it isn’t obvious that AIs created for other purposes will have this ability by accident.

This discussion focuses on intelligence amplification as the example path to superintelligence, but other paths do exist. An AI with a sophisticated enough world model, even if somehow prevented from understanding AI, could still potentially increase its own power to threatening levels. Value learning is only the optimal way forward if human values are emergent, that is, if they can be understood without a molecular-level model of humans and the human environment. If the only way to understand human values is with physics, then “human values” isn’t a meaningful category of knowledge with its own structure, and there is no way to create a machine that is capable of understanding human values, but not capable of taking over the world.

In the fairy-tale version of this story, a research community focused on value learning manages to use specialized learning software to make the human value program portable, instead of only running on human hardware. Having a large number of humans involved in the process helps us avoid lots of potential pitfalls, especially the risk of the research overfitting to the values of the researchers via the typical mind fallacy. Partially automating introspection helps raise the sanity waterline. Humans practice coding the human value program, in whole or in part, into different automated systems. Once we’re comfortable that our self-driving cars have a good grasp on the trolley problem, we use that experience to safely pursue higher-risk research on recursive systems likely to start an intelligence explosion. FAI gets created and everyone lives happily ever after.

Whether value learning is worth focusing on seems to depend on the likelihood of the following claims. Please share your probability estimates (and explanations) with me because I need data points that originated outside of my own head.

I can’t figure out how to include working polls in a post, but there should be a working version in the comments.
  1. There is regular structure in human values that can be learned without requiring detailed knowledge of physics, anatomy, or AI programming.

  2. Human values are so fragile that it would require a superintelligence to capture them with anything close to adequate fidelity.

  3. Humans are capable of pre-digesting parts of the human values problem domain.

  4. Successful techniques for value discovery of non-humans (e.g. artificial agents, non-human animals, human institutions) would meaningfully translate into tools for learning human values.

  5. Value learning isn’t being adequately researched by commercial interests who want to use it to sell you things.

  6. Practice teaching non-superintelligent machines to respect human values will improve our ability to specify a Friendly utility function for any potential superintelligence.

  7. Something other than AI will cause human extinction sometime in the next 100 years.

  8. All other things being equal, an additional researcher working on value learning is more valuable than one working on corrigibility, Vingean reflection, or some other portion of the FAI problem.