Risks from Approximate Value Learning
Solving the value learning problem is (IMO) the key technical challenge for AI safety.
How good or bad is an approximate solution?
EDIT for clarity:
By “approximate value learning” I mean something which does a good (but suboptimal from the perspective of safety) job of learning values. So it may do a good enough job of learning values to behave well most of the time, and be useful for solving tasks, but it still has a non-trivial chance of developing dangerous instrumental goals, and is hence an existential risk (x-risk).
1. How would developing good approximate value learning algorithms affect AI research/deployment?
It would enable more AI applications. For instance, many robotics tasks, such as “smooth grasping motion”, are difficult to manually specify a utility function for. This could have positive or negative effects:
* It could encourage more mainstream AI researchers to work on value-learning.
* It could encourage more mainstream AI developers to use reinforcement learning to solve tasks for which “good-enough” utility functions can be learned.
Consider a value-learning algorithm which is “good-enough” to learn how to perform complicated, ill-specified tasks (e.g. folding a towel). But it’s still not quite perfect, and so every second, there is a 1⁄100,000,000 chance that it decides to take over the world. A robot using this algorithm would likely pass a year-long series of safety tests (with a probability of roughly 73%, if each second is independent) and seem like a viable product, but would be expected to decide to take over the world in ~3 years (100,000,000 seconds is about 3.2 years).
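The arithmetic behind that example can be checked with a short sketch. It assumes each second is an independent trial with the same failure chance (the post implies this but does not state it), and it uses a 365.25-day year; the function and constant names are made up for illustration.

```python
import math

# Per-second chance of catastrophic failure, taken from the example above.
P_PER_SECOND = 1 / 100_000_000
# Assumption: a 365.25-day year.
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60


def survival_probability(years, p=P_PER_SECOND):
    """Chance of no failure over `years`, assuming independent seconds.

    Computes (1 - p) ** n via log1p for numerical stability with tiny p.
    """
    n_seconds = years * SECONDS_PER_YEAR
    return math.exp(n_seconds * math.log1p(-p))


def expected_years_to_failure(p=P_PER_SECOND):
    """Mean waiting time of a geometric distribution (1 / p seconds), in years."""
    return (1 / p) / SECONDS_PER_YEAR


pass_one_year = survival_probability(1.0)
mean_years = expected_years_to_failure()
print(f"P(pass a one-year test): {pass_one_year:.2f}")
print(f"Expected years until failure: {mean_years:.2f}")
```

Under these assumptions the robot passes a one-year test about 73% of the time, while the expected time to failure is a little over 3 years, which matches the rough figures in the example.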
Without good-enough value learning, these tasks might just not be solved, or might be solved with safer approaches involving more engineering and less performance, e.g. using a collection of supervised learning modules and hand-crafted interfaces/heuristics.
2. What would a partially aligned AI do?
An AI programmed with an approximately correct value function might fail:
* dramatically (see, e.g., Eliezer on AIs “tiling the solar system with tiny smiley faces”)
* relatively benignly (see, e.g., my example of an AI that doesn’t understand gustatory pleasure)
Perhaps a more significant example of benign partial-alignment would be an AI that has not learned all human values, but is corrigible and handles its uncertainty about its utility in a desirable way.