As I see it there are mainly two hard questions in alignment.
One is: how do you map human preferences in such a way that you can ask a machine to satisfy them? I don’t see any reason why this would be impossible for a superintelligent being to figure out. It is somewhat similar (though obviously not identical) to asking a human to figure out how to make fish happy.
The second is: how do you get a sufficiently intelligent machine to do anything whatsoever without doing a lot of terrible stuff you didn’t want as a side effect? As Yudkowsky says:
The way I sometimes put it is that I think that almost all of the difficulty of the alignment problem is contained in aligning an AI on the task, “Make two strawberries identical down to the cellular (but not molecular) level.” Where I give this particular task because it is difficult enough to force the AI to invent new technology. It has to invent its own biotechnology, “Make two identical strawberries down to the cellular level.” It has to be quite sophisticated biotechnology, but at the same time, very clearly something that’s physically possible.
This does not sound like a deep moral question. It does not sound like a trolley problem. It does not sound like it gets into deep issues of human flourishing. But I think that most of the difficulty is already contained in, “Put two identical strawberries on a plate without destroying the whole damned universe.” There’s already this whole list of ways that it is more convenient to build the technology for the strawberries if you build your own superintelligences in the environment, and you prevent yourself from being shut down, or you build giant fortresses around the strawberries, to drive the probability to as close to 1 as possible that the strawberries got on the plate.
When I consider whether this implied desideratum is even possible, I just note that I and many others continue to not inject heroin. In fact, I almost never seem to act in ways that look much like driving the probability of any particular outcome as close to 1 as possible. So clearly it’s possible to embed some kind of motivational wiring into an intelligent being, such that the intelligent being achieves all sorts of interesting things without doing too many terrible things as a side effect. If I had to guess, I would say that the way we go about this is something like: wanting a bunch of different, largely incommensurable things at the same time, some of which are very abstract, some of which are mutually contradictory, and somehow all these different preferences keep the whole system mostly in balance most of the time. In other words, it’s inelegant and messy and not obvious how you would translate it into code, but it is there, and it seems to basically work. Or, at least, I think it works as well as we can expect, and serves as a limiting case.
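To make that intuition a bit more concrete, here is a toy sketch of what I mean by incommensurable preferences keeping each other in check. Everything in it is invented for illustration: the objective names, the veto threshold, and the veto-then-maximin rule are my own stand-ins, not an actual alignment proposal.

```python
from dataclasses import dataclass

@dataclass
class Action:
    name: str
    scores: dict[str, float]  # objective name -> how well the action satisfies it, in [0, 1]

# Hypothetical objective names, purely for illustration.
OBJECTIVES = ["task_progress", "low_side_effects", "defer_to_shutdown", "conserve_resources"]
VETO_THRESHOLD = 0.2  # any single objective scoring below this rules the action out

def choose(actions: list[Action]) -> Action | None:
    """Pick an action no single preference finds unacceptable, then maximin over the rest."""
    admissible = [
        a for a in actions
        if all(a.scores.get(o, 0.0) >= VETO_THRESHOLD for o in OBJECTIVES)
    ]
    if not admissible:
        return None  # do nothing rather than trample one of the preferences
    # Maximin: judge each action by its worst objective, so no single drive gets
    # optimized toward probability ~1 at the expense of the others.
    return max(admissible, key=lambda a: min(a.scores.get(o, 0.0) for o in OBJECTIVES))

candidates = [
    Action("build a fortress around the strawberries",
           {"task_progress": 0.99, "low_side_effects": 0.05,
            "defer_to_shutdown": 0.10, "conserve_resources": 0.10}),
    Action("pick two reasonably similar strawberries and stop",
           {"task_progress": 0.70, "low_side_effects": 0.90,
            "defer_to_shutdown": 0.95, "conserve_resources": 0.90}),
]
print(choose(candidates).name)  # -> "pick two reasonably similar strawberries and stop"
```

The point of the veto-then-maximin structure is just that no single objective ever gets driven toward 1; whether anything like this survives contact with a genuinely capable optimizer is, of course, exactly the open question.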