Paul defines a subproblem of alignment: “alignment with low stakes.” This problem makes one relaxation relative to the full problem: we never have to care about any single decision, or more formally, no small set of actions can constitute a trap (an irreversible catastrophe).
Another way to put it is that distributional shift is temporarily kept within safe bounds, so mistakes can be corrected after the fact.
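To make “no small set of actions can be a trap” a bit more concrete, here is one hedged sketch of the condition (my own paraphrase and notation, not Paul’s exact formalization):

$$
V^*_{t+k} \;\ge\; V^*_t - \epsilon \quad \text{for every window of } k \text{ consecutive decisions,}
$$

where $V^*_t$ is the best expected utility still attainable at time $t$, $k$ is a small number of decisions, and $\epsilon$ is an acceptably small loss. Under an assumption like this, no short run of bad decisions forecloses much value, so errors can be caught and corrected by later oversight.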
I like this relaxation of the problem, because it gets at a realistic outcome we may be able to reach, and in particular it lets people work on it without needing much context.
However, the fact that inner alignment doesn’t need to be solved in this setting may be a problem, depending on your beliefs about the relative difficulty of outer versus inner alignment.
I’d give it a +3 in my opinion.
In retrospect, I’ve become more optimistic about alignment. I now think there is considerably more evidence that the low-stakes version of the alignment problem is quite easy to solve for AIs, and while I don’t expect anything close to a formal proof (because of edge cases), I now think the techniques that have been developed are enough for us to declare victory on the low-stakes alignment problem. As a consequence, I think the outer alignment problem can be considered mostly solved at this point (given Paul’s definition of low-stakes alignment).
The fact that AI risk scenarios now lean more heavily than they used to on deceptively aligned AI changing the world in abrupt and irreversible ways corroborates this.