I would love to see more people write this sort of thing. It seems very high-value: for newcomers to orient, for existing researchers to see how people understand/misunderstand their arguments, for the writers themselves to accelerate the process of orienting, and generally for people to understand the generators behind each other's work and communicate better.
Some notes...
> The best way to solve this problem is to specify a utility function that, for the most part, avoids instrumentally convergent goals (power seeking, preventing being turned off).
I’m not sure that a good formulation of corrigibility would look like a utility function at all.
Same with the “crux” later on:
> The best way to make progress on alignment is to write down a utility function for an AI that:
>
> - Generalizes
> - Is robust to large optimization pressure
> - Specifies precisely what we want
Outer alignment need not look like a utility function at all. (There are good arguments that it will behave like a utility function, but that should probably be a derived fact rather than an assumed fact, at the very least.) And even if it is a utility function, there’s a classic failure mode in which someone says “well, it should maximize utility, so we want a function of the form u(X)...” and doesn’t notice that most of the interesting work is in figuring out what the utility function is over (i.e. what “X” is), not in the utility function itself.
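To make that last point concrete (a toy illustration of mine, not from the post): consider two “count the diamonds” utility functions that look identical on paper but are defined over different domains,

$$u_{\text{pixels}}:\ \{\text{camera images}\}\to\mathbb{R}, \qquad u_{\text{world}}:\ \{\text{world states}\}\to\mathbb{R}.$$

An optimizer for $u_{\text{pixels}}$ is rewarded for taping a photo of diamonds over the camera; an optimizer for $u_{\text{world}}$ is not. Nearly all of the alignment-relevant content is in the choice of domain $X$, not in the formula for $u$.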
Also, while we’re talking about a “crux”...
> In each section, we’ve laid out some cruxes, which are statements that support that frame on the core of the alignment problem. These cruxes are not necessary or sufficient conditions for a problem to be central.
Terminological nitpick: the term “crux” was introduced to mean something such that, if you changed your mind about that thing, it would also probably change your mind about whatever thing we’re talking about (in this case, centrality of a problem). A crux is not just a statement which supports a frame.
> We can get a soft-optimization proposal that works to solve this problem (instead of having the AGI hard-optimize something safe).
Not sure if this is already something you know, but “soft-optimization” (a.k.a. mild optimization) is a thing we know how to do. The catch, of course, is that mild optimization can only do things which are “not very hard,” in the sense that they don’t require very much optimization pressure.
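For concreteness, here’s a minimal sketch of one standard mild optimizer, a quantilizer: instead of taking the argmax action, it samples from the top-q utility slice of a fixed base distribution, which caps how far its behavior can stray from that base distribution. The action set, utilities, and base probabilities below are toy placeholders I made up.

```python
import numpy as np

def quantilize(actions, utility, base_probs, q=0.1, rng=None):
    """Mild optimization via quantilization (sketch).

    Instead of argmaxing utility, sample from the base distribution
    conditioned on being in (roughly) the top-q fraction of base-probability
    mass when actions are ranked by utility. Nothing gets upweighted by
    more than ~1/q relative to the base distribution.
    """
    rng = np.random.default_rng() if rng is None else rng
    order = np.argsort([-utility(a) for a in actions])  # best actions first

    # Collect actions (in utility order) until we've covered q of the base
    # distribution's probability mass; this is the "top-q slice".
    # (The slice may slightly overshoot q at the boundary; fine for a sketch.)
    top, top_probs, mass = [], [], 0.0
    for i in order:
        top.append(actions[i])
        top_probs.append(base_probs[i])
        mass += base_probs[i]
        if mass >= q:
            break

    # Sample within the slice proportionally to the base distribution.
    top_probs = np.array(top_probs) / np.sum(top_probs)
    return top[rng.choice(len(top), p=top_probs)]


# Toy usage: the extreme action has the highest utility but almost no
# base-probability mass, so the quantilizer almost never picks it.
actions = ["do nothing", "tidy the room", "disassemble the house for parts"]
base_probs = [0.60, 0.39, 0.01]  # what a "normal" policy would do
utility = {"do nothing": 0, "tidy the room": 5,
           "disassemble the house for parts": 100}.get
print(quantilize(actions, utility, base_probs, q=0.2))  # usually "tidy the room"
```

The q knob is exactly the “not very hard” tradeoff: as q → 0 this collapses toward hard argmax, and as q → 1 it just imitates the base distribution. Tasks that require putting lots of probability on actions the base distribution considers extremely unlikely are out of reach by construction.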
> This problem being tractable relies on some form of the Natural Abstractions Hypothesis.
>
> There is, ultimately, going to end up being a thing like “Human Values,” that can be pointed to and holds up under strong optimization pressure.
Note that whether human values or corrigibility or “what I mean” or some other direct alignment target is a natural abstraction is not strictly part of the pointers problem. The pointers problem is about pointing to latent concepts in general; whether a given system has an internal latent variable corresponding to “human values” specifically is a separate question.
Also, while tractability of the pointers problem does depend heavily on NAH, note that it’s still a problem (and probably an even more core and urgent one!) even if NAH turns out to be relatively weak.
Overall, the problem-summaries were quite good IMO.
Subtle point: the key question is not how certain we are, but how certain the predictor system (e.g. GPT) is. Presumably if it’s able to generalize that far out of distribution at all, it’s likely to have enough understanding to make a pretty high-confidence guess as to whether AGI will take over or not. We humans might not know the answer very confidently, but an AI capable enough to apply the human mimicry strategy usefully is more likely to know the answer very confidently, whatever that answer is.