First, the post did emphasize, in many places, that there may not be any real-world thing corresponding to a human concept, and therefore constructing a pointer is presumably impossible. But the “thing may not exist” problem is only one potential blocker to constructing a pointer. Just because there exists some real-world structure corresponding to a human concept, or an AI has some structure corresponding to a human concept, does not mean we have a pointer. It just means that it should be possible, in principle, to create a pointer.
So, the concept-existence problem is a strict subset of the pointer problem.
Second, there are definitely parts of a whole alignment scheme which are not the pointer problem. For instance, inner alignment, decision theory shenanigans (e.g. commitment races), and corrigibility are all orthogonal to the pointers problem (or at least the pointers-to-values problem). Constructing a feedback signal which rewards the thing we want is not the same as building an AI which does the thing we want.
Third, and most important...
My second concern is that requiring pointers to be sufficient “to get the AI to do what we mean” means that they might differ wildly depending on the motivation system of that specific AI and the details of “what we mean”. For example...
The key point is that all these examples involve solving an essentially-similar pointer problem. In example A, we need to ensure that our English-language commands are sufficient to specify everything we care about which the alien wouldn’t guess on its own; that’s the part which is a pointer problem. In example B, we need to ensure that our examples actually reward the thing we want, along all relevant dimensions, and do not allow any degrees of freedom to Goodhart; that’s the part which is a pointer problem. In example C, we need to identify which of its neurons correspond to the concepts we want, and ensure that the correspondence is robust; that’s the part which is a pointer problem. Example D is essentially the same as B, with weaker implicit priors.
The essence of each of these is “make sure we actually point to the thing we want, and not to anything else”. That’s the part which is a pointer problem.
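As a toy illustration of the example-B version of this, here is a minimal sketch. Everything in it is an assumption made up for illustration and not taken from the post: `true_value` stands in for what we actually want, `proxy_reward` for a reward signal inferred from feedback, and the small grid of candidates for the space an optimizer searches. The feedback examples only ever vary `x`, so the proxy agrees with the true objective on every example while remaining unconstrained along `y`.

```python
from itertools import product


def true_value(x, y):
    # Hypothetical "what we mean": large x, with y kept close to 0.
    return x - y * y


def proxy_reward(x, y):
    # Hypothetical learned reward: identical to true_value whenever y == 0,
    # but nothing in the feedback pinned down its behaviour along y.
    return x + 3 * y


# Feedback examples only vary x; y is always 0 (the unconstrained dimension).
examples = [(x, 0) for x in range(-2, 3)]
agrees_on_examples = all(
    proxy_reward(x, y) == true_value(x, y) for x, y in examples
)

# An optimizer searches the full space, including the unconstrained dimension.
candidates = list(product(range(-2, 3), repeat=2))
proxy_choice = max(candidates, key=lambda p: proxy_reward(*p))
intended_choice = max(candidates, key=lambda p: true_value(*p))

print("proxy matches true value on all examples:", agrees_on_examples)
print("optimizing the proxy picks:", proxy_choice)
print("optimizing the true value picks:", intended_choice)
print("true value of the proxy's pick:", true_value(*proxy_choice))
```

The proxy is "correct" on every example we gave, yet optimizing it exploits the one degree of freedom the examples left open. In this toy setting, that gap between the examples and the thing we meant is the pointer-problem part; whether the optimizer faithfully pursues the reward it was given is a separate question.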
To put it differently, the whole alignment problem is “get an AI to do what I mean”, while the pointer problem part is roughly “specify what I mean well enough that I could use the specification to get an AI to do what I mean”, assuming problems like “get AI to follow specification” can be solved.
In example B, we need to ensure that our examples actually reward the thing we want, along all relevant dimensions, and do not allow any degrees of freedom to Goodhart; that’s the part which is a pointer problem.
Why do you think we need to do this? Do you think the human reward system does that, in order to successfully imbue a person with their values?
the pointer problem part is roughly “specify what I mean well enough that I could use the specification to get an AI to do what I mean”, assuming problems like “get AI to follow specification” can be solved.
On the view of this post, is it that we would get a really good “evaluation module” for the AI to use, and the “get AI to follow specification” corresponds to “make AI want to generate plans evaluated highly by that procedure”? Or something else?
Why do you think we need to do this? Do you think the human reward system does that, in order to successfully imbue a person with their values?
In the context of the discussion with Richard, I was assuming the general model in which we want an inner optimizer’s objective to match an outer optimization objective. We can of course drop that assumption (as you usually do), but then we still need to know what objective/values we want to imbue in the final system. And whatever final objective/values we’re aiming for, we need it to actually match what we want along all the relevant dimensions, and not allow any degrees of freedom to Goodhart; that would be the corresponding problem for the sort of approach you think about.
On the view of this post, is it that we would get a really good “evaluation module” for the AI to use, and the “get AI to follow specification” corresponds to “make AI want to generate plans evaluated highly by that procedure”? Or something else?
No, I am not assuming anything that specific. The pointers problem is not meant to be a problem with one particular class of approaches to constructing aligned AI; it is meant to be a foundational problem in saying what-we-want. Insofar as we haven’t solved the pointers problem, we have not yet understood the type signature of our own values; not only do we not know what we want, we don’t even understand the type signature of “wanting things”.
Thanks for the reply. To check that I understand your position, would you agree that solving outer alignment plus solving reward tampering would solve the pointers problem in the context of machine learning?
Broadly speaking, I think our disagreement here is closely related to one we’ve discussed before, about how much sense it makes to talk about outer alignment in isolation (and also about your definition of inner alignment), so I probably won’t pursue this further.
Yeah, I wouldn’t even include reward tampering. Outer alignment, as I think about it, is mostly the pointer problem, and the (values) pointer problem is a subset of outer alignment. (Though e.g. Evan would define it differently.)