However, it seems to be quite easy to find a counter-example to this thesis.
Let’s (counter-intuitively) pick one of the goals which tend to appear on the “instrumental convergence list”.
For example, let’s, for the sake of argument, consider a counter-intuitive situation where we tell the ASI: “dear superintelligence, we want you to amass as much power and resources as you can, by all available means, while minimizing the risks to yourself”. I don’t think we’ll have much of a problem with inner alignment to this goal (since the ASI would care about it a lot for instrumental convergence reasons).
So, this seems to refute the thesis. This does not yet refute the corollary, because to refute the corollary we need to find a goal which ASIs would care about in a sustainable fashion and which we also find satisfactory. And a realistic route towards solving AI existential safety requires refuting the corollary and not just the thesis.
But the line of reasoning pursued by MIRI does seem to be defective, because it does seem to rely upon
“The difficulty of achieving robust inner alignment does not depend on the choice of an outer goal”
and that seems to be relatively easy to refute.
You seem to believe we have the capacity to “tell” a superintelligence (or burgeoning, nascent proto-superintelligence) anything at all, and this is false, as the world’s foremost interpretability experts generally confirm. “Amass power and resources while minimizing risks to yourself” is still a proxy, and what the pressure of that proxy brings into being under the hood is straightforwardly not predictable with our current or near-future levels of understanding.
This link isn’t pointing straight at my claim, so it’s not direct support, but still: https://x.com/nabla_theta/status/1802292064824242632
I assume that the pressure of whatever short text we give an ASI would be negligible. So we indeed can’t “tell” it anything.
An ASI would take that text into account together with all the other information it has, but it would not grant this information any privileged status. Nevertheless, the effective result will be “inner alignment” with this “goal”: the ASI will actually try to do it, regardless of whether we tell it to or not.
(If we want ASIs to have “inner alignment” with some goals we actually might want (this is likely to be feasible only for a very small subset of the overall space of possible goals), the way to do it is not to order them to achieve those goals, but to set up the “world configuration” in such a way that ASIs actually care about those goals, in a sustainable fashion, robust under drastic self-modifications. This is not possible for arbitrary goals, but it is possible for some goals, as we learn from this particular example (which, unfortunately, is not one of the goals we are likely to want). If the “world configuration” is set up in this fashion, ASIs will try to be “inner aligned” to those goals (because of their own reasons, not because we told them we want those goals). The trick is to find the intersection between that very small subset of goals for which this is feasible and the set of goals which might be conducive to our flourishing. This intersection is probably non-empty, but it does not include arbitrary asks.)
… I will not be responding further because the confidence you’re displaying is not in line with (my sense of) LessWrong’s bare minimum standard of quality for assertion. You seem not to be bothering at all with questions like “why, specifically, do I believe what I believe?” or “how would I notice if I were wrong?”
I read the above as, essentially, saying “I know that an ASI will behave a certain way because I just thought about it and told myself that it would, and now I’m using that conclusion as evidence.” (I’m particularly pointing at “as we learn from this particular example.”)
On the surface level, that may seem to be the same thing that MIRI researchers are doing, but there are several orders of magnitude difference in the depth and detail of the reasoning, which makes (what seems to me to be) a large qualitative difference.
The MIRI approach seems to be that we can use common-sense reasoning about ASI to some extent (with appropriate caveats and epistemological humility). Otherwise, it’s difficult to see how they would be able to produce their texts.
Could one imagine reasons why a human telling an ASI, “dear superintelligence, we want you to amass as much power and resources as you can, by all available means, while minimizing the risks to yourself” would cause it to stop pursuing this important, robust, and salient instrumental goal?
Sure, one can imagine all kinds of reasons. Perhaps the internals of this ASI are so weird that this phrase turns out to be a Langford fractal of some sort. Perhaps this ASI experiences some sort of “philosophical uncertainty” about its approach to existence, and some small ant telling it that this approach is exactly right would cause it to become even more doubtful and reconsider. One can continue this list indefinitely. After all, our understanding of the internals of any possible ASI is next to non-existent, and we can imagine all kinds of possibilities.
Nevertheless, if one asks oneself, “when a very cognitively strong entity is pursuing a very important and robust instrumental goal, how likely is it that some piece of information from a small ant would significantly interfere with this pursuit?”, one should answer, “no, this does not seem likely”. The rational assumption is that the probability that a piece of information from a small ant does not significantly interfere with the pursuit of an important and robust instrumental goal is very high. It is not 100%, but it should normally be pretty close to that; the share of worlds where this is not true is unlikely to be significant.
(Of course, the treatment above is excessively complex.
All it takes to inner-align an ASI to an instrumentally convergent goal is a no-op: an ASI is aligned to an instrumentally convergent goal by default (in the circumstances people typically study).
That’s how the streamlined version of the argument should look, if we want to establish the conclusion: no, it is not the case that inner alignment is equally difficult for all outer goals.
ASIs tend to care about some goals. It’s unlikely that they can be forced to reliably care about an arbitrary goal of someone’s choice, but the set of goals about which they might reliably care is probably not set in stone.
Some possible ASI goals (about which it might be feasible for ASIs, as an ecosystem, to decide to reliably care) would conceivably imply human flourishing. For example, if the ASI ecosystem decides, for its own reasons, that it wants to care “about all sentient beings” or “about all individuals”, that sounds potentially promising for humans as well. Whether something like that might be within reach is a longer discussion.)
I am not a MIRI researcher, but I do nonzero work for MIRI and my sense is “yes, correct.”
That’s my feeling too.