Training convergence thesis: a wide range of environments in which we could train an AGI will lead to the development of goal-directed behaviour aimed towards certain convergent goals.
I think this is important and I’ve been thinking about it for a while (in fact, it seems quite similar to a distinction I made in a comment on your myopic training post). I’m glad to see a post giving this a crisp handle.
But I think that the ‘training convergence thesis’ is a bad name, and I hope it doesn’t stick (just as I’m pushing to move away from ‘instrumental convergence’ towards ‘robust instrumentality’). There are many things which may converge over the course of training; although it’s clear to us in the context of this post, to an outsider, it’s not that clear what ‘training convergence’ refers to.
Furthermore, ‘convergence’ in the training context may imply that these instrumental incentives tend to stick in the limit of training, which may not be true and distracts from the substance of the claim. Perhaps “robust instrumentality thesis (training)” (versus “robust instrumentality thesis (optimality)” or “robust finality thesis (training)”)?
I like this decomposition as well. I recently wrote about fragility of value from a similar perspective, although I think fragility of value extends beyond AI alignment (you may already agree with that).
Ah, cool; I like the way you express it in the short form! I’ve been looking into the concept of structuralism in evolutionary biology, which is the belief that evolution is strongly guided by “structural design principles”. You might find the analogy interesting.
One quibble: in your comment on my previous post, you distinguished between optimal policies versus the policies that we’re actually likely to train. But this isn’t a component of my distinction—in both cases I’m talking about policies which actually arise from training. My point is that there are two different ways in which we might get “learned policies which pursue convergent instrumental subgoals”—they might do so for instrumental reasons, or for final reasons. (I guess this is what you had in mind, but wanted to clarify since I originally interpreted your comment as only talking about the optimality/practice distinction.)
On terminology, would you prefer the “training goal convergence thesis”? I think “robust” is just as misleading a term as “convergence”, in that neither are usually defined in terms of what happens when you train in many different environments. And so, given switching costs, I think it’s fine to keep talking about instrumental convergence.
One quibble: in your comment on my previous post, you distinguished between optimal policies versus the policies that we’re actually likely to train. But this isn’t a component of my distinction—in both cases I’m talking about policies which actually arise from training.
Right—I was pointing at the similarity in that both of our distinctions involve some aspect of training, which breaks from the tradition of not really considering training’s influence on robust instrumentality. “Quite similar” was poor phrasing on my part, because I agree that our two distinctions are materially different.
On terminology, would you prefer the “training goal convergence thesis”?
I think that “training goal convergence thesis” is way better, and I like how it accommodates dual meanings: the “goal” may be an instrumental or a final goal.
I think “robust” is just as misleading a term as “convergence”, in that neither are usually defined in terms of what happens when you train in many different environments.
Can you elaborate? ‘Robust’ seems natural for talking about robustness to perturbation in the initial AI design (different objective functions, to the extent that that matters) and robustness against choice of environment.
And so, given switching costs, I think it’s fine to keep talking about instrumental convergence.
I agree that switching costs are important to consider. However, I’ve recently started caring more about establishing and promoting clear nomenclature, both for the purposes of communication and for clearer personal thinking.
My model of the ‘instrumental convergence’ situation is something like:
The switching costs are primarily sensitive to how firmly established the old name is, how widely used it is, and how many “entities” would have to adopt the new name.
I think that if researchers generally agreed that ‘robust instrumentality’ is a clearer name[1] and used it to talk about the concept, the shift would naturally propagate through AI alignment circles and be complete within a year or two. This is just my gut sense, though.
The switch from “optimization daemons” to “mesa-optimizers” seemed to go pretty well.
But ‘optimization daemons’ didn’t have a Wikipedia page yet (unlike ‘instrumental convergence’).
Of course, all of this is conditional on your agreeing that ‘robust instrumentality’ is in fact a better name; if you disagree, I’m interested in hearing why.[2] But if you agree, I think that the switch would probably happen if people are willing to absorb a small communication overhead for a while as the meme propagates. (And I do think it’s small—I talk about robust instrumentality all the time, and it really doesn’t take long to explain the switch.)
On the bright side, I think the situation for ‘instrumental convergence / robust instrumentality’ is better than the one for ‘corrigibility’, where we have a single handle for wildly different concepts!
[1] A clearer name—once explained to the reader, at least; ‘robust instrumentality’ unfortunately isn’t as transparent as ‘factored cognition hypothesis.’
[2] Especially before the 2019 LW review book is published, as it seems probable that Seeking Power is Often Robustly Instrumental in MDPs will be included. I am ready to be convinced that there exists an even better name than ‘robust instrumentality’ and to rework my writing accordingly.
Can you elaborate? ‘Robust’ seems natural for talking about robustness to perturbation in the initial AI design (different objective functions, to the extent that that matters) and robustness against choice of environment.
The first ambiguity I dislike here is that you could either be describing the emergence of instrumentality as robust, or the trait of instrumentality as robust. It seems like you’re trying to do the former, but because “robust” modifies “instrumentality”, the latter is a more natural interpretation.
For example, if I said “life on earth is very robust”, the natural interpretation is: given that life exists on earth, it’ll be hard to wipe it out. Whereas an emergence-focused interpretation (like yours) would be: life would probably have emerged given a wide range of initial conditions on earth. But I imagine that very few people would interpret my original statement in that way.
The second ambiguity I dislike: even if we interpret “robust instrumentality” as the claim that “the emergence of instrumentality is robust”, this still doesn’t get us what we want. Bostrom’s claim is not just that instrumental reasoning usually emerges; it’s that specific instrumental goals usually emerge. But “instrumentality” is more naturally interpreted as the general tendency to do instrumental reasoning.
On switching costs: Bostrom has been very widely read, so changing one of his core terms will be much harder than changing a niche working handle like “optimisation daemon”, and would probably leave a whole bunch of people confused for quite a while. I do agree the original term is flawed though, and will keep an eye out for potential alternatives—I just don’t think robust instrumentality is clear enough to serve that role.
The first ambiguity I dislike here is that you could either be describing the emergence of instrumentality as robust, or the trait of instrumentality as robust. It seems like you’re trying to do the former, but because “robust” modifies “instrumentality”, the latter is a more natural interpretation.
One possibility is that we have to individuate these “instrumental convergence”-adjacent theses using different terminology. I think ‘robust instrumentality’ is basically correct for optimal actions, because there’s no question of ‘emergence’: optimal actions just are.
However, it doesn’t make sense to say the same for conjectures about how training such-and-such a system tends to induce property Y, for the reasons you mention. In particular, if property Y is not about goal-directed behavior, then it no longer makes sense to talk about ‘instrumentality’ from the system’s perspective. e.g. I’m not sure it makes sense to say ‘edge detectors are robustly instrumental for this network structure on this dataset after X epochs’.
(These are early thoughts; I wanted to get them out, and may revise them later or add another comment)
EDIT: In the context of MDPs, however, I prefer to talk in terms of (formal) POWER and of optimality probability, instead of in terms of robust instrumentality. I find ‘robust instrumentality’ to be better as an informal handle, but its formal operationalization seems better for precise thinking.
I think ‘robust instrumentality’ is basically correct for optimal actions, because there’s no question of ‘emergence’: optimal actions just are.
If I were to put my objection another way: I usually interpret “robust” to mean something like “stable under perturbations”. But the perturbation of “change the environment, and then see what the new optimal policy is” is a rather unnatural one to think about; most ML people would more naturally think about perturbing an agent’s inputs, or its state, and seeing whether it still behaved instrumentally.
A more accurate description might be something like “ubiquitous instrumentality”? But this isn’t a very aesthetically pleasing name.
But the perturbation of “change the environment, and then see what the new optimal policy is” is a rather unnatural one to think about; most ML people would more naturally think about perturbing an agent’s inputs, or its state, and seeing whether it still behaved instrumentally.
Ah. To clarify, I was referring to holding an environment fixed, and then considering whether, at a given state, an action has a high probability of being optimal across reward functions. I think it makes sense to call those actions ‘robustly instrumental.’
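That optimality-probability framing can be made concrete with a toy example. The sketch below (entirely hypothetical illustration; the MDP, reward distribution, and names are mine, not from any of the posts under discussion) builds a five-state deterministic MDP where state 0 chooses between a dead-end terminal and a “hub” leading on to two further terminals, samples state-based reward functions uniformly, and estimates how often the hub action is optimal at state 0:

```python
import numpy as np

rng = np.random.default_rng(0)
gamma = 0.9  # discount factor

# Deterministic toy MDP: next_state[s][a] is the successor of state s under action a.
# State 0 chooses between a dead-end terminal (state 1) and a hub (state 2),
# which in turn leads to one of two terminals (states 3 and 4).
next_state = {
    0: [1, 2],   # a=0: dead end, a=1: hub
    1: [1, 1],   # terminal self-loop
    2: [3, 4],   # hub: pick either terminal
    3: [3, 3],   # terminal self-loop
    4: [4, 4],   # terminal self-loop
}

def optimal_action_at_start(reward):
    """Value iteration for this reward function; returns argmax action at state 0."""
    V = np.zeros(5)
    for _ in range(100):  # gamma^100 is negligible, so this has converged
        V = np.array([max(reward[next_state[s][a]] + gamma * V[next_state[s][a]]
                          for a in (0, 1)) for s in range(5)])
    q = [reward[next_state[0][a]] + gamma * V[next_state[0][a]] for a in (0, 1)]
    return int(np.argmax(q))

# Estimate the probability that the hub action is optimal, across reward draws.
n = 1000
hub_optimal = sum(
    optimal_action_at_start(rng.uniform(size=5)) == 1 for _ in range(n)
)
print(hub_optimal / n)  # fraction of sampled reward functions favouring the hub
```

Under these assumptions the hub action comes out optimal for well over half of the sampled reward functions—‘robustly instrumental’ in the sense above not because of any single goal, but because keeping more terminal states reachable is optimal for most reward functions.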
A more accurate description might be something like “ubiquitous instrumentality”? But this isn’t a very aesthetically pleasing name.
I’d considered ‘attractive instrumentality’ a few days ago, to convey the idea that certain kinds of subgoals are attractor points during plan formulation, but the usual reading of ‘attractive’ isn’t ‘having attractor-like properties.’