e.g. I could argue against “1 + 1 = 2” by saying that it’s an infinite conjunction of “1 + 1 != 3” AND “1 + 1 != 4” AND … and so it can’t possibly be true.
Uh, when I learned addition (in the foundation-of-mathematics sense) the fact that 2 was the only possible result of 1+1 was a big part of what made it addition / made addition useful.
There’s a huge structural similarity between the proof that ‘1 + 1 != 3’ and ‘1+1 != 4’; like, both are generic instances of the class ‘1 + 1 != n \forall n != 2’. We can increase the number of numbers without decreasing the plausibility of this claim (like, consider it in Z/4, then Z/8, then Z/16, then...).
But if instead I make a claim of the form “I am the only person who uses the name ‘Vaniver’”, we don’t have the same sort of structural similarity, and we do have to check the names of everyone else, and the more people there are, the less plausible the claim becomes.
Similarly, if we make an argument that something is an attractor in N-dimensional space, that does actually grow less plausible the more dimensions there are, since there are more ways for the thing to have a derivative that points away from the ‘attractor,’ if we think the dimensions aren’t all symmetric. (If there’s only gravity, for example, we seem in a better position to end up with attractors than if there’s a random force field, even in 4d, 8d, 16d, etc.; similarly if there’s a random potential function whose derivative is used to compute the forces.)
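A toy sketch of the symmetric-versus-independent contrast above (an illustration, not something from the thread). It assumes a deliberately simplified model: a decoupled linear system dx_i/dt = k_i * x_i, where the origin is an attractor only if every k_i is negative. The function name `attractor_fraction`, the `shared` flag, and the Gaussian sampling are all assumptions made for the example.

```python
import random

def attractor_fraction(n_dims, shared, trials=20000, seed=0):
    """Estimate how often the origin attracts under dx_i/dt = k_i * x_i.

    The origin is an attractor only if every coefficient k_i is negative.
    shared=True  -> one coefficient is reused in every dimension (symmetric case).
    shared=False -> each dimension draws its own coefficient independently.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        if shared:
            k = rng.gauss(0.0, 1.0)
            coeffs = [k] * n_dims
        else:
            coeffs = [rng.gauss(0.0, 1.0) for _ in range(n_dims)]
        if all(k < 0 for k in coeffs):
            hits += 1
    return hits / trials

if __name__ == "__main__":
    for n in (1, 2, 4, 8):
        print(n, attractor_fraction(n, shared=True), attractor_fraction(n, shared=False))
```

In this toy model the shared-coefficient fraction stays near one half however many dimensions there are, while the independent fraction falls off roughly as 2^-N. Real force fields have cross-terms between dimensions, so this only shows the direction of the effect.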
There’s a huge structural similarity between the proof that ‘1 + 1 != 3’ and ‘1+1 != 4’; like, both are generic instances of the class ‘1 + 1 != n \forall n != 2’. We can increase the number of numbers without decreasing the plausibility of this claim (like, consider it in Z/4, then Z/8, then Z/16, then...).
I feel like that’s exactly my point? Showing that something is a conjunction of a bunch of claims should not always make you think that claim is low probability, because there could be structural similarity between those claims such that a single argument is enough to argue for all of them.
(The claims “If X drifts away from corrigibility along dimension {N}, it will get pulled back” are clearly structurally similar, and the broad basin of corrigibility argument is meant to be an argument that argues for all of them.)
Similarly, if we make an argument that something is an attractor in N-dimensional space, that does actually grow less plausible the more dimensions there are, since there are more ways for the thing to have a derivative that points away from the ‘attractor,’ if we think the dimensions aren’t all symmetric.
1. Why aren’t the dimensions symmetric?
2. I somewhat buy the differential argument (more dimensions ⇒ less plausible) but not the absolute argument (therefore not plausible); this post is arguing for the absolute version:
“it starts to feel awfully unlikely that corrigibility is really a broad basin of attraction after all”
3. I’m not sure where the idea of a “derivative” is coming from—I thought we were talking about small random edits to the weights of a neural network. If we’re training the network on some objective that doesn’t incentivize corrigibility then certainly it won’t stay corrigible.
The claims “If X drifts away from corrigibility along dimension {N}, it will get pulled back” are clearly structurally similar, and the broad basin of corrigibility argument is meant to be an argument that argues for all of them.
To be clear, I think there are two very different arguments here:
1) If we have an AGI that is corrigible, it will not randomly drift to be not corrigible, because it will proactively notice and correct potential errors or loss of corrigibility.
2) If we have an AGI that is partly corrigible, it will help us ‘finish up’ the definition of corrigibility / edit itself to be more corrigible, because we want it to be more corrigible and it’s trying to do what we want.
The first is “corrigibility is a stable attractor”, and I think there’s structural similarity between arguments that different deviations will be corrected. The second is the “broad basin of corrigibility”, where for any barely acceptable initial definition of “do what we want”, it will figure out that “help us find the right definition of corrigibility and implement it” will score highly on its initial metric of “do what we want.”
Like, it’s not the argument that corrigibility is a stable attractor; it’s an argument that corrigibility is a stable attractor with no nearby attractors. (At least in the dimensions that it’s ‘broad’ in.)
I find it less plausible that missing pieces in our definition of “do what we want” will be fixed in structurally similar ways, and I think there are probably a lot of traps where a plausible sketch definition doesn’t automatically repair itself. One can lean here on “barely acceptable”, but I don’t find that very satisfying. [In particular, it would be nice if we had a definition of corrigibility where we could look at it and say “yep, that’s the real deal or grows up to be the real deal,” tho that likely requires knowing what the “real deal” is; the “broad basin” argument seems to me to be meaningful only in that it claims “something that grows into the real deal is easy to find instead of hard to find,” and when I reword that claim as “there aren’t any dead ends near the real deal” it seems less plausible.]
1. Why aren’t the dimensions symmetric?
In physical space, generally things are symmetric between swapping the dimensions around; in algorithm-space, that isn’t true. (Like, permute the weights in a layer and you get different functional behavior.) Thus while it’s sort of wacky in a physical environment to say “oh yeah, df/dx, df/dy, and df/dz are all independently sampled from a distribution” it’s less wacky to say that of neural network weights (or the appropriate medium-sized analog).
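A minimal sketch of the weight-permutation point, under stated assumptions: a hypothetical 2-2-1 ReLU network with made-up weights, written in plain Python. The names `forward`, `w_hidden`, and `w_out` are illustrative only.

```python
def relu(values):
    return [max(0.0, v) for v in values]

def forward(w_hidden, w_out, x):
    """Tiny 2-2-1 network: hidden = relu(W x), output = w_out . hidden."""
    hidden = relu([sum(w * xi for w, xi in zip(row, x)) for row in w_hidden])
    return sum(w * h for w, h in zip(w_out, hidden))

# Made-up weights for illustration.
w_hidden = [[1.0, -2.0], [0.5, 3.0]]
w_out = [1.0, -1.0]

# Same multiset of weights, but two entries in the first row swap places.
w_swapped = [[-2.0, 1.0], [0.5, 3.0]]

x = [2.0, 0.0]
original = forward(w_hidden, w_out, x)   # 1.0
swapped = forward(w_swapped, w_out, x)   # -1.0
```

Swapping two individual weights changes the output on the same input, so the coordinates of weight space are not interchangeable the way spatial axes are. One known exception is relabeling whole hidden units consistently (their incoming and outgoing weights together), which leaves the function unchanged.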
1) If we have an AGI that is corrigible, it will not randomly drift to be not corrigible, because it will proactively notice and correct potential errors or loss of corrigibility.
2) If we have an AGI that is partly corrigible, it will help us ‘finish up’ the definition of corrigibility / edit itself to be more corrigible, because we want it to be more corrigible and it’s trying to do what we want.
Good point on distinguishing these two arguments. It sounds like we agree on 1. I also thought the OP was talking about 1.
For 2, I don’t think we can make a dimensionality argument (as in the OP), because we’re talking about edits that are the ones that the AI chooses for itself. You can’t apply dimensionality arguments to choices made by intelligent agents (e.g. presumably you wouldn’t argue that every glass in my house must be broken because the vast majority of ways of interacting with glasses breaks them). Or put another way, the structural similarity is just “the AI wouldn’t choose to do <bad thing #N>”, in all cases because it’s intelligent and understands what it’s doing.
Now the question of “how right do we need to get the initial definition of corrigibility” is much less obvious. If you told me we got the definition wrong in a million different ways, I would indeed be worried and probably wouldn’t expect it to self-correct (depending on the meaning of “different”). But like… really? We get it wrong a million different ways? I don’t see why we’d expect that.
Like, it’s not the argument that corrigibility is a stable attractor; it’s an argument that corrigibility is a stable attractor with no nearby attractors. (At least in the dimensions that it’s ‘broad’ in.)
Just want to echo Rohin in saying that this is a very helpful distinction, thanks!
I was actually making the stronger argument that it’s not a stable attractor at all—at least not until someone solves the problem of how to maintain stable goals / motivations under learning / reflecting / ontological crises.
(The “someone” who solves the problem could be the AI, but it seems to be a hard problem even for human-level intelligence; cf. my comment here.)