> Conditioned on [the first AGI is created at time t by AI lab X], it is very unlikely that immediately before t the researchers at X have a very low credence in the proposition “we will create an AGI sometime in the next 30 days”.
It wasn’t exactly that (in particular, I didn’t have the researchers’ beliefs in mind), but I also believe that statement for basically the same reasons, so that should be fine. There’s a lot of ambiguity in that statement (specifically, what counts as AGI), but I probably believe it for most operationalizations of AGI.
(For reference, I was considering “will there be a 1 year doubling of economic output that started before the first 4 year doubling of economic output ended”; for that it’s not sufficient to just argue that we will get AGI suddenly, you also have to argue that the AGI will very quickly become superintelligent enough to double economic output in a very short amount of time.)
I’m pretty agnostic about whether the result of that $100M NAS would be “almost AGI”.
I mean, the difference between a $100M NAS and a $1B NAS is:
Up to 10x the number of models evaluated
Up to 10x the size of models evaluated
If you increase the number of models by 10x and leave the size the same, that somewhat increases your optimization power. If you model the NAS as picking architectures randomly, the $1B NAS can have at most 10x the chance of finding AGI, regardless of fragility, and so can only have at most 10x the expected “value” (whatever your notion of “value”).
If you then also model architectures as non-fragile, then once you have some optimization power, adding more optimization power doesn’t make much of a difference: e.g. the max of n draws from Uniform([0, 1]) has expected value n/(n+1) = 1 − 1/(n+1), so once n is already large (e.g. 100), increasing it further makes ~no difference. Of course, our actual distributions will probably be more bottom-heavy, but as distributions get more bottom-heavy we use gradient descent / evolutionary search to deal with that.
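The diminishing-returns claim can be checked numerically. A minimal sketch (mine, not part of the original discussion), estimating the expected max of n uniform draws by Monte Carlo and comparing against the closed form n/(n+1):

```python
import random

def expected_max(n, trials=5000, seed=0):
    """Monte Carlo estimate of E[max of n draws from Uniform(0, 1)]."""
    rng = random.Random(seed)
    return sum(max(rng.random() for _ in range(n)) for _ in range(trials)) / trials

# Closed form: E[max] = n / (n + 1). Going from n=100 to n=1000
# improves the expected best draw by less than 1%.
for n in (10, 100, 1000):
    print(n, round(expected_max(n), 3), round(n / (n + 1), 3))
```

Under this toy model, the jump from a $100M to a $1B search (10x the draws) barely moves the expected quality of the best architecture found.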
For the size, it’s possible that increases in size lead to huge increases in intelligence, but that doesn’t seem to agree with ML practice so far. Even if you ignore trend extrapolation, I don’t see a reason to expect that increasing model sizes should mean the difference between not-even-close-to-AGI and AGI.
> If you model the NAS as picking architectures randomly
I don’t. NAS can be done with RL or evolutionary computation methods. (Tbc, when I said I model a big part of contemporary ML research as “trial and error”, by trial and error I did not mean random search.)
> If you then also model architectures as non-fragile, then once you have some optimization power, adding more optimization power doesn’t make much of a difference,
Earlier in this discussion you defined fragility as the property “if you make even a slight change to the thing, then it breaks and doesn’t work”. While finding fragile solutions is hard, finding a non-fragile solution is not necessarily easy, so I don’t follow the logic of that paragraph.
Suppose that all model architectures are indeed non-fragile, and some of them can implement AGI (call them “AGI architectures”). It may be the case that, relative to the set of model architectures that we can end up with when using our favorite method (e.g. evolutionary search), the AGI architectures are a tiny subset. E.g. the size ratio can be 10^-10 (and then running our evolutionary search 10x as many times means roughly 10x the probability of finding an AGI architecture, as long as [number of runs] << 10^10).
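The near-linear scaling asserted here holds whenever the expected number of hits is much less than 1. A quick sketch under the assumed 10^-10 size ratio (the numbers are illustrative, and independent sampling is a simplification of evolutionary search):

```python
def hit_probability(p, runs):
    """Chance that at least one of `runs` independent samples lands in a
    target set occupying a fraction `p` of the search space."""
    return 1 - (1 - p) ** runs

p = 1e-10  # assumed size ratio of AGI architectures
for runs in (1e5, 1e6, 1e7):
    ratio = hit_probability(p, 10 * runs) / hit_probability(p, runs)
    print(f"{runs:.0e} runs -> 10x runs gives {ratio:.4f}x the hit probability")
```

Only once [number of runs] approaches 1/p does the scaling flatten out; far below that threshold, 10x the compute buys almost exactly 10x the probability.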
> I don’t. NAS can be done with RL or evolutionary computation methods. (Tbc, when I said I model a big part of contemporary ML research as “trial and error”, by trial and error I did not mean random search.)
I do think that similar conclusions apply there as well, though I’m not going to make a mathematical model for it.
> finding a non-fragile solution is not necessarily easy
I’m not saying it is; I’m saying that however hard it is to find a non-fragile good solution, it is easier to find a solution that is almost as good. When I say
> adding more optimization power doesn’t make much of a difference
I mean to imply that the existing optimization power will do most of the work, for whatever quality of solution you are getting.
> Suppose that all model architectures are indeed non-fragile, and some of them can implement AGI (call them “AGI architectures”). It may be the case that relative to the set of model architectures that we can end up with when using our favorite method (e.g. evolutionary search), the AGI architectures are a tiny subset. E.g. the size ratio can be 10^-10 (and then running our evolutionary search 10x times means roughly 10x probability of finding an AGI architecture, if [number of runs] << 10^10).
(Aside: it would be way smaller than 10^-10.) In this scenario, my argument is that the size ratio for “almost-AGI architectures” is better (e.g. 10^-9), and so you’re more likely to find one of those first.
In practice, if you have a thousand parameters that determine an architecture, and 10 settings for each of them, the size ratio for the (assumed unique) globally best architecture is 10^-1000. In this setting, I expect several orders of magnitude of difference between the size ratio of almost-AGI and the size ratio of AGI, making it essentially guaranteed that you find an almost-AGI architecture before an AGI architecture.
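The gap between the two size ratios can be made concrete in a toy version of this setting (my sketch; treating “almost-best” as “at most k parameters set wrong” is an assumption for illustration):

```python
from math import comb

d = 1000  # parameters determining an architecture
s = 10    # settings per parameter; the space has s**d points

# The unique globally best architecture has size ratio s**-d = 10^-1000.
# Count architectures with at most k parameters set "wrong":
def almost_count(k):
    return sum(comb(d, i) * (s - 1) ** i for i in range(k + 1))

# Tolerating just 3 wrong parameters makes the near-best set about
# 10^11 times larger than the exact optimum, so any search process
# is overwhelmingly likely to hit a near-miss first.
print(f"{almost_count(3):.2e}")
```

Each additional tolerated error multiplies the target set by roughly d·(s−1)/k, which is where the “several orders of magnitude” between almost-AGI and AGI size ratios comes from in this toy model.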
> In this scenario, my argument is that the size ratio for “almost-AGI architectures” is better (e.g. 10^-9), and so you’re more likely to find one of those first.
For a “local search NAS” (rather than “random search NAS”) it seems that we should be considering here the set of [“almost-AGI architectures” from which the local search would not find an “AGI architecture”].
The “$1B NAS discontinuity scenario” allows for the $1B NAS to find “almost-AGI architectures” before finding an “AGI architecture”.
> For a “local search NAS” (rather than “random search NAS”) it seems that we should be considering here the set of [“almost-AGI architectures” from which the local search would not find an “AGI architecture”].
> The “$1B NAS discontinuity scenario” allows for the $1B NAS to find “almost-AGI architectures” before finding an “AGI architecture”.
Agreed. My point is that the $100M NAS would find the almost-AGI architectures. (My point with the size ratios is that whatever criterion you use to say “and that’s why the $1B NAS finds AGI while the $100M NAS doesn’t”, my response would be that “well, almost-AGI architectures require a slightly easier-to-achieve value of <criterion>, that the $100M NAS would have achieved”.)