Thanks for doing this, this is really good!
Some quick thoughts, will follow up later with more once I finish reading and digesting:
--I feel like it’s unfair to downweight the lower-compute scenarios based on recent evidence without also downweighting some of the higher-compute scenarios. Sure, I concede that the recent boom in deep learning is not quite as massive as one might expect if one more order of magnitude would get us to TAI. But I also think that it’s a lot bigger than one might expect if fifteen more are needed! Moreover, I feel that the update should be fairly small in both cases, because both updates are based on armchair speculation about what the market and capabilities landscape should look like in the years leading up to TAI. Maybe the market isn’t efficient; maybe we really are in an AI overhang.
--If we are in the business of adjusting our weights for the various distributions based on recent empirical evidence (as opposed to more a priori considerations), then I feel like there are other pieces of evidence that argue for shorter timelines. For example, the GPT scaling trends seem to go somewhere really exciting if you extrapolate them four more orders of magnitude or so.
--Relatedly, GPT-3 is the most impressive model I know of so far, and it has only 1/1000th as many parameters as the human brain has synapses. I think it’s not crazy to think that maybe we’ll start getting some transformative shit once we have models with as many parameters as the human brain, trained for the equivalent of 30 years. Yes, this goes against the scaling laws, and yes, arguably the human brain makes use of priors and instincts baked in by evolution, etc. But still, I feel like at least a couple percentage points of probability should be added to “it’ll only take a few more orders of magnitude” just in case we are wrong about the laws or their applicability. It seems overconfident not to. Maybe I just don’t know enough about the scaling laws and stuff to have as much confidence in them as you do.
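To make the rough numbers behind that last point explicit, here's a quick sanity-check sketch in Python (the parameter and synapse counts are ballpark assumptions on my part, not figures from the report):

```python
# Quick sanity check of the parameter-count comparison above.
# Both constants are ballpark assumptions: ~175B parameters for GPT-3
# (the published figure) and ~1e14 synapses for the human brain
# (estimates commonly range from ~1e14 to ~1e15).
GPT3_PARAMS = 1.75e11
BRAIN_SYNAPSES = 1e14

ratio = GPT3_PARAMS / BRAIN_SYNAPSES
print(f"GPT-3 params as a fraction of brain synapses: ~1/{1 / ratio:.0f}")  # ~1/570

SECONDS_PER_YEAR = 365.25 * 24 * 3600
print(f"30 subjective years ≈ {30 * SECONDS_PER_YEAR:.1e} subjective seconds")  # ~9.5e8
```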
Thanks Daniel! Quick replies:
On down-weighting low-end vs high-end compute levels: The reason the down-weighting for low-end compute levels was done in a separate and explicit way is that I think there’s a structural difference between the two updates. When updating against low-end compute levels, I think it makes more sense to do that update within each hypothesis, because only some orders of magnitude are affected. To implement an “update against high-end compute levels”, we can simply lower the probability we assign to high-compute hypotheses, since there is no specific reason to shave off just a few OOMs at the far right. My probability on the Evolution Anchor hypothesis is 10%, and my probability on the Long Horizon Neural Network hypothesis is 15%; these are lower than my probabilities on the Short Horizon Neural Network hypothesis (20%) and Medium Horizon Neural Network hypothesis (30%) because I feel that the higher-end hypotheses are less consistent with the holistic balance of evidence.
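A toy Python sketch of the structural difference I have in mind (the hypothesis weights are the ones quoted above; the OOM range and the down-weighting factors are made up purely for illustration):

```python
import numpy as np

# Mixture weights for the hypotheses named above; the remaining mass sits on
# hypotheses not discussed in this comment.
weights = {
    "Short Horizon NN": 0.20,
    "Medium Horizon NN": 0.30,
    "Long Horizon NN": 0.15,
    "Evolution Anchor": 0.10,
}
print("mass on listed hypotheses:", sum(weights.values()))  # 0.75

# (1) Updating against high-end compute: lower a whole hypothesis's mixture weight.
weights["Evolution Anchor"] *= 0.5  # illustrative factor only

# (2) Updating against low-end compute: reshape the distribution *within* a
# hypothesis. Suppose a hypothesis spreads mass over training-compute OOMs 24-40
# (a made-up range):
ooms = np.arange(24, 41)
within = np.full(len(ooms), 1 / len(ooms))  # start from a flat within-hypothesis prior
within[ooms < 27] *= 0.2                    # down-weight only the low OOMs
within /= within.sum()                      # renormalise within the hypothesis
```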
On the GPT scaling trend: I think that the way to express the view that GPT++ would constitute TAI is to heavily weight the Short Horizon Neural Network hypothesis, potentially along with shifting and/or narrowing the range of effective horizon lengths in that bucket to be more concentrated on the low end (e.g. 0.1 to 30 subjective seconds rather than 1 to 1000 subjective seconds).
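One way to picture the narrowing/shifting (purely illustrative; treating each quoted range as roughly the central 80% of a lognormal is my assumption, not something from the report):

```python
import numpy as np
from scipy.stats import norm

def lognormal_from_range(lo, hi, central_mass=0.80):
    """Fit a lognormal whose central `central_mass` spans [lo, hi] subjective seconds."""
    z = norm.ppf(0.5 + central_mass / 2)        # ~1.28 for an 80% interval
    mu = (np.log(lo) + np.log(hi)) / 2          # log-space midpoint -> median
    sigma = (np.log(hi) - np.log(lo)) / (2 * z)
    return np.exp(mu), sigma                    # (median, log-space std dev)

print(lognormal_from_range(1, 1000))   # wider range: median ~32 subjective seconds
print(lognormal_from_range(0.1, 30))   # shifted/narrowed range: median ~1.7 subjective seconds
```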
On getting transformative abilities with 1e15 parameter models trained for 30 subjective years: I think this is pretty unlikely, but, as you said, not crazy; I think the way to express this view would be to up-weight the Lifetime Anchor hypothesis. My weight on it is currently 5%. Additionally, all the Neural Network hypotheses assign substantial probability to relatively small models (e.g. 1e12 FLOP/subj sec) and to shallower scaling than we’ve seen demonstrated so far (e.g. an exponent of 0.25).
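For a rough sense of the magnitudes involved (the 1e15 FLOP/subjective-second figure below is a ballpark brain-compute assumption of mine, not a number taken from the report):

```python
# Total training compute implied by running a model for 30 subjective years,
# at the small-model rate mentioned above (1e12 FLOP/subj sec) and at a
# ballpark brain-scale rate (~1e15 FLOP/subj sec, an assumption).
SECONDS_PER_YEAR = 365.25 * 24 * 3600
subj_sec = 30 * SECONDS_PER_YEAR  # ~9.5e8 subjective seconds

for flop_per_subj_sec in (1e12, 1e15):
    total = flop_per_subj_sec * subj_sec
    print(f"{flop_per_subj_sec:.0e} FLOP/subj sec -> ~{total:.1e} FLOP of training compute")
# ~9.5e+20 FLOP and ~9.5e+23 FLOP respectively
```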
Thanks! Just as a heads up, I have now read it thoroughly enough that I’ve collected quite a few thoughts about it, so I intend to make a post sometime in the next week or so giving my various points of disagreement and confusion, including my response to your response here. If you’d rather I do this sooner, I can hustle, and if you’d rather I wait till after the report is out, I can do that too.
Thanks! No need to wait for a more official release (that could take a long time since I’m prioritizing other projects).