I honestly think that’s pretty irrelevant since the data are purely produced through non-reproducible intuited or verbally justified parameter estimates, so intentionality is completely opaque to readers, which is really the bigger core problem with the exercise.
Okay, we have a more fundamental disagreement than I previously realized. Can you tell me which step of the following argument you disagree with:
1. It’s important to forecast AGI timelines.
2. In order to forecast AGI timelines while taking into account all significantly relevant factors, one needs to use intuitive estimations for several important parameters.
3. Despite (2), it’s important to try anyways and make the results public to (a) advance the state of public understanding of AGI timelines and (b) make our reasoning for our timelines more transparent than if we gave a vague justification with no model.
4. Therefore, it’s a good and important thing to make and publish models like ours.
I have nothing against publishing models like this! I think the focus should be on modeling assumptions like super-exponentiality and treated as modeling exercises first-and-foremost. Assumptions that highly constrain the model should be first and foremost rather than absent from publicly facing write-ups and only in appendices. Perturbation tests should be conducted and publicly shown. It is true that I don’t think this represents high-quality quantitative work, but that doesn’t mean I don’t think any of it is relevant, even if I think the presentation does not demonstrate a good sense of responsibility. The hype and advertising-ness of the presentation is consistent, and while I have been trying to avoid particular comment thereon, it is a constant weight hanging over the discussion. Publication of a model and breathless speculation are not the same, and I really do not want to litigate the speculative aspects but they are the main thing everyone is talking about including in the New York Times.
I think this is not true, but I don’t think it is worth litigating separately from the research progress super-exponential factor, which is also modeled as already happening. It is easier (though I don’t think easy) to defend each one individually based on small-potatoes evidence of a speedup, but I can’t imagine how both could be defended at once, so that has to come first so we don’t double count evidence.
I don’t understand why you think we shouldn’t account for research progress speedups due to AI systems on the way to superhuman coder. And this is clearly a separate dynamic from the trend possibly being superexponential without these speedups taken into account (a possibility for which, again, we already see some empirical evidence). I’d appreciate it if you made object-level arguments against one or both of these factors being included.
I think this has a relatively simple crux:
1. Research progress has empirically slowed according to your own source of EpochAI (Edit: as a percentage of all progress which is the term used in the model. It is constant in absolute terms)
2. You assume the opposite which would explain away the METR speed-up on its own
3. So justifying the additional super-exponential term with the METR speed-up is double counting evidence
Then, split out the calculate_base_time function from the calculate_sc_arrival_year function and plot those results (here is my snippet to adjust the base time to decimal year and limit to the super-exponential group: base_time_in_months[se_mask] / 12 + 2025 ). You’ll have to do some simple wrangling to expose the se_mask and get the plotting function to work but nothing substantive.
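For anyone following along, here is a minimal sketch of the conversion step in that snippet. It only illustrates the masking and the months-to-decimal-year arithmetic. The array values are made up, `to_decimal_year` and `start_year` are hypothetical names introduced here, and nothing below reproduces calculate_base_time or the plotting code.

```python
import numpy as np

def to_decimal_year(base_time_in_months, se_mask, start_year=2025):
    """Keep only the masked (super-exponential) samples and convert
    months-from-start into a decimal calendar year."""
    base_time_in_months = np.asarray(base_time_in_months, dtype=float)
    se_mask = np.asarray(se_mask, dtype=bool)
    return base_time_in_months[se_mask] / 12 + start_year

# Made-up sample values, purely for illustration.
months = np.array([24.0, 60.0, 36.0, 120.0])
mask = np.array([True, False, True, False])  # True = super-exponential group
years = to_decimal_year(months, mask)
```

The resulting `years` array can then be handed to whatever plotting function is already in use.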
Could you share the code on a fork or PR?
I can drop my versions of the files in a drive or something, I’m just editing on the fly, so they aren’t intended to be snapshotted, but here you go!
Imagine if climate researchers had made a model where carbon impact on the environment gets twice-to-infinity times larger in 2028 based on “that’s what we think might be happening in Jupiter.”
A few thoughts:
1. Climate change forecasting is fundamentally more amenable to grounded quantitative modeling than AGI forecasting, and even there my impression is that there’s substantial disagreement based on qualitative arguments regarding various parameter settings (though I’m very far from an expert on this).
2. Forecasts which include intuitive estimations are commonplace and often useful (see e.g. intelligence analysis, Superforecasting, prediction markets, etc.).
3. I’d be curious if you have the same criticism of previous timelines forecasts like Bio Anchors.
4. I don’t understand your analogy re: Jupiter. In the timelines model, we are trying to predict what will happen in the real world on Earth.
4. The analogy is that the justification relies on an analogy to humans, who are notably not LLMs. If it is based purely on the METR “speed-up,” that should be made clearer, as it is a (self-admittedly) weak argument.
(I offered to call with Peter because I’m having trouble understanding his positions over text-based discussion. edit: looks like we will call)
I will briefly say re:
Assumptions that highly constrain the model should be first and foremost rather than absent from publicly facing write-ups and only in appendices. Perturbation tests should be publicly shown as should variances in assumptions. It is true that I don’t think this represents high-quality quantitative work, but that doesn’t mean I don’t think any of it is relevant, even if I think the presentation does not demonstrate a good sense of responsibility. The hype and advertising-ness of the presentation is consistent, and while I have been trying to avoid particular comment thereon, it is a constant weight hanging over the discussion. Publication of a model and breathless speculation are not the same, and I really do not want to litigate the speculative aspects but they are the main thing everyone is talking about including in the New York Times.
I agree in an ideal world the timelines model would be more rigorous with better structure, more ablation studies, more justifications, etc. However this would have taken a lot longer and my guess is that we would have ended up with similar forecasts in terms of their qualitative conclusions about how plausible AGI in/by 2027 is (though of course, I’m not certain). We didn’t want to let perfect be the enemy of the good/useful (i.e. good/useful in our opinion; seems like the crux may be you disagree that it’s useful).
Apologies if any part of the media stuff seemed like it overclaimed regarding our work.
“we would have ended up with similar forecasts in terms of their qualitative conclusions about how plausible AGI in/by 2027 is” if this is with regard to perturbation studies of the two super-exponential terms, I believe based on the work I’ve shared in part here it is false, but happy to agree to disagree on the rest :)
I’m curious to hear what conclusions you think we would have come to & should come to. I’m skeptical that they would have been qualitatively different. Perhaps you are going to argue that we shouldn’t put much credence in the superexponential model? What should we put it in instead? Got a better superexponential model for us? Or are you going to say we should stick to exponential?
Thanks for engaging, I’m afraid I can’t join the call due to a schedule conflict but I look forward to hearing about it from Eli!
Ah, I was trying to avoid implying a lack of integrity in the forecasting effort, and as a result ended up implying knowing other things about your mental state.
To restate: I do not think a forecaster not previously committed to achieving a result with a median around that point would have, after doing a perturbation analysis that would display how dominant the two superexponential terms are in predetermining that specific outcome, presented those conclusions without hammering the point that those two superexponential terms are completely determining the topline result and that no other parameters make a meaningful contribution, whether forecasts of compute availability, current capabilities, capabilities required for SC, or otherwise.
Sorry if my previous comment implied something else!
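To make concrete what a perturbation analysis of this kind looks like, here is a rough sketch on a toy model. This is not the timelines model under discussion: `months_to_target`, its three parameters, and the baseline values are all hypothetical, chosen only to show a one-at-a-time perturbation where each input is nudged by a fixed percentage and the shift in the output is recorded.

```python
def months_to_target(doublings_needed, doubling_time_months, shrink_per_doubling):
    """Toy model (hypothetical): each successive doubling takes
    (1 - shrink_per_doubling) times as long as the previous one.
    With shrink_per_doubling == 0 this reduces to plain exponential growth."""
    ratio = 1.0 - shrink_per_doubling
    if ratio == 1.0:
        return doubling_time_months * doublings_needed
    # Closed form of the geometric series T * (1 + r + ... + r^(n-1)).
    return doubling_time_months * (1.0 - ratio ** doublings_needed) / (1.0 - ratio)

def one_at_a_time(model, baseline, rel_change=0.2):
    """Perturb each parameter down and up by rel_change, holding the others
    at baseline, and report the change in the model output."""
    base_out = model(**baseline)
    report = {}
    for name, value in baseline.items():
        low = model(**{**baseline, name: value * (1 - rel_change)})
        high = model(**{**baseline, name: value * (1 + rel_change)})
        report[name] = (low - base_out, high - base_out)
    return report

# Made-up baseline values, for illustration only.
baseline = {
    "doublings_needed": 10,
    "doubling_time_months": 4.0,
    "shrink_per_doubling": 0.1,
}
report = one_at_a_time(months_to_target, baseline)
for name, (down, up) in report.items():
    print(f"{name}: {down:+.2f} / {up:+.2f} months")
```

Comparing the size of the shifts across parameters is what would show whether one or two terms dominate the topline result. A fuller analysis would also vary parameters jointly, which this sketch does not attempt.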
Assumptions that highly constrain the model should be first and foremost rather than absent from publicly facing write-ups and only in appendices.
Strongly agree — cf. nostalgebraist’s posts making this point on the bio anchors and AI 2027 models. I have the sense this is a pretty fundamental epistemic crux between camps of people making predictions (or suspending judgment!) about AI takeoff.
Yeah, at this point the marginal value add of forecasting/epistemics is in validating/invalidating fundamental assumptions like the software intelligence explosion idea, or the possibility of industrial/chip factory being massively scaled up by AIs, or Moore’s law not ending, rather than on the parameter ranges, because the assumptions overdetermine the conclusion.
Overall, I agree that we were not prioritizing addressing the demographic of skeptics who have detailed reasons for their beliefs. I very much sympathize with disagreeing with the framework that others are using to approach takeoff forecasting rather than just parameter settings. I feel similarly to some extent about others’ work (in particular Epoch’s GATE).
However, I disagree that our model assumes a software-driven intelligence explosion. A substantial percentage of our simulations don’t contain such an explosion! You can see that for example the 90th percentile for ASI is >2100. You can totally input your best guesses for the parameters in our model and end up with <50% on a software-driven explosion.
I also think that our scenario and supplements have convinced some skeptics who didn’t have reasons as sophisticated as e.g. yours and Epoch’s for their disagreements going in. But of course you probably think this is a bad thing so :shrug:
And I’m also worried – as always with this stuff – that there are some people who will look at all those pages and pages of fancy numbers, and think “wow! this sounds crazy but I can’t argue with Serious Expert Research™,” and end up getting convinced even though the document isn’t really trying to convince them in the first place.
Also very much sympathize with this! I’ve had similar concerns about other approaches. I aimed to try to be clear about our levels of uncertainty and how scrappy the model is but perhaps I could have added further caveats? Curious what you would have recommended here.
Finally: I’d be quite interested in your object-level disagreements with our takeoff forecasts. I’ve appreciated your comments on timelines. I’d also be interested in your actual forecasts on timelines/takeoff. Like e.g. your 10/50/90th percentiles, as we gave in our supplements.
I can drop my versions of the files in a drive or something, I’m just editing on the fly, so they aren’t intended to be snapshotted, but here you go!
https://drive.google.com/drive/folders/1_e-hnmSy2FoMSD4UiWKWKNKnsapZDm1M?usp=drive_link
Looking forward to getting clarity!
Comment down below:
https://forum.effectivealtruism.org/posts/rv4SJ68pkCQ9BxzpA/?commentId=EgqgffC4F5yZQreCp
I probably agree with you guys more than you realize. You guys might be interested in this DM I sent to nostalgebraist: