Researcher at Epoch AI
@justjoshinyou13 on twitter
Another point here is that elections are an additional check after the courts, Congress, etc. US presidential elections are not administered by the federal government, they are administered by the states. So to interfere with elections, the president can’t just fill election boards with cronies or give orders to anyone in his chain of command to rig the election. He’d have to forcibly manipulate or interfere with state officials and state governments, risking direct conflict with states. And if he doesn’t interfere with the election and the states announce results showing he lost in a landslide, his political power almost certainly evaporates. Of course, if all the president’s crazy actions are in fact popular, then he is much more likely to succeed and stay in power for many reasons.
Again, if you assume the military always slavishly follows the president, then this ends up in a civil war with a plausible military victory for the president. But each escalation into “this is obviously illegitimate” means the president increasingly offends his generals’ sense of duty, decreases the probability of success and increases the legal and political risk for the officers following his orders, increases the size and motivation of the inevitable popular resistance, etc.
In practice most federal offices have deferred to what the Supreme Court says, but we haven’t really seen what happens when e.g. a sitting president insists on an interpretation of the constitution that disagrees, and the constitution itself provides no clear answer.
This is a somewhat confusing statement. To be clear, it’s extremely common for the president to disagree with courts on the law or Constitution: this happens dozens of times per presidential term. And when he loses in court, the president may declare that he still thinks he is right and that the Court ruled incorrectly. But this wouldn’t cause a constitutional crisis or anything by default: the president almost always follows court orders or court opinions. It’s a very ingrained norm in the US that court orders, especially from the Supreme Court, are binding.
(relevant thread from a lawyer early last year on the powers and tools that courts have to force a president or other federal officials to comply with court orders, such as freezing assets).
I think there’s a lot of reasoning here that effectively goes “if the president has absolute power such that the military and federal officers will always listen to his orders, then the US legal system will have trouble reining him in.” Which is kind of just begging the question. But somewhere in the chain of events you suggest, the president would break a lot of clear red lines and probably lose nearly all of his political support from the general population and the powerful elements of society, unless he has already broadly persuaded people that his power-grabbing actions are actually a good idea.
Yeah I think leading labs generally retrain their base models less often than every 6 months (but there’s a lot we don’t know for sure). And I believe this most likely has to do with a production AI model being the result of a lot of careful tuning on pre-training, mid-training, post-training, etc. Swapping in a new base model might lead to a lot of post-training regressions that need to be fixed. And your old base model is a “lucky” one in some sense, because it was selected for doing well and/or required lots of experiments, derisking runs, etc. Even with all of your new algorithmic tricks it might be hard to one-shot YOLO a base model that’s better than your SOTA model from nine months ago. But this is probably much easier for your model from 18 or 27 months ago.
Also I’d guess staff costs are more important than compute costs here but these considerations mean compute costs of retraining are higher than one might think.
You should be more uncertain about the METR benchmark’s external validity than what these error bars show.
But your baseline uncertainty about key facts about AI progress in general should also often span much more than one order of magnitude between your 2.5th percentile and 97.5th percentile guess. The METR results add a lot of value, and I don’t think these error bars are a big deal in the scheme of things.
Most successful startups slow down a lot after a brief hypergrowth phase. We should be looking for signs that AI companies like OpenAI and Anthropic* are experiencing unusually long and persistent hypergrowth: surprisingly little slowdown in growth, or maintaining >2x growth/year at surprisingly high revenue levels like $100B. They are both already growing very surprisingly fast for companies with multiple billions in revenue, to be clear, but whether that continues is valuable evidence.
This could be a sign that present-day models have a higher economic ceiling than we realize (closer to TAI than they might look), or that companies are making real progress towards transformative AI. Most companies don’t dramatically improve their product lineup over and over again after they find initial product-market fit, so sustained rapid growth means that AI development is leading to a new batch of successful products on a regular basis, i.e. escalating economic usefulness.
*I think companies that serve AI to end-users are the most useful indicators.
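As a rough illustration of what sustained hypergrowth would mean, here is a back-of-the-envelope sketch; the starting revenue figure is made up purely for illustration:

```python
# Back-of-the-envelope: years of sustained 2x/year growth needed to go from
# one revenue level to another (illustrative numbers, not forecasts).
import math

def years_to_reach(start_revenue_b: float, target_revenue_b: float, growth_per_year: float = 2.0) -> float:
    """Years of compounding at `growth_per_year` to go from start to target revenue (in $B)."""
    return math.log(target_revenue_b / start_revenue_b) / math.log(growth_per_year)

# E.g. from a hypothetical ~$5B/year to $100B/year at a sustained 2x/year:
print(f"{years_to_reach(5, 100):.1f} years")  # ~4.3 years of uninterrupted doubling
```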
I’d flip it around and ask whether Gabriel thinks the best models from 6, 12, or 18 months ago could be performing at today’s level with maximum elicitation.
I think the linked tweet is possibly just misinterpreting what the authors meant by “transistor operations”? My reading is that “1000” binds to “operations”; the actual number of transistors in each operation is unspecified. That’s how they get the 10,000x number: if a CPU runs at 1 GHz and neurons run at 100 Hz, then even if it takes 1000 clock cycles to do the work of a neuron, the CPU can still do it 10,000x faster.
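Spelling out that arithmetic with the same round numbers:

```python
# Rough serial-speed comparison using the round numbers above (illustrative only).
cpu_clock_hz = 1e9           # assume a 1 GHz CPU
cycles_per_neuron_op = 1e3   # assume 1000 clock cycles to do the work of one neuron
neuron_rate_hz = 1e2         # neurons operate at roughly 100 Hz

cpu_neuron_ops_per_sec = cpu_clock_hz / cycles_per_neuron_op  # 1e6 neuron-equivalents per second
serial_speedup = cpu_neuron_ops_per_sec / neuron_rate_hz      # 1e4, i.e. the 10,000x figure
print(serial_speedup)  # 10000.0
```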
Hmm I see it. I thought it was making a distinct argument from the one Ege was responding to here, but if you’re right it’s the same one.
Then the claim is that an AI run on some (potentially large) cluster of GPUs can think far faster than any human in serial speed. You do lose the rough equivalency between transistors and neurons: a GPU, which is roughly equal to a person in resource costs, happens to have about the same number of transistors as a human brain has neurons. It’s potentially a big deal that AI has a much faster maximum serial speed than humans, but it’s far from clear that such an AI can outwit human society.
OpenAI can probably achieve Meta/Google-style revenue just from monetizing free users, since they’re already one of the biggest platforms in the world, with a clear path to increasing eyeballs through model progress, new modalities and use cases, and building up an app ecosystem (e.g. their widely rumored web browser). An anonymous OpenAI investor explains the basic logic:
The investor argues that the math for investing at the $500 billion valuation is straightforward: Hypothetically, if ChatGPT hits 2 billion users and monetizes at $5 per user per month—“half the rate of things like Google or Facebook”—that’s $120 billion in annual revenue.
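The arithmetic in that hypothetical checks out:

```python
# Sanity-checking the quoted investor math (a hypothetical scenario, not a forecast).
users = 2e9                    # 2 billion ChatGPT users (hypothetical)
revenue_per_user_month = 5.0   # $5 per user per month, "half the rate" of Google/Facebook
annual_revenue = users * revenue_per_user_month * 12
print(f"${annual_revenue / 1e9:.0f}B per year")  # $120B per year
```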
However, this might take a long time to fully realize, perhaps like 10 years?
Google DeepMind uses Nvidia very sparingly if at all. AlphaFold 3 was trained using A100s but that’s the only recent use of Nvidia by GDM I’ve heard of. I think Google proper, outside GDM, primarily uses TPUs over GPUs for internal workloads, but I’m less sure about that.
Google does buy a lot of Nvidia chips for its cloud division, to rent out to other companies.
xAI and Meta still use Nvidia. Almost every non-frontier-lab, non-Chinese AI chip consumer uses Nvidia.
And Alphabet, Amazon, and Broadcom, the companies that design TPU and Trainium, have the 4th, 5th, and 7th biggest market caps in the world.
I think it’s possible that the market is underpricing how big a deal Anthropic and Google DeepMind, and other frontier labs that might follow in their footsteps, are for overall AI chip demand. But it’s not super obvious.
I’m saying it would be challenging for Nvidia to preserve its high share of AI compute production in the first place while trying to execute this strategy. Nvidia is fabless, and its dominance will erode if labs/hyperscalers/Broadcom create satisfactory designs and are willing to place sufficiently large orders with TSMC.
Nvidia already has an AI cloud division that is not negligible but small compared to the big players. But they appear to not even own their own chips: they lease from Oracle.
I am skeptical of this because they can’t just scale up data centers on a dime. And signaling that they are trying to become the new biggest hyperscaler would be risky for their existing sales: big tech and frontier labs will go even harder for custom chips than they are now.
To make this happen Nvidia would probably need to partner with neoclouds like CoreWeave that have weaker affiliations with frontier labs. Nvidia is actively incubating neoclouds and does have very strong relationships here, to be sure, but the neoclouds still have fewer data centers and less technical expertise than the more established hyperscalers.
And I think algorithms and talent are very important.
Personally it will be impossible for me to ignore the part of me that wonders “is this AGI/ASI stuff actually, for real, coming, or will it turn out to be fake.” Studying median timelines bleeds into the question of whether AGI by my natural lifespan is 90% likely or 99.5% likely, and vice versa. So I will continue thinking very carefully about evidence of AGI progress.
But I’ve been increasingly starting to wonder whether software engineering might turn out to be surprisingly easy to automate when the right data/environments are used at much larger scale.
I’ve had similar thoughts: I think there’s still low-hanging fruit in RL, and in scaffolding and further scaling of inference compute. But my general take is that the recent faster trend of doubling every ~4 months is already the result of picking the low-hanging RL fruit for coding and SWE, and fast inference scaling. So this kind of thing will probably lead to a continuation of the fast trend, not another acceleration.
Another source of shorter timelines, depending on what timeline you mean, is the uncertainty from translating time horizon to real-world AI research productivity. Maybe models with an 80% time horizon of 1 month or less are already enough for a huge acceleration of AI R&D, with the right scaffold/unhobbling/bureaucracy that can take advantage of lots of parallel small experiments or other work, or with good complementarities between AI and human labor.
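As a toy extrapolation of what the ~4-month doubling trend would imply (the starting horizon and the work-month definition below are assumptions for illustration, not measurements):

```python
# Toy extrapolation: months until the 80% time horizon reaches ~1 work-month,
# if it keeps doubling every ~4 months (assumed starting point; not a prediction).
import math

doubling_time_months = 4.0
current_horizon_hours = 4.0    # assumed current 80% time horizon, for illustration
target_horizon_hours = 167.0   # ~1 work month, if a work month is ~167 hours

doublings_needed = math.log2(target_horizon_hours / current_horizon_hours)
months_needed = doublings_needed * doubling_time_months
print(f"{doublings_needed:.1f} doublings ≈ {months_needed:.0f} months under these assumptions")
# ~5.4 doublings ≈ ~22 months
```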
In any case, the paper says the curtailments would last about two hours each:
The average duration of load curtailment (i.e., the length of time the new load is curtailed during curtailment events) would be relatively short, at 1.7 hours when average annual load curtailment is limited to 0.25%, 2.1 hours at a 0.5% limit, and 2.5 hours at a 1.0% limit
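For scale, here is a rough conversion of those curtailment limits into full-shutdown-equivalent hours (a simplification I’m adding; the paper’s curtailment events need not cut load all the way to zero):

```python
# Rough scale check: each annual curtailment limit expressed as the equivalent
# hours of *full* shutdown per year (simplification: real events may only
# partially curtail load, so actual curtailed hours can differ).
hours_per_year = 8760
for limit in (0.0025, 0.005, 0.01):
    print(f"{limit:.2%} of annual load ≈ {limit * hours_per_year:.0f} full-shutdown hours/year")
# 0.25% ≈ 22 h, 0.50% ≈ 44 h, 1.00% ≈ 88 h
```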
Demand response can be done with or without backing the data center with battery storage. Demand response and batteries can also stack: if the grid is really stressed, a data center can both turn off and discharge its battery into the grid.
Economically, it makes sense to accept some true downtime to avoid months-long delays in data center construction. This is clearly true for training workloads, which are very important but don’t have live demand. But some downtime is acceptable even for inference clusters: you can reduce compute demand by temporarily slowing down token generation, or use dynamic rate limits. And any curtailment would almost certainly be isolated to one region, so inference data centers in other places would still be operational.
Internal models aren’t 6 months ahead in general.
Sometimes internal models are several months ahead in key benchmarks or capabilities. For example, an internal OpenAI model won gold on IMO but it might be a while before a public OpenAI model does as well at IMO or other math competitions. But you wouldn’t want to use this model, and I don’t think OpenAI uses the model a lot internally.
Also Anthropic is probably a few months ahead of OpenAI in coding.
GPT-5’s time horizon curve for reference. Actually looks a bit smoother than the curves for previous models.
A good term for 10^20 FLOP would be useful. This would make modern models around 100k to 10 million of this unit, which is a tangible number. Some people, e.g. at DeepMind, tried to make “petaflop-days” (8.64e19 FLOP) a thing, but it didn’t catch on.
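The arithmetic behind those figures (the training-compute values below are order-of-magnitude assumptions, not official numbers):

```python
# Quick arithmetic behind the proposed 1e20-FLOP unit and the petaflop-day comparison.
UNIT = 1e20  # the suggested unit, in FLOP

petaflop_day = 1e15 * 86_400  # 8.64e19 FLOP -- close to, but not exactly, 1e20
print(f"petaflop-day = {petaflop_day:.3g} FLOP")

# Illustrative modern training-run sizes (order-of-magnitude assumptions):
for flop in (1e25, 1e26, 1e27):
    print(f"{flop:.0e} FLOP = {flop / UNIT:,.0f} units")
# 1e25 -> 100,000 units; 1e27 -> 10,000,000 units
```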