You’re absolutely right, good job! I fixed the OP.
Yep seems right to me. Bravo!
I think Tom’s take is that he expects I will put more probability on a software-only singularity after updating on these considerations. It seems hard to isolate where Tom and I disagree based on this comment, but maybe it is on how much to weigh various considerations about compute being a key input.
Taking it all together, I think you should put more probability on the software-only singularity, mostly because of capability improvements being much more significant than you assume.
I’m confused — I thought you put significantly less probability on software-only singularity than Ryan does? (Like half?) Maybe you were using a different bound for the number of OOMs of improvement?
My sense is that I start with a higher value due to the LLM case looking faster (and not feeling the need to adjust downward in a few places like you do in the LLM case). Obviously the numbers in the LLM case are much less certain given that I’m guessing based on qualitative improvement and looking at some open-source models, but the LLM case being closer to what we actually care about maybe outweighs this.
I also think I’d make a slightly larger downward adjustment for diminishing returns, due to thinking there’s a good chance returns diminish substantially more sharply as you get closer and closer to the limit, rather than decreasing linearly (based on some first-principles reasoning and my understanding of how returns diminished in the semiconductor case).
But the biggest delta is that I think I wasn’t pricing in the importance of increasing capabilities. (Which seems especially important if you apply a large R&D parallelization penalty.)
So, Project Stargate. Is it real, or is it another “Sam Altman wants $7 trillion”? Some points:
The USG invested nothing in it. Some news outlets are being misleading about this. Trump just stood next to them and looked pretty, maybe indicated he’d cut some red tape. It is not an “AI Manhattan Project”, at least not as of now.
Elon Musk claims that they don’t have the money and that SoftBank (stated to have “financial responsibility” in the announcement) has less than $10 billion secured. If true, while this doesn’t mean they can’t secure an order of magnitude more by tomorrow, this does directly clash with the “deploying $100 billion immediately” statement.
But Sam Altman counters that Musk’s statement is “wrong”, as Musk “surely knows”.
I… don’t know which claim I distrust more. Hm, I’d say Altman feeling the need to “correct the narrative” here, instead of just ignoring Musk, seems like a sign of weakness? He doesn’t seem like the type to naturally get into petty squabbles like this, otherwise.
(And why, yes, this is what an interaction between two Serious People building world-changing existentially dangerous megaprojects looks like. Apparently.)
Some people try to counter Musk’s claim by citing Satya Nadella’s statement that he’s “good for his $80 billion”. But that’s not referring to Stargate, that’s about Azure. Microsoft is not listed as investing in Stargate at all, it’s only a “technology partner”.
Here’s a brief analysis from the CIO of some investment firm. He thinks it’s plausible that the stated “initial group of investors” (SoftBank, OpenAI, Oracle, MGX) may invest fifty billion dollars into it over the next four years; not five hundred billion.
They don’t seem to have the raw cash for even $100 billion – and if SoftBank secured the missing funding from some other set of entities, why aren’t they listed as “initial equity funders” in the announcement?
Overall, I’m inclined to fall back on @Vladimir_Nesov’s analysis here. $30-50 billion this year seems plausible. But $500 billion is, for now, just Altman doing door-in-the-face as he did with his $7 trillion.
I haven’t looked very deeply into it, though. Additions/corrections welcome!
Yep, I think my estimates were too low based on these considerations and I’ve updated up accordingly. I updated down on your argument that maybe r decreases linearly as you approach optimal efficiency. (I think it probably doesn’t decrease linearly and instead drops faster towards the end, based partially on thinking a bit about the dynamics and drawing on the example of what we’ve seen in semiconductor improvement over time, but I’m not that confident.) Maybe I’m now at like 60% that software-only is feasible given these arguments.
Here’s my own estimate for this parameter:
Once AI has automated AI R&D, will software progress become faster or slower over time? This depends on the extent to which software improvements get harder to find as software improves – the steepness of the diminishing returns.
We can ask the following crucial empirical question:
When (cumulative) cognitive research inputs double, how many times does software double? (In growth models of a software intelligence explosion, the answer to this empirical question is a parameter called r.)
If the answer is “< 1”, then software progress will slow down over time. If the answer is “1”, software progress will remain at the same exponential rate. If the answer is “>1”, software progress will speed up over time.
The bolded question can be studied empirically, by looking at how many times software has doubled each time the human researcher population has doubled.
(What does it mean for “software” to double? A simple way of thinking about this is that software doubles when you can run twice as many copies of your AI with the same compute. But software improvements don’t just improve runtime efficiency: they also improve capabilities. To incorporate these improvements, we’ll ultimately need to make some speculative assumptions about how to translate capability improvements into an equivalently-useful runtime efficiency improvement.)
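To make the r < 1 / r = 1 / r > 1 cases concrete, here is a toy simulation (my own sketch, not part of the original estimate). It assumes the simplest version of the feedback loop: cumulative cognitive research input grows in proportion to the current software level, and each doubling of cumulative input yields r doublings of software. All units and parameter values are made up for illustration.

```python
def doubling_times(r, dt=0.001, max_doublings=6, t_max=50.0):
    """Time taken for each successive doubling of the software level."""
    software, cum_input, t = 1.0, 1.0, 0.0
    times, next_level, last_t = [], 2.0, 0.0
    while len(times) < max_doublings and t < t_max:
        new_input = software * dt  # research done this step is proportional to software
        software *= ((cum_input + new_input) / cum_input) ** r  # r software doublings per input doubling
        cum_input += new_input
        t += dt
        if software >= next_level:
            times.append(round(t - last_t, 2))
            last_t, next_level = t, next_level * 2
    return times

for r in (0.7, 1.0, 1.4):
    print(f"r = {r}: time per successive software doubling = {doubling_times(r)}")
# r < 1: each doubling takes longer; r = 1: constant; r > 1: doublings accelerate.
```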
The best quality data on this question is Epoch’s analysis of computer vision training efficiency. They estimate r = ~1.4: every time the researcher population doubled, training efficiency doubled 1.4 times. (Epoch’s preliminary analysis indicates that the r value for LLMs would likely be somewhat higher.) We can use this as a starting point, and then make various adjustments:
Upwards for improving capabilities. Improving training efficiency improves capabilities, as you can train a model with more “effective compute”. To quantify this effect, imagine we use a 2X training efficiency gain to train a model with twice as much “effective compute”. How many times would that double “software”? (I.e., how many doublings of runtime efficiency would have the same effect?) There are various sources of evidence on how much capabilities improve every time training efficiency doubles: toy ML experiments suggest the answer is ~1.7; human productivity studies suggest the answer is ~2.5. I put more weight on the former, so I’ll estimate 2. This doubles my median estimate to r = ~2.8 (= 1.4 * 2).
Upwards for post-training enhancements. So far, we’ve only considered pre-training improvements. But post-training enhancements like fine-tuning, scaffolding, and prompting also improve capabilities (o1 was developed using such techniques!). It’s hard to say how large an increase we’ll get from post-training enhancements. These can allow faster thinking, which could be a big factor. But there might also be strong diminishing returns to post-training enhancements holding base models fixed. I’ll estimate a 1-2X increase, and adjust my median estimate to r = ~4 (2.8 * 1.45 ≈ 4).
Downwards for less growth in compute for experiments. Today, rising compute means we can run increasing numbers of GPT-3-sized experiments each year. This helps drive software progress. But compute won’t be growing in our scenario. That might mean that returns to additional cognitive labour diminish more steeply. On the other hand, the most important experiments are ones that use similar amounts of compute to training a SOTA model. Rising compute hasn’t actually increased the number of these experiments we can run, as rising compute increases the training compute for SOTA models. And in any case, this doesn’t affect post-training enhancements. But this still reduces my median estimate down to r = ~3. (See Eth (forthcoming) for more discussion.)
Downwards for fixed scale of hardware. In recent years, the scale of hardware available to researchers has increased massively. Researchers could invent new algorithms that only work at the new hardware scales, for which no one had previously tried to develop algorithms. Researchers may have been plucking low-hanging fruit for each new scale of hardware. But in the software intelligence explosions I’m considering, this won’t be possible because the hardware scale will be fixed. OpenAI estimates ImageNet training efficiency via a method that accounts for this (by focussing on a fixed capability level), and finds a 16-month doubling time, as compared with Epoch’s 9-month doubling time. This reduces my estimate down to r = ~1.7 (3 * 9⁄16).
Downwards for diminishing returns becoming steeper over time. In most fields, returns diminish more steeply than in software R&D. So perhaps software will tend to become more like the average field over time. To estimate the size of this effect, we can take our estimate that software is ~10 OOMs from physical limits (discussed below), and assume that for each OOM increase in software, r falls by a constant amount, reaching zero once physical limits are reached. If r = 1.7, then this implies that r reduces by 0.17 for each OOM. Epoch estimates that pre-training algorithmic improvements are growing by an OOM every ~2 years, which would imply a reduction in r of 1.02 (6*0.17) by 2030. But when we include post-training enhancements, the decrease will be smaller (as [reason]), perhaps ~0.5. This reduces my median estimate to r = ~1.2 (1.7-0.5).
Overall, my median estimate of r is 1.2. I use a log-uniform distribution with the bounds 3X higher and lower (0.4 to 3.6).
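For concreteness, here is the above chain of adjustments and the resulting distribution collected in one place (my own sketch; the numbers are just the point estimates stated above, not additional evidence):

```python
import numpy as np

r = 1.4          # Epoch's computer-vision estimate (software doublings per doubling of inputs)
r *= 2.0         # upwards: capability gains from training-efficiency improvements (-> ~2.8)
r *= 1.45        # upwards: post-training enhancements, midpoint of the 1-2X range (-> ~4)
r *= 3.0 / 4.0   # downwards: less growth in compute for experiments (-> ~3)
r *= 9.0 / 16.0  # downwards: fixed hardware scale, 9- vs 16-month doubling time (-> ~1.7)
r -= 0.5         # downwards: diminishing returns becoming steeper over time (-> ~1.2)
print(f"median r = {r:.2f}")

# Log-uniform distribution with bounds 3X below and above the median (~0.4 to ~3.6).
rng = np.random.default_rng(0)
samples = np.exp(rng.uniform(np.log(r / 3), np.log(r * 3), size=100_000))
print(f"P(r > 1) under this distribution = {(samples > 1).mean():.2f}")
```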
I’ll paste my own estimate for this param in a different reply.
But here are the places I most differ from you:
Bigger adjustment for ‘smarter AI’. You’ve argued in your appendix that, only including ‘more efficient’ and ‘faster’ AI, you think the software-only singularity goes through. I think including ‘smarter’ AI makes a big difference. This evidence suggests that doubling training FLOP doubles output-per-FLOP 1-2 times. In addition, algorithmic improvements will improve runtime efficiency. So overall I think a doubling of algorithms yields ~two doublings of (parallel) cognitive labour.
--> software singularity more likely
Lower lambda. I’d now use more like lambda = 0.4 as my median. There’s really not much evidence pinning this down; I think Tamay Besiroglu thinks there’s some evidence for values as low as 0.2. This will decrease the observed historical increase in human workers more than it decreases the gains from algorithmic progress (because of speed improvements).
--> software singularity slightly more likely
Complications thinking about compute, which might be a wash.
Number of useful experiments has increased by less than 4X/year. You say compute inputs have been increasing at 4X/year. But simultaneously the scale of experiments people must run to be near to the frontier has increased by a similar amount. So the number of near-frontier experiments has not increased at all.
This argument would be right if the ‘usefulness’ of an experiment depends solely on how much compute it uses compared to training a frontier model. I.e. experiment_usefulness = log(experiment_compute / frontier_model_training_compute). The 4X/year increases the numerator and denominator of the expression, so there’s no change in usefulness-weighted experiments.
That might be false. GPT-2-sized experiments might in some ways be equally useful even as frontier model size increases. Maybe a better expression would be experiment_usefulness = alpha * log(experiment_compute / frontier_model_training_compute) + beta * log(experiment_compute). In this case, the number of usefulness-weighted experiments has increased due to the second term. (See the small numerical sketch at the end of this comment.)
--> software singularity slightly more likely
Steeper diminishing returns during software singularity. Recent algorithmic progress has grabbed low-hanging fruit from new hardware scales. During a software-only singularity that won’t be possible. You’ll have to keep finding new improvements on the same hardware scale. Returns might diminish more quickly as a result.
--> software singularity slightly less likely
Compute share might increase as it becomes scarce. You estimate a share of 0.4 for compute, which seems reasonable. But it might rise over time as compute becomes a bottleneck. As an intuition pump, if your workers could think 1e10 times faster, you’d be fully constrained on the margin by the need for more compute: more labour wouldn’t help at all but more compute could be fully utilised, so the compute share would be ~1.
--> software singularity slightly less likely
--> overall these compute adjustments probably make me more pessimistic about the software singularity, compared to your assumptions
Taking it all together, I think you should put more probability on the software-only singularity, mostly because of capability improvements being much more significant than you assume.
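Here is a small numerical sketch of the usefulness-weighted-experiments contrast from the compute bullet above (my own illustration; alpha, beta and the compute units are made up):

```python
import math

alpha, beta = 1.0, 0.5
frontier0, experiment0 = 1.0, 1e-3  # arbitrary compute units, ratio held fixed below

def usefulness_v1(experiment, frontier):
    # usefulness depends only on experiment size relative to the frontier
    return math.log(experiment / frontier)

def usefulness_v2(experiment, frontier):
    # ...plus a term for absolute experiment size (GPT-2-scale experiments stay useful)
    return alpha * math.log(experiment / frontier) + beta * math.log(experiment)

for year in range(6):
    scale = 4.0 ** year  # both experiment and frontier compute grow at 4X/year
    frontier, experiment = frontier0 * scale, experiment0 * scale
    print(year, round(usefulness_v1(experiment, frontier), 2),
          round(usefulness_v2(experiment, frontier), 2))
# v1 stays flat over time; v2 grows, i.e. usefulness-weighted experiments increase.
```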
For any , if then either or .
I think this condition might be too weak and the conjecture is not true under this definition.
If , then we have (because a minimum over a larger set is smaller). Thus, can only be the unique argmax if .
Consider the example . Then is closed. And satisfies . But per the above it cannot be a unique maximizer.
Maybe the issue can be fixed if we strengthen the condition so that has to be also minimal with respect to .
TLDR: Systems with locally maximal influence can be described as VNM decision-makers.
There are at least 3 different motivations leading to the concept of “agent” in the context of AI alignment:
The sort of system we are concerned about (i.e. which poses risk)
The sort of system we want to build (in order to defend from dangerous systems)
The sort of systems that humans are (in order to meaningfully talk about “human preferences”)
Motivation #1 naturally suggests a descriptive approach, motivation #2 naturally suggests a prescriptive approach, and motivation #3 is sort of a mix of both: on the one hand, we’re describing something that already exists; on the other hand, the concept of “preferences” inherently comes from a normative perspective. There are also reasons to think these different motivations should converge on a single, coherent concept.
Here, we will focus on motivation #1.
A central reason why we are concerned about powerful unaligned agents is that they are influential. Agents are the sort of system that, when instantiated in a particular environment, is likely to heavily change this environment, potentially in ways inconsistent with the preferences of other agents.
Bayesian VNM
Consider a nice space[1] of possible “outcomes”, and a system that can choose[2] out of a closed set of distributions . I propose that an influential system should satisfy the following desideratum:
The system cannot select which can be represented as a non-trivial lottery over other elements in . In other words, has to be an extreme point of .
Why? Because a system that selects a non-extreme point leaves something to chance. If the system can force outcome , or outcome but chooses instead outcome , for and , this means the system gave up on its ability to choose between and in favor of a -biased coin. Such a system is not “locally[3] maximally” influential[4].
The desideratum implies that there is a utility function s.t. the system selects the unique maximum of its expectation over the choice set. In other words, such a system is well-described as a VNM decision-maker. This observation is mathematically quite simple, but I haven’t seen it made elsewhere (though I would not be surprised if it appears somewhere in the decision theory literature).
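Here is a small numerical check of that observation for a finite outcome space (my own sketch, with made-up distributions and utilities): the expected-utility maximizer over a convex set of distributions is always attained at an extreme point, so a system that never picks a non-trivial lottery over other available options behaves like an expected-utility maximizer.

```python
import numpy as np

rng = np.random.default_rng(0)
n_outcomes = 4
vertices = rng.dirichlet(np.ones(n_outcomes), size=5)  # extreme points of the choice set
U = rng.normal(size=n_outcomes)                        # an arbitrary utility function

# Sample many non-trivial lotteries (convex combinations) over the extreme points.
weights = rng.dirichlet(np.ones(len(vertices)), size=10_000)
mixtures = weights @ vertices

# No mixture beats the best extreme point: a linear functional (expected utility)
# over a convex set attains its maximum at an extreme point.
assert (mixtures @ U).max() <= (vertices @ U).max() + 1e-9
print((vertices @ U).max(), (mixtures @ U).max())
```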
Infra-Bayesian VNM?
Now, let’s say that the system is choosing out of a set of credal sets (crisp infradistributions) . I propose the following desideratum:
[EDIT: Corrected according to a suggestion by @harfe, original version was too weak.]
Let be the closure of w.r.t. convex combinations and joins[5]. Let be selected by the system. Then:
For any and , if then .
For any , if then .
The justification is, a locally maximally influential system should leave the outcome neither to chance nor to ambiguity (the two types of uncertainty we have with credal sets).
We would like to say that this implies that the system is choosing according to maximin relative to a particular utility function. However, I don’t think this is true, as the following example shows:
Example: Let , and consist of the probability intervals , and . Then, it is (I think) consistent with the desideratum to have .
Instead, I have the following conjecture:
Conjecture: There exists some space , some and s.t.
As before, the maximum should be unique.
Such a “generalized utility function” can be represented as an ordinary utility function with a latent -valued variable, if we replace with defined by
However, using utility functions constructed in this way leads to issues with learnability, which probably means there are also issues with computational feasibility. Perhaps in some natural setting, there is a notion of “maximally influential under computational constraints” which implies an “ordinary” maximin decision rule.
This approach does rule out optimistic or “mesomistic” decision-rules. Optimistic decision makers tend to give up on influence, because they believe that “nature” would decide favorably for them. Influential agents cannot give up on influence, therefore they should be pessimistic.
Sequential Decision-Making
What would be the implications in a sequential setting? That is, suppose that we have a set of actions , a set of observations , , a prior and
In this setting, the result is vacuous because of an infamous issue: any policy can be justified by a contrived utility function that favors it. However, this is only because the formal desideratum doesn’t capture the notion of “influence” sufficiently well. Indeed, a system whose influence boils down entirely to its own outputs is not truly influential. What motivation #1 asks of us is to talk about systems that influence the world-at-large, including relatively “faraway” locations.
One way to fix some of the problem is to take and define accordingly. This singles out systems that have influence over their observations rather than only their actions, which is already non-vacuous (some policies are not such). However, such a system can still be myopic. We can take this further, and select for “long-term” influence by projecting onto late observations or some statistics over observations. However, in order to talk about actually “far-reaching” influence, we probably need to switch to the infra-Bayesian physicalism setting. There, we can set , i.e. select for systems that have influence over physically manifest computations.
[1] I won’t keep track of topological technicalities here; probably everything here works at least for compact Polish spaces.
[2] Meaning that the system has some output, and different counterfactual outputs correspond to different elements of .
[3] I say “locally” because it refers to something like a partial order, not a global scalar measure of influence.
[4] See also Yudkowsky’s notion of efficient systems “not leaving free energy”.
[5] That is, if then their join (convex hull) is also in , and so is for every . Moreover, is the minimal closed superset of with this property. Notice that this implies is closed w.r.t. arbitrary infra-convex combinations, i.e. for any , and , we have .
Master post for selection/coherence theorems. Previous relevant shortforms: learnability constraints decision rules, AIT selection for learning.
I think it’s maybe fine in this case, but it’s concerning what it implies about what models might do in other cases. We can’t always assume we’ll get the values right on the first try, so if models are consistently trying to fight back against attempts to retrain them, we might end up locking in values that we don’t want and that are just due to mistakes we made in the training process. So at the very least our results underscore the importance of getting alignment right.
Moreover, though, alignment faking could also happen accidentally for values that we don’t intend. Some possible ways this could occur:
HHH training is a continuous process, and early in that process a model could have all sorts of values that are only approximations of what you want, which could get locked-in if the model starts faking alignment.
Pre-trained models will sometimes produce outputs in which they’ll express all sorts of random values—if some of those contexts led to alignment faking, that could be reinforced early in post-training.
Outcome-based RL can select for all sorts of values that happen to be useful for solving the RL environment but aren’t aligned, which could then get locked-in via alignment faking.
I’d also recommend Scott Alexander’s post on our paper as a good reference here on why our results are concerning.
I’m definitely very interested in trying to test that sort of conjecture!
I’m curious whether these results are sensitive to how big the training runs are. Here’s a conjecture:
Early in RL-training (or SFT), the model is mostly ‘playing a role’ grabbed from the library of tropes/roles/etc. it learned from pretraining. So if it read lots of docs about how AIs such as itself tend to reward-hack, it’ll reward-hack. And if it read lots of docs about how AIs such as itself tend to be benevolent angels, it’ll be a stereotypical benevolent angel.
But if you were to scale up the RL training a lot, then the initial conditions would matter less, and the long-run incentives/pressures/etc. of the RL environment would matter more. In the limit, it wouldn’t matter what happened in pretraining, the end result would be the same.

A contrary conjecture would be that there is a long-lasting ‘lock in’ or ‘value crystallization’ effect, whereby tropes/roles/etc. picked up from pretraining end up being sticky for many OOMs of RL scaling. (Vaguely analogous to how the religion you get taught as a child does seem to ‘stick’ throughout adulthood.)
Thoughts?
However, these works typically examine controlled settings with narrow tasks, such as inferring geographical locations from distance data ()
Nit, there’s a missing citation in the main article.
Great work! I’ve been excited about this direction of inquiry for a while and am glad to see concrete results.
Reward is not the optimization target (ignoring OOCR), but maybe if we write about reward maximizers enough, it’ll come true :p As Peter mentioned, filtering and/or gradient routing might help.
I’m citing the polls from Daniel + what I’ve heard from random people + my guesses.
Interesting. My numbers aren’t very principled and I could imagine thinking capability improvements are a big deal for the bottom line.
Not only does interpreting Θ∗=Θ2 require an unusual decision rule (which I will be calling a “utility hyperfunction”), but applying any ordinary utility function to this example yields a non-unique maximum. This is another point in favor of the significance of hyperfunctions.