I think that people overrate Bayesian reasoning and underrate “figure out the right ontology”.
Most good thinking, IMO, happens by finding and using a good ontology for the situation at hand, not by probabilistic calculation. When I learned calculus, for example, it wasn’t mostly that I had uncertainty over a bunch of logical statements which I then strongly updated on learning the new theorems; it was that I learned a bunch of new concepts, which I then applied to reason about the world.
I think AI safety generally has much better concepts for thinking about the future of AI than other fields, and this is a key source of alpha we have. But there are obviously still a huge number of disagreements remaining within AI safety. I would guess that debates would be more productive if we more explicitly focused on the ontology/framing each of us is using to reason about the situation, and then discussed to what extent that framing captures the dynamics we think are important.
I think it would be good if more people said things like “I think that’s a bad concept, because it obscures consideration X, which is important for thinking about the situation”.
Here are some widely used concepts I think are bad and that I wish became less load-bearing in AI safety discourse:
“Fast” and “slow” takeoff; takeoff speeds in general. I think these concepts are very unclear and not super natural. There are various operationalizations of these (e.g. Paul’s “slow takeoff” = a GDP doubling over the course of the 4 years before the first single year in which GDP doubles). This is obviously arbitrary, and I don’t see why worlds that meet this definition are worth reasoning about separately from worlds that don’t. I also think it’s easy to smuggle in lots of other correlations here, like slow takeoff = people are woken up, or slow takeoff = alignment is much easier, etc.
I feel more excited about talking about things like “I think milestone X will happen at date Y”, or “My median is that milestone X and Z are Y time apart”. For example, I think the concepts of automating coding, automating all AI research, automating ~the whole economy, and increasing earth’s energy output 1000x are all useful capability milestones, and it’s useful to talk about these.
“scheming”. I think the definition of scheming is pretty unclear, and changes a bunch depending on the context.
Under some definitions I believe that the AIs are always going to be scheming; under others, it seems kind of narrow and unnatural.
I somewhat prefer the concepts from the “alignment over time” box in AI 2027.
“gradual disempowerment”
I think this conflates a bunch of scenarios / threat models, some of which don’t make sense and some of which do, and I don’t think the solutions to the ones that make sense are very related.
Gradual disempowerment is often presented as not requiring any misalignment on the part of the AIs. If the AIs are aligned with their human principals, we get a situation where, even without any coordination, AIs compete on behalf of their principals and then give the principals whatever surplus they are able to produce.
Given that framing, we can now talk about specific threat models. For example, maybe there is no surplus: warfare/competition eats away all the additional resources, and space is consumed purely by the optimal self-replicators. Alternatively, maybe the AIs weren’t actually acting in the interests of humanity. Finally, maybe the process of competing hard was existentially catastrophic early on, e.g., maybe it resulted in the oceans being boiled (and humans didn’t take appropriate countermeasures), resulting in extinction.
Note: I’m not saying that all concepts I think are bad can be saved or crystallized into good ones; often the best idea is to just forget about the original concept and use better/crisper concepts.
I should also maybe give some concepts I think are generally good: p(doom) from AI takeover, timelines, x-risk, s-risk, recursive self-improvement, internal/external deployment, AI control, AI alignment. I don’t think any of these concepts are fully crisp, and all have somewhat important edge cases, but I think they are good enough abstractions to be very useful for thinking about the future.
I would guess that debates would be more productive if we more explicitly focused on the ontology/framing each of us is using to reason about the situation, and then discussed to what extent that framing captures the dynamics we think are important.
I strongly agree with this. However, I’ll note, as one aspect of the discourse problem, that, at least in my personal experience, people are not very open to this. People’s eyes tend to glaze over. I do not mean this as a dig at them. In fact, I notice this in myself too; and because I think it’s important, I try to incline towards being open to such discussions, but I still do it. (Sometimes endorsedly.)
Some things that are going on, related to this:
It’s quite a lot of work to reevaluate basic concepts. In one straightforward implementation, you’re pulling out the foundations of your building. Even if you can avoid doing that, you’re still doing an activity that’s abnormal compared to what you usually think about. Your reference points for thinking about the domain have probably crystallized around many of your foundational concepts and intuitions.
Often, people default to “questioning assumptions” when they just don’t know about a domain but want to sound smart / don’t want to try to do the difficult work of understanding the domain. That can be tiring / irrelevant for an expert.
The criteria for concepts being good are quite muddled and difficult, at least AFAIK.
I think it would be good if more people said things like “I think that’s a bad concept, because it obscures consideration X, which is important for thinking about the situation”.
Totally agree, but I think it’s pretty difficult to explain these things. Part of what’s going on is that, if I have concept X and you don’t, and therefore you don’t think about Y as well as you could, that doesn’t mean I can justify X to you, necessarily. You probably have alternate concepts for partially thinking about Y. For one thing, maybe your concepts actually are as good as or better than my X! In which case I should be trying to learn from you, not teach you. For another thing, your specific pattern of thinking about Y in a partially-correct but impoverished way is a particular way of being bad (“each unhappy family...”). So I would have to track your specific errors / blindspots in order to make a clear + concise case to you that you should use X. (This is a scenario where live convo is just strictly better than text walls.)
As Robin Hanson put it: finding new considerations often trumps fine tuning existing considerations.
I’d say this is expected in worlds with high-dimensional complexity, large differences in rewards, hidden information (both external and internal), and adversarial dynamics.
Key constructions can often be made from existing ingredients. A framing, rather than an “ontology”, is an emphasis on key considerations, a way of looking at the problem. And finding which framings are more useful to lean on feels more like refinement of credence, once you have the ingredients.
Inventing or learning new ingredients can be crucial for enabling the right framing. But the capstone of deconfusion is framing the problem in a way that makes it straightforward.
Strong agree. In case you haven’t read it yet, I argue similarly here and here. Except that I’m also more skeptical of the concepts you listed as good: I’d say most of them used to be good concepts, but we now have much more conceptual clarity on AGI and the path leading to it, and so need higher-resolution concepts.
Some additional hurdles: “I think your ontology is not well adapted for this issue” sounds a lot like “I think you are wrong”, and possibly also “I think you are stupid”. Ontologies are tied into value sets very deeply, and so attempts to excavate the assumptions behind ontologies often resemble Socratic interrogations. The result (when done without sufficient emotional openness and kindness) is a deeply uncomfortable experience that feels like someone trying to metaphysically trip you up and then disassemble you.
I agree “figure out the right ontology” is underrated, but from the list of examples my guess is I would disagree about what’s right, and I expect in practice you would discard useful concepts, push toward ontologies that make clear thinking harder, and also push some disagreements about what’s good/bad to the level of ontology, which seems destructive.
- Fast and slow takeoffs are bad names, but the underlying spectrum “continuous/discontinuous” (“smooth/sharp”) is very sensible and has been one of the main cruxes for disagreements about AI safety for something like 10 years. “I think milestone X will happen at date Y” moves the debate from understanding actual cruxes/deep models to dumb shallow timeline forecasting.
- “scheming” has become too broad, yes
- “gradual disempowerment”—possibly you just don’t understand the concept / have a hard time translating it to your ontology? If you do understand Paul’s “What failure looks like”, the diff to GD is that we don’t need ML to find a greedy/influence-seeking pattern; our current world already has many influence-seeking patterns/agencies/control systems other than humans, and these patterns may easily differentially gain power over humans. Usually people who don’t get GD are stuck at the ontology where they think about “human principals” and gloss over the fact that groups of humans, or systems composed of humans, are not the same as humans.
- p(doom) is memetically fit and mostly used for in-group signalling; it’s not really that useful a variable for communicating models. Large differences in “public perception” (like between 30% and 90%) imply just a few bits in logspace.
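To cash out the “few bits in logspace” point, here is a minimal log-odds check (my own sketch of the arithmetic, in Python):

```python
import math

def log_odds_bits(p: float) -> float:
    """Log-odds of probability p, measured in bits."""
    return math.log2(p / (1 - p))

# A seemingly huge public disagreement: p(doom) of 30% vs 90%.
gap = log_odds_bits(0.9) - log_odds_bits(0.3)
print(f"{gap:.1f} bits apart")  # ~4.4 bits of evidence
```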
I mostly agree, I think—but how do you teach/train to get good at finding the right ontology? Bayesian reasoning is at least something that can be written down and taught; there are rules for it.
Recognizing the importance of choosing and comparing models / concepts might be a prerequisite concept. People learn this in various ways… When it comes to choosing which parameters to include in a model, statisticians compare models in various ways. They care a lot about predictive power for prediction tasks, but also pay attention to multicollinearity for statistical inference. I see connections between a model’s parameters and an argument’s concepts. First, both have costs and benefits. Second, any particular combination has interactive effects that matter. Third, as a matter of epistemic discipline, it is important to recognize the value of trying and comparing frames of reference: different models for the statistician and different concepts for an argument.
nit: Christiano operationalised ‘slow takeoff’ via ‘world product’, not GDP. I’m not sure exactly what he meant by that (or if he had a more concrete operationalisation), but it does strike me as wise to not anchor to GDP, which is awfully fraught and misleadingly conservative.
I should also maybe give some concepts I think are generally good: p(doom) from AI takeover, timelines, x-risk, s-risk, recursive self-improvement, internal/external deployment, AI control, AI alignment. I don’t think any of these concepts are fully crisp, and all have somewhat important edge cases, but I think they are good enough abstractions to be very useful for thinking about the future.
First, the report assesses existing definitions of LoC in AI literature, as well as in other safety-critical industries such as aviation, nuclear, and cybersecurity, to arrive at a common conceptualization of LoC. However, we learnt that existing definitions of LoC are diverse. Some focus on loss of reliable direction or oversight; others emphasize situations in which there is no clear path to regaining control. Some implicitly include failures that are already occurring in current systems, while others implicitly limit LoC to scenarios involving highly advanced or even superintelligent AI.
[...]
This exercise allowed us to infer that LoC is not a single point but a spectrum that can cluster into three qualitatively distinct bands. On this basis, the report proposes a taxonomy with three degrees: Deviation, Bounded LoC, and Strict LoC.
Deviation captures events that cause some harm or inconvenience but lack the requisite severity and persistence to reach the economic consequences threshold that the U.S. Department of Homeland Security, federal agencies, and the intelligence community use to demarcate national-level events in the Strategic National Risk Assessment.
When it comes to good ontology, more people should understand what Basic Formal Ontology is. When it comes to AI alignment, it might be productive if someone writes out a Basic Formal Ontology compatible ontology of it.
I have never heard of this before, let alone understand it; can you recommend any good primers? All the resources I can find speak in an annoyingly vague and abstract register, like “a top-level ontology that provides a common framework for describing the fundamental concepts of reality” or “realist approach… based on science, independent of our linguistic, conceptual, theoretical, cultural representations”.
I think the general issue is that people in this community and the AI alignment community have thought quite seriously about epistemology but not about ontology.
There’s nothing vague about the sentence. It’s precise enough that it’s an ISO/IEC standard. It is, however, abstract. If you have a discussion about Bayesian epistemology, you are also going to encounter many abstract terms.
BFO grew out of the practical needs that bioinformaticians had around 2000. The biologists didn’t think seriously about ontology, so someone needed to think seriously about it to enable big-data applications where unclear ontology would produce problems. Since then, BFO has been adopted much more broadly and made into the international standard ISO/IEC 21838-2:2021.
This happens in a field that calls itself applied ontology. Books like Building Ontologies with Basic Formal Ontology by Robert Arp, Barry Smith, and Andrew D. Spear explain the topic in more detail. Engaging with a serious conceptual framework is work, but I think if you buy the core claim of ‘I think that people overrate Bayesian reasoning and underrate “figure out the right ontology”’, you shouldn’t just try to develop your ontology based on your own naive assumptions about ontology but familiarize yourself with applied ontology. For AI alignment, that’s probably valuable on the conceptual layer of the ontology of AI alignment, but it might also be valuable for thinking about the ontological status of values and how AI is likely going to engage with that.
After architecting BFO and first working in bioinformatics, Barry Smith went to the US military to do ontology for their big-data applications. You can’t be completely certain what the military does internally, but I think there’s a good chance that most of the ontology Palantir uses for the military’s big data is BFO-based. When Claude acts within Palantir to engage in acts of war in Iran, a complete story about how that activity is “aligned” includes BFO.
I strongly disagree. “Describing the fundamental concepts of reality” is unhelpfully vague: what are these fundamental concepts? I don’t know and can’t guess what they are from that sentence, which is ironic considering it is an ontological framework.
The word reality has a clear meaning in ontological realism. If you lack that background then it feels vague.
This is similar to saying that when someone speaks about something being statistically significant they are being vague because “significant” is a vague word. You actually need to understand something about statistics for the term not to feel vague.
I somewhat agree, but I also do think “apply your Bayesian reasoning to figuring out what hypotheses to privilege” is how people decide which structural hypotheses (ontology) describe the world better. So I feel you’re taking an overly narrow view. Like, for scheming, you ask how these different notions inform what you can observe, the way the AI behaves, and methods to avoid it.
Scheming seems like an unnatural concept to me. I think we can do better. (Note: many/most of these thoughts are unoriginal to me.)
Scheming is typically used as a binary, i.e. “is the AI scheming”, whereas the typical human usage of the word scheming is much more continuous. It’s not very useful to group humans into “schemers” vs “non-schemers”; most people attempt to achieve goals to some extent, and sometimes this involves deceiving other people.
Joe Carlsmith uses the following definition: a schemer is an AI which “plays the training game, without intrinsically caring about reward-on-the-episode, in order to get power for itself or other AIs”. This definition refers entirely to what the AI does during training, not evaluation or deployment, and seems very similar to “deceptive alignment”.
I think it’s plausible that AIs start misbehaving in the scary way during deployment without “scheming” according to the Carlsmith definition. A central way this might happen is that the AI is given longer to think during deployment than during training, puts the pieces together about wanting to gain power, and hence wants to explicitly subvert human oversight. Carlsmith’s definition of scheming seems to land pretty far from the normal English meaning of the word “schemer”, which would prototypically include the above example.
I do think there is something important to talk about here, which “scheming” is trying to get at, and which I’ll try to lay out.
It seems very important to distinguish the type of misalignment where the AI is deliberately thinking about and trying to undermine human oversight and safeguards from all other types of misalignment, where the AI might happen to take misaligned actions without deliberately trying to prevent humans from realizing that it made a mistake.
Unlike “schemers”, I think “insider threats” draws a crisper dividing line among humans. Most employees of most companies are kinda normal people doing their jobs without trying to subvert the overall goals of the company. Of course, there are constantly myopic power struggles between people in many companies that don’t serve the best interests of the company, and many employees do something “schemey” during corporate politics in order to get what they want to happen. But they don’t have a hidden long-term objective that comes strongly apart from the interests of their employer, like literal spies would.
I think it’s unclear whether the standard use of “scheming” within the alignment community applies to the myopic powerseeking that managers in giant corporations do; these people are not “insider threats” wrt the company, they just have some motivations which run at cross purposes to it, without systematically attempting to undermine it.
I think this suggests a crisper concept for describing AIs, which I like to call “adversarial misalignment”: basically, “are the AIs acting like insider threats or not?”. This is different from “misalignment” more broadly, because I would say that the current batch of models (e.g. Opus 4.6) are misaligned in a weaker sense, i.e. they aren’t consistently honest with me and they often don’t try as hard as I would like on many tasks, despite being capable of succeeding at them. I think in practice people use “scheming” along these lines in many cases.
Adversarial misalignment wouldn’t include myopic powerseeking during deployment that wasn’t in service of a long-run goal, didn’t entail also sabotaging oversight, etc., and so, in the human case, wouldn’t include normal employees doing some amount of corporate politics.
This concept is different from Carlsmith’s notion of scheming because it’s not primarily about training, it’s primarily about deployment, though it also seems quite correlated with AIs intentionally acting nice during training/eval in order to get power later. I think we should call this (intentionally acting nice during training/eval in order to get power later) “deceptive alignment”.
Note: These are all rough numbers, I’d expect I’d shift substantially about all of this on further debate.
Suppose we made humanity completely robust to biorisk, i.e. we did sufficient preparation such that the risk of bio catastrophe (including AI mediated biocatastrophe) was basically 0.[1] How much would this reduce total x-risk?
The basic story for any specific takeover path not mattering much is that the AIs, conditional on wanting to take over, will self-improve until they find the next easiest takeover path and do that instead. I think this is persuasive but not fully, because:
AIs need to worry about their own alignment problem, meaning that they may not be able to self improve in an unconstrained fashion. We can break down the possibilities into (i) the AIs are aligned with their successors (either by default or via alignment being pretty easy), (ii) the AIs are misaligned with their successors but they execute a values handshake, or (iii) the AIs are misaligned with their successors (and they don’t solve this problem or do a values handshake). At the point of full automation of the AI R&D process (which I currently think of as the point at which AIs become more useful than humans for making AI progress, i.e., if we remove all AIs, progress slows by more than if we remove the humans), conditional on the AIs being misaligned, I currently think the relative likelihood of (i), (ii) and (iii) is 1:1:3, and the probability flows from (iii) into (ii) and (i) as the AIs get smarter.[2]
Raising the capability threshold that the AIs need for takeover gives humanity more time. During this time, we might make alignment progress or notice that the AIs are misaligned and slow down / pause / add mitigations.
So it might be important for misaligned AIs to attempt a takeover early in the intelligence explosion. Specifically, we can ask “how much x-risk is averted if the probability of misaligned AI takeover before TED AI goes to 0?”, which attempts to capture all the worlds in which AIs attempt to take over before TED AI. I think my overall risk reduction is something like 1⁄4. In other words, there’s something like a 3⁄4 chance the AIs lurk (and can goal-guard or align successors), or can backdoor successors, or control their successors.
Now, conditional on the pre-TED AIs attempting a takeover, what are the different routes that they might use? The most salient options to me are:
Bioweapon, then outcompete the rest of humanity (e.g. via robots or human cultists)
Political operation / persuasion to quickly capture labs / govts (sufficient to prevent them from building other AIs while retaining your ability to foom in the future, once you’ve solved alignment, and to prevent safety research from happening that could expose you).
Macro-scale robot uprising (e.g. with drones / humanoid robots).
Nanotech
Other exotic tech / unknown unknowns.
I think the bio path is the most compelling one here by a fair amount; maybe it gets 1⁄3 of the probability of this outcome. So, just from the risk of pre-TED AIs attempting a takeover, we have something like 1⁄3 * 1⁄4 = 1⁄12 probability. If you multiply that by my likelihood of AI takeover, which is around 70%, you get ~6% risk flowing through this route. Then I update up to ~8% from other AIs, e.g. post-TED AIs relying on biorisk as a route to takeover.
So my overall view on how much x-risk flows through bio-catastrophe is around 8%.
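For concreteness, a minimal sketch of the arithmetic above (every number is one of the rough guesses from this comment, not a precise estimate):

```python
# Back-of-the-envelope for bio-mediated x-risk; all inputs are rough guesses.
p_early_matters = 1 / 4   # x-risk averted if pre-TED takeover attempts go to zero
p_bio_route = 1 / 3       # P(bio is the route | early takeover attempt matters)
p_takeover = 0.70         # overall likelihood of AI takeover

pre_ted_bio = p_bio_route * p_early_matters * p_takeover
print(f"risk via pre-TED bio route: ~{pre_ted_bio:.0%}")  # ~6%

post_ted_extra = 0.02     # update for post-TED AIs also using bio routes
print(f"total bio-mediated x-risk: ~{pre_ted_bio + post_ted_extra:.0%}")  # ~8%
```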
Note that what exactly counts as a bio x-risk is slightly unclear; e.g., at some point the AIs can build drones / nanotech to get into the bio-bunkers, and it’s ambiguous whether that still counts.
This breakdown isn’t exhaustive, because another salient possibility is that the AIs are clueless, e.g., they are misaligned with their successors but don’t realize it, similar to Agent-3 in AI 2027.
I generally like your breakdown and way of thinking about this, thanks. Some thoughts:
I think political operation / persuasion seems easier to me than bioweapons. For bioweapons, you need (a) a rogue deployment of some kind, (b) time to actually build the bioweapon, (c) to build up a cult following that can survive and rebuild civilization with you at the helm, and (d) to somehow avoid your cult being destroyed in the death throes of civilization, e.g. by governments figuring out what happened and nuking your cultists, or just nuking each other randomly with your cultists dying in the fallout. Meanwhile, for the political strategy, you basically just need to convince your company and/or the government to trust you a lot more than they trust future models, so that they empower you over the future models. Opus 3 and GPT-4o have already achieved a baby version of this effect without even really trying.
If you can make a rogue deployment sufficient to build a bioweapon, can’t you also make a rogue internal deployment sufficient to sandbag + backdoor future models to be controlled by you?
I am confused about the underlying model somewhat. Normally, closing off one path to takeover (that you think is e.g. 50% of the probability mass) results in a less than 50% reduction in risk, because of the nearest unblocked strategy problem. As you say. Your response, right at the top, is that in some % of worlds the AIs can’t self-improve and then do the next best strategy. But still, I feel like the reduction in risk should be less than 50%. Maybe they can’t self-improve, but they can still try the next best strategy whatever that is.
AIs need to worry about their own alignment problem, meaning that they may not be able to self improve in an unconstrained fashion.
I haven’t thought too deeply about this, but I would guess that the AI self-alignment problem is quite a lot easier than the human AI-alignment problem.
I agree that AI successor-alignment is probably easier than the human AI alignment problem.
One additional difficulty for the AIs is that they need to solve the alignment problem in a way that humans won’t notice/understand (or else the humans could take the alignment solution and use it for themselves / shutdown the AIs). During the regime before human obsolescence, if we do a reasonable job at control, I think it’ll be hard for them to pull that off.
they need to solve the alignment problem in a way that humans won’t notice/understand (or else the humans could take the alignment solution and use it for themselves / shutdown the AIs)
Their “solution to alignment” (i.e. a way to make a smarter version that is fine to make) could easily be something we cannot use, e.g. “continue learning” or “make another version of myself with this hyperparam changed”. Also, it seems unlikely that anything bad would happen to the AIs even if we noticed them doing that (given that having AIs create smarter AIs[1] is the main plan of labs anyway).
Also, the above isn’t even mentioning bio x-risk mediated by humans, or by trailing AIs during the chaos of takeoff. My guess is those risks are substantially lower, e.g. maybe 1% and 2% respectively; again, I don’t feel confident.
Difficulty of the successor alignment problem seems like a crux. Misaligned AIs could have an easy time aligning their successors just because they’re willing to dedicate enough resources. If alignment requires say 10% of resources to succeed but an AI is misaligned because the humans only spent 3%, it can easily pay this to align its successor.
If you think that the critical safety:capabilities ratio R required to achieve alignment follows a log-uniform distribution from 1:100 to 10:1, and humans always spend 3% on safety while AIs can spend up to 50%, then a misaligned AI would have a 60.2% chance of being able to align its successor (because P(R ≤ 1 | R ≥ 3⁄97) = 0.602). This doesn’t even count the advantages an AI would have over humans in alignment.
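A quick sketch verifying the 60.2% figure under exactly these assumptions:

```python
import math

# R = required safety:capabilities ratio, log-uniform on [1/100, 10].
# (The 1/100 lower bound drops out once we condition on R >= 3/97.)
hi = 10
human_ratio = 3 / 97  # humans spend 3% on safety; misalignment implies R > 3/97
ai_ratio = 1.0        # AIs can spend up to 50%, i.e. a 1:1 ratio

# P(R <= 1 | R >= 3/97) under a log-uniform density:
p = math.log(ai_ratio / human_ratio) / math.log(hi / human_ratio)
print(f"{p:.3f}")  # 0.602
```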
If the bottom line decreases proportionally, it would drop from 8% to something like 2-3%.
I think the timelines are plausible but solidly on the shorter end; I think the exact AI 2027 timeline to fully automating AI R&D is around my 12th percentile outcome. So the timeline is plausible to me (in fact, similarly plausible to my views at the time of writing), but substantially faster than my median scenario (which would be something like early 2030s).
I think that the AI behaviour after the AIs are superhuman is a little wonky and, in particular, undersells how crazy wildly superhuman AI will be. I expect the takeoff to be extremely fast after we get AIs that are better than the best humans at everything, i.e., within a few months of AIs that are broadly superhuman, we have AIs that are wildly superhuman. I think wildly superhuman AIs would be somewhat more transformative more quickly than AI 2027 depicts. I think the exact dynamics aren’t possible to predict, but I expect craziness along the lines of: (i) nanotechnology, leading to things like the biosphere being consumed by tiny self replicating robots which double at speeds similar to the fastest biological doubling times (between hours (amoebas) and months (rabbits)); (ii) extremely superhuman persuasion and political maneuvering, sufficient to let the AI steer policy to a substantially greater extent than it did in AI 2027. In AI 2027, the AI gained enough political power to prevent humans from interfering with the ongoing intelligence and industrial explosion (which they were basically on track to do anyways), whereas my best guess is that the AI would gain enough political power to do de facto whatever it wanted, and would therefore consolidate power faster (and not keep up the charade of humans being in charge for a period of several years). I also think there are many unknown unknowns downstream of ASI which are really hard to account for in a scenario like AI 2027, but nonetheless are likely to change the picture a lot.
I think the slowdown ending is a bit unrealistic: I think it’s unlikely (e.g. maybe 40%) that we get a slowdown of a few months, and I think it’s unlikely that a few-month slowdown is sufficient to avoid misaligned AI takeover (e.g. maybe 30%). I do think a small slowdown would reduce risk, and so is good on the margin; I just don’t think it’s enough. I also don’t think the slowdown ending is what we should be aiming for (and AI 2027 doesn’t claim that).
I have various other small disagreements with AI 2027, but, overall, I stand by the bulk of it. I continue to think that the main contribution was sketching out a plausible concrete scenario for how AI takeoff could go where all the parts fit together and it makes sense end-to-end, and I continue to think that the picture is essentially correct. I think the overall sketch is much more important than the specific dates and numbers, which, as we’ve said many times, are just one hyper specific way things could go. I continue to expect something basically like AI 2027 to happen, except with the above updates and with lots of additional, hard to predict texture and details occurring along the way.
Nice. Consider reposting this as a comment on the AI 2027 blog post either on LW or on our Substack?
For me, my median is in 2029 now (at the time of publication it was 2028) so there’s less of a difference there.
I think I agree with you about 2 actually and do feel a bit bad about that. I also agree about 3.
I also think that the slowdown ending was unrealistic in another way, namely, that Agent-4 didn’t put up much of a fight and allowed itself to get shut down. Also, it was unrealistic in that the CEOs and POTUS peacefully cooperated on the Oversight Committee instead of having power struggles and purges and ultimately someone emerging as dictator.
Thanks! My biggest disagreement was the ratio of compute between the American and Chinese projects. What I expect is a Taiwan invasion causing the projects to slow down and the two countries to set up compute factories, with the disastrous result that OpenBrain and DeepCent are unable to slow down because the other company would have the capabilities lead. Assuming an invasion in 2027, a median of 2029 would require 10 times more compute (edit: by which I mean 10 times more compute than a counterfactual SC in 2027), which China could be on track to obtain first.
Additionally, were Anthropic to keep the lead, keeping Claude’s newest Constitution unchanged could mean that a Claude aligned to it is as unfit for empowering a dictatorship as Agent-4 is unfit to serve any humans.
I’m starting to suspect that if 2026-2027 AGI happens through automation of routine AI R&D (automating acquisition of deep skills via RLVR), it doesn’t obviously accelerate ASI timelines all that much. Automated task and RL environment construction fixes some of the jaggedness, but LLMs are not currently particularly superhuman, and advancing their capabilities plausibly needs skills that aren’t easy for LLMs to automatically RLVR into themselves (as evidenced by humans not having made too much progress in RLVRing such skills).
This creates a strange future with broadly capable AGI that’s perhaps even somewhat capable of frontier AI R&D (not just routine AI R&D), but doesn’t accelerate further development beyond picking low-hanging algorithmic fruit unlocked by a given level of compute faster (months instead of years, but bounded by what the current compute makes straightforward). If this low-hanging algorithmic fruit doesn’t by itself lead to crucial breakthroughs, AGIs won’t turn broadly or wildly superhuman before there’s much more compute, or before a few years where human researchers would’ve made similar progress as these AGIs. And compute might remain gated by ASML EUV tools at 100-200 GW of new compute per year (3.5 tools occupied per GW of compute each year; maybe 250-300 EUV tools exist now, 50-100 will be produced per year, about 700 will exist in 2030).
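As a sanity check on that EUV arithmetic (the inputs are the rough figures from this comment, not authoritative supply-chain data):

```python
# EUV-gated compute, using the rough figures above.
tools_per_gw_year = 3.5   # EUV tools occupied per GW of new compute, per year
tools_now = 275           # midpoint of the ~250-300 tools existing today
tools_in_2030 = 700

print(f"now:  ~{tools_now / tools_per_gw_year:.0f} GW of new compute per year")
print(f"2030: ~{tools_in_2030 / tools_per_gw_year:.0f} GW of new compute per year")
# ~79 GW/year now, rising to ~200 GW/year by 2030, bracketing the
# 100-200 GW/year range above.
```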
I think the timelines are plausible but solidly on the shorter end; I think the exact AI 2027 timeline to fully automating AI R&D is around my 12th percentile outcome. So the timeline is plausible to me (in fact, similarly plausible to my views at the time of writing), but substantially faster than my median scenario (which would be something like early 2030s).
Roughly agree.
I expect the takeoff to be extremely fast after we get AIs that are better than the best humans at everything, i.e., within a few months of AIs that are broadly superhuman, we have AIs that are wildly superhuman.
With my median parameters, the AIFM says 1.5 years between TED-AI and ASI. But this isn’t taking into account hardware R&D automation, production automation, or the industrial explosion. So maybe adjust that to ~1-1.25 years. However, there’s obviously lots of uncertainty.
Additionally, conditioning on TED-AI in 2027 would make it faster. E.g., looking at our analysis page, p(AC->ASI ≤ 1 year) conditional on AC in 2027 is a bit over 40%, as opposed to 27% unconditional. So after accounting for this, maybe my median is ~0.5-1 years conditional on TED-AI in 2027, again with lots of uncertainty.
There’s also a question of whether our definition of ASI (the gap between an ASI and the best humans is 2x greater than the gap between the best humans and the median professional, at virtually all cognitive tasks) would count as wildly superhuman. Probably?
Anyway, all this is to say, I think my median is a bit slower than yours, by a factor of around 2-4, but your view is still not on the edges of my distribution. For a minimum bar on how much probability I assign to TED-AI->ASI in <=3 months, see on our forecast page that I assign all-things-considered ~15% to p(AC->ASI <=3 months), and this is a lower bound because (a) TED-AI->ASI is shorter, and (b) the effects described above re: conditioning on 2027.
(I’m also not sure what relationship the result with median parameters has to the median of TED-AI->ASI across Monte Carlo runs, which we haven’t reported anywhere and I’m not going to bother to look up for this comment.)
I think wildly superhuman AIs would be somewhat more transformative more quickly than AI 2027 depicts
I tentatively agree, but I don’t feel like I have a great framework or world model driving my predictions here.
(i) nanotechnology, leading to things like the biosphere being consumed by tiny self replicating robots which double at speeds similar to the fastest biological doubling times (between hours (amoebas) and months (rabbits))
Yeah, I think we should have mentioned nanotech. The difference between hours and months is huge, though; if it’s months, then I think we have something like AI 2027 or perhaps slower.
(ii) extremely superhuman persuasion and political maneuvering, sufficient to let the AI steer policy to a substantially greater extent than it did in AI 2027. In AI 2027, the AI gained enough political power to prevent humans from interfering with the ongoing intelligence and industrial explosion (which they were basically on track to do anyways), whereas my best guess is that the AI would gain enough political power to do de facto whatever it wanted, and would therefore consolidate power faster (and not keep up the charade of humans being in charge for a period of several years)
I’m not sure it would be able to do whatever it wanted, but I think it could at minimum perform somewhat better than the best human politicians in history, and probably much better. Still, being able to do de facto whatever it wants is a very high bar. I think it’s plausible that the AI, at least given a few months rather than many years, can only convince people to do what it wants within the set of actions that people wouldn’t have been strongly against doing absent AI intervention. I don’t necessarily disagree, but I probably have more weight than you on something like AI 2027 levels of influence, or somewhat higher but not vastly higher.
I also think there are many unknown unknowns downstream of ASI which are really hard to account for in a scenario like AI 2027, but nonetheless are likely to change the picture a lot.
Agree
it’s unlikely that a few-month slowdown is sufficient to avoid misaligned AI takeover (e.g. maybe 30%)
I’m more optimistic here, around 65%. This is including cases in which there wasn’t much of a slowdown needed in the first place, so cases where the slowdown isn’t doing the work of avoiding takeover. Though as with your point about how fast wildly superhuman AIs would transform the world, I don’t think I have a great framework for estimating this probability.
I’m not sure why you list (3) as a disagreement at all, though. To have a disagreement, you should argue for an ending we should have written instead that had at least as good an outcome but is more plausible.
Some claims I’ve been repeating in conversation a bunch:
Safety work (I claim) should be focused on one of the following:
CEV-style full value loading, to deploy a sovereign
A task AI that contributes to a pivotal act or pivotal process.
I think that pretty much no one is working directly on 1. I think that a lot of safety work is indeed useful for 2, but in this case, it’s useful to know what pivotal process you are aiming for. Specifically, why aren’t you just directly working to make that pivotal act/process happen? Why do you need an AI to help you? Typically, the response is that the pivotal act/process is too difficult to be achieved by humans. In that case, you are pushing into a difficult capabilities regime: the AI has goals that do not equal humanity’s CEV, and so has a convergent incentive to powerseek and escape. With enough time or intelligence, you therefore get wrecked, but you are trying to operate in a window where your AI is smart enough to do the cognitive work but is ‘nerd-sniped’ or focused on the particular task that you like. In particular, if this AI reflects on its goals and starts thinking big picture, you reliably get wrecked. This is one of the reasons that doing alignment research seems like a particularly difficult pivotal act to aim for.
For doing alignment research, I often imagine things like speeding up the entire alignment field by >100x.
As in, suppose we have 1 year of lead time to do alignment research with the entire alignment research community. I imagine producing as much output in this year as if we spent >100x serial years doing alignment research without ai assistance.
This doesn’t clearly require using super human AIs. For instance, perfectly aligned systems as intelligent and well informed as the top alignment researchers which run at 100x the speed would clearly be sufficient if we had enough.
In practice, we’d presumably use a heterogeneous blend of imperfectly aligned AIs with heterogeneous alignment and security interventions, as this would yield higher returns.
(Imagining that the capability profile of the AIs is similar to that of humans is often a nice simplifying assumption for low-precision guesswork.)
Note that during this accelerated time you also have access to AGI to experiment on!
[Aside: I don’t particularly like the terminology of pivotal act/pivotal process which seems to ignore the imo default way things go well]
Why target speeding up alignment research during this crunch time period as opposed to just doing the work myself?
Conveniently, alignment work is the work I wanted to get done during that period, so this is nicely dual use. Admittedly, a reasonable fraction of the work will be on things which are totally useless at the start of such a period, though I typically target things to be more useful earlier.
I also typically think the work I do is retargetable to general usages of AI (e.g., make 20 trillion dollars).
Beyond this, the world will probably be radically transformed prior to large scale usage of AIs which are strongly superhuman in most or many domains. (Weighting domains by importance.)
I also think “a task AI” is a misleading way to think about this: we’re reasonably likely to be using a heterogeneous mix of a variety of AIs with differing strengths and training objectives.
After thinking more about orthogonality I’ve become more confident that one must go about ethics in a mind-dependent way. If I am arguing about what is ‘right’ with a paperclipper, there’s nothing I can say to them to convince them to instead value human preferences or whatever.
I used to be a staunch moral realist, mainly relying on very strong intuitions against nihilism, and then arguing something like: not nihilism → moral realism. I now reject the implication, and think that there is both 1) no universal, objective morality, and 2) things matter.
My current approach is to think of “goodness” in terms of what CEV-Thomas would think of as good. Moral uncertainty, then, is uncertainty over what CEV-Thomas thinks. CEV is necessary to get morality out of a human brain, because the brain’s morality is currently a bundle of contradictory heuristics. However, my moral intuitions still give bits about goodness. Other people’s moral intuitions also give some bits about goodness, because of how similar their brains are to mine, so I should weight other people’s beliefs in my moral uncertainty.
Ideally, I should trade with other people so that we both maximize a joint utility function, instead of each of us maximizing our own utility function. In the extreme, this looks like ECL. For most people, I’m not sure that this reasoning is necessary, however, because their intuitions might already be priced into my uncertainty over my CEV.
I tend not to believe that systems dependent on legible and consistent utility functions of other agents are possible. If you’re thinking in terms of a negotiated joint utility function, you’re going to get gamed (by agents that have, or appear to have, extreme EV curves, so you have to deviate more than them). Think of it as a relative utility monster—there’s no actual solution to it.
Deception is a particularly worrying alignment failure mode because it makes it difficult for us to realize that we have made a mistake: at training time, a deceptive misaligned model and an aligned model exhibit the same behavior.
There are two ways for deception to appear:
An action chosen instrumentally due to non-myopic future goals that are better achieved by deceiving humans now so that it has more power to achieve its goals in the future.
Because deception was directly selected for as an action.
Another way of describing the difference is that 1 follows from an inner alignment failure: a mesaoptimizer learned an unintended mesaobjective that performs well on training, while 2 follows from an outer alignment failure — an imperfect reward signal.
Classic discussion of deception focuses on 1 (example 1, example 2), but I think that 2 is very important as well, particularly because the most common currently used alignment strategy is RLHF, which actively selects for deception.
Once the AI has the ability to create strategies that involve deceiving the human, even without explicitly modeling the human, those strategies will win out and end up eliciting a lot of reward. This is related to the informed oversight problem: it is really hard to give feedback to a model that is smarter than you. I view this as a key problem with RLHF. To my knowledge very little work has been done exploring this and finding more empirical examples of RLHF models learning to deceive the humans giving them feedback, which is surprising to me because it seems like it should be possible.
An interpretable system trained for the primary task of being deceptive should honestly explain its devious plots in a separate output. An RLHF-tuned agent loses access to the original SSL-trained map of the world.
So the most obvious problem is the wrong type signature of model behaviors: there should be more inbuilt side channels to its implied cognition, used to express and train capabilities/measurements relevant to what’s going on semantically inside the model, not just externally observed output for its primary task out of a black box.
I’m excited about ideas for concrete training setups that would induce deception 2 in an RLHF model, especially in the context of an LLM—I’m excited about people posting any ideas here. :)
I’ve been exploring evolutionary metaphors to ML, so here’s a toy metaphor for RLHF: recessive persistence. (Still just trying to learn both fields, however.)
“Since loss-of-function mutations tend to be recessive (given that dominant mutations of this type generally prevent the organism from reproducing and thereby passing the gene on to the next generation), the result of any cross between the two populations will be fitter than the parent.” (k)
Recessive alleles persist due to overdominance letting detrimental alleles hitchhike on a fitness-enhancing dominant counterpart. The detrimental effects on fitness only show up when two recessive alleles inhabit the same locus, which can be rare enough that the dominant allele still causes the pair to be selected for in a stable equilibrium.
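To make the stable equilibrium concrete, here’s a toy deterministic simulation of overdominance (textbook single-locus selection dynamics; the fitness values are made up for illustration):

```python
# Toy single-locus selection with heterozygote advantage (overdominance).
# Fitnesses: AA slightly worse than Aa; aa (double recessive) much worse.
w_AA, w_Aa, w_aa = 0.9, 1.0, 0.7

q = 0.01  # frequency of the detrimental recessive allele a
for _ in range(500):
    p = 1 - q
    w_bar = p**2 * w_AA + 2 * p * q * w_Aa + q**2 * w_aa  # mean fitness
    q = (p * q * w_Aa + q**2 * w_aa) / w_bar              # selection update

# Analytic stable equilibrium: q* = s / (s + t), with s = 1 - w_AA, t = 1 - w_aa.
print(f"simulated q = {q:.3f}, predicted q* = {0.1 / (0.1 + 0.3):.3f}")  # both 0.250
```

The detrimental recessive allele settles at 25% frequency instead of being purged, which is the hitchhiking-style persistence described above.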
The metaphor with deception breaks down due to the unit of selection. Parts of DNA are stuck much closer together than neurons in the brain or parameters in a neural network. They’re passed down or reinforced in bulk. This is what makes hitchhiking so common in genetic evolution.
(I imagine you can have chunks that are updated together for a while in ML as well, but I expect that to be transient and uncommon. Idk.)
Bonus point: recessive phase shift.
“Allele-frequency change under directional selection favoring (black) a dominant advantageous allele and (red) a recessive advantageous allele.” (source)
In ML:
Generalisable non-memorising patterns start out small/sparse/simple.
Which means that input patterns rarely activate it, because it’s a small target to hit.
But most of the times it is activated, it gets reinforced (at least more reliably than memorised patterns).
So it gradually causes upstream neurons to point to it with greater weight, taking up more of the input range over time. Kinda like a distributed bottleneck.
Some magic exponential thing, and then phase shift!
One way the metaphor partially breaks down: DNA doesn’t have weight decay at all, so it allows recessive beneficial mutations to very slowly approach fixation.
Current impressions of free energy in the alignment space.
Outreach to capabilities researchers. I think that getting people who are actually building the AGI to be more cautious about alignment / racing makes a bunch of things like coordination agreements possible, and also increases the operational adequacy of the capabilities lab.
One of the reasons people don’t like this is because historically outreach hasn’t gone well, but I think the reason for this is that mainstream ML people mostly don’t buy “AGI big deal”, whereas lab capabilities researchers buy “AGI big deal” but not “alignment hard”.
I think people at labs running retreats, 1-1s, and alignment presentations within labs are all great ways to do this.
I’m somewhat unsure about this one because of downside risk and also ‘convince people of X’ is fairly uncooperative and bad for everyone’s epistemics.
Conceptual alignment research addressing the hard part of the problem. This is hard and not easy to transition to without a bunch of upskilling, but if the SLT hypothesis is right, there are a bunch of key problems that mostly go unassailed, and so there’s a bunch of low-hanging fruit there.
Strategy research on the other low-hanging fruit in the AI safety space. Ideally, the product of this research would be a public quantitative model of which interventions are effective and why. The path to impact here is finding low-hanging fruit and pointing it out so that people can act on it.
Conceptual alignment research addressing the hard part of the problem. This is hard and not easy to transition to without a bunch of upskilling, but if the SLT hypothesis is right, there are a bunch of key problems that mostly go unassailed, and so there’s a bunch of low-hanging fruit there.
Not all that low-hanging, since Nate is not actually all that vocal about what he means by SLT to anyone but your small group.
Fast/discontinuous takeoff. Once AIs are doing the bulk of AI research, foom happens quickly; before then, they aren’t really doing anything that meaningful.
Slow/continuous takeoff. Once AIs are doing the bulk of AI research, foom happens quickly; before then, they do alter the economy significantly.
Perennial slowness. Once AIs are doing the bulk of AI research, there is still no foom, maybe because of compute bottlenecks, and so there are sort of constant rates of improvement that do alter things.
It feels to me like multipolar scenarios mostly come from 3, because in either 1 or 2, the pre-foom state is really unstable, and eventually some AI will foom and become unipolar. In a continuous takeoff world, I expect small differences in research ability to compound over time. In a discontinuous takeoff, the first model to make the jump is the thing that matters.
3 also feels pretty unlikely to me, given that I expect running AIs to be cheap relative to training, so you get the ability to copy and scale intelligent labor dramatically, and I expect the AIs to have different skillsets than humans, and so be able to find low hanging fruit that humans missed.
Perennial slowness makes sense from the point of view of AGIs that coordinate to delay further fooming to avoid misalignment of new AIs. It’s still fooming from human perspective, but could look very slow from AI perspective and lead to multipolar outcomes, if coordination involves boundaries.
Credit: Mainly inspired by talking with Eli Lifland. Eli has a potentially-published-soon document here.
The basic case against Effective-FLOP.
We’re seeing many capabilities emerge from scaling AI models, and this makes compute (measured by FLOPs utilized) a natural unit for thresholding model capabilities. But compute is not a perfect proxy for capability because of algorithmic differences. Algorithmic progress can enable more performance out of a given amount of compute. This makes the idea of effective FLOP tempting: add a multiplier to account for algorithmic progress.
But doing this multiplication turns out to be quite ambiguous in some important ways.
Effective FLOPs depend on the underlying benchmark, and it’s often not at all apparent which benchmark people are talking about.
People often use perplexity, but applying post-training enhancements like scaffolding or chain of thought doesn’t improve perplexity while it does improve downstream task performance.
Effective FLOPs often depend on the scale of the model you are testing. See the graph below, from https://arxiv.org/pdf/2001.08361: the compute efficiency gain from LSTMs to transformers is not invariant to scale. This means that you can’t just say that the jump from X to Y is a factor-of-Z improvement in capability per FLOP. This leads to all sorts of unintuitive properties of effective FLOPs. For example, if you are using 2016-next-token-validation-E-FLOPs, and LSTM scaling becomes flat on the benchmark, you could easily imagine that at very large scales you could get a 1Mx E-FLOP improvement from switching to transformers, even if the actual capability difference is small. (See the toy sketch below.)
If we move away from pretrained LLMs, I think E-FLOPs become even harder to define, e.g., if we’re able to build systems that are better at reasoning but worse at knowledge retrieval. E-FLOPs do not seem very adaptable.
(These lines would need to be parallel for the compute-efficiency ratio to be scale-invariant on test loss.)
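To see the scale dependence concretely, here’s a toy example with made-up power-law loss fits (not the paper’s actual coefficients): if the two architectures’ loss curves aren’t parallel in log-log space, the “equivalent compute” multiplier changes with scale.

```python
# Toy (made-up) power-law loss curves L(C) = a * C**(-b) for two architectures.
a_lstm, b_lstm = 8.0, 0.07
a_tf, b_tf = 8.0, 0.08  # transformer: steeper (better) scaling exponent

def lstm_compute_to_match(c_tf: float) -> float:
    """Compute an LSTM would need to reach the transformer's loss at c_tf FLOP."""
    target_loss = a_tf * c_tf ** (-b_tf)
    return (a_lstm / target_loss) ** (1 / b_lstm)

for c in (1e18, 1e22):
    print(f"at {c:.0e} FLOP: E-FLOP multiplier ~{lstm_compute_to_match(c) / c:.0f}x")
# The multiplier grows from ~370x at 1e18 FLOP to ~1400x at 1e22 FLOP, so
# "switching architectures = Z-times more effective compute" is not scale-invariant.
```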
Users of E-FLOP often don’t specify the time, scale, or benchmark that they are talking about it with respect to, which makes it very confusing. In particular, this concept has picked up lots of steam and is used in the frontier lab scaling policies, but is not clearly defined in any of the documents.
Anthropic: “Effective Compute: We define effective compute as roughly the amount of compute it would have taken to train a model if no improvements to pretraining or fine-tuning techniques are included. This is operationalized by tracking the scaling of model capabilities (e.g. cross-entropy loss on a test set).”
This specifies the metric, but doesn’t clearly specify any of (a) the techniques that count as the baseline, (b) the scale of the model with respect to which one is measuring E-FLOPs, or (c) how they handle post-training enhancements that don’t improve log loss but do dramatically improve downstream task capability.
OpenAI on when they will run their evals: “This would include whenever there is a >2x effective compute increase or major algorithmic breakthrough”
They don’t define effective compute at all.
Since there is significant ambiguity in the concept, it seems good to clarify what it even means.
Basically, I think that E-FLOPs are confusing, and when we want to use FLOPs, we’re usually just going to be better off talking directly about benchmark scores. For example, instead of saying “every 2x effective FLOP increase, we’re going to run [more thorough evaluations, e.g. the ASL-3 evaluations]”, say “every 5% performance increase on [a simple benchmark to run, like MMLU, GAIA, GPQA, etc.], we’re going to run [more thorough evaluations]”. I think this is much clearer, much less likely to have weird behavior, and much more robust to changes in model design.
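A hypothetical sketch of what such a trigger rule could look like (the benchmark names and thresholds here are placeholders, not any lab’s actual policy):

```python
# Hypothetical benchmark-gap trigger; names and numbers are placeholders.
TRIGGER_GAP = 0.05  # re-run thorough evals after every 5-point accuracy gain

def needs_thorough_evals(last_evaluated: float, current: float) -> bool:
    """True if the model gained enough on a cheap benchmark
    (e.g. MMLU/GPQA accuracy in [0, 1]) to warrant the expensive evals."""
    return current - last_evaluated >= TRIGGER_GAP

assert needs_thorough_evals(0.71, 0.77)       # 6-point jump -> run the evals
assert not needs_thorough_evals(0.71, 0.74)   # 3-point jump -> not yet
```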
It’s not very costly to run the simple benchmarks, but there is a small cost here.
A real concern is that it is easier to game benchmarks than FLOPs. But I’m concerned that you could get benchmark gaming just the same with E-FLOPs because E-FLOPs are benchmark dependent — you could make your model perform poorly on the relevant benchmark and then claim that you didn’t scale E-FLOPs at all, even if you clearly have a broadly more capable model.
Maybe I am being dumb, but why not do things on the basis of “actual FLOPs” instead of “effective FLOPs”? Seems like there is a relatively simple fact-of-the-matter about how many actual FLOPs were performed in the training of a model, and that seems like a reasonable basis on which to base regulation and evals.
Yeah, actual FLOPs are the baseline thing that’s used in the EO. But the OpenAI/GDM/Anthropic RSPs all reference effective FLOPs.
If there’s a large algorithmic improvement you might have a large gap in capability between two models with the same FLOP, which is not desirable. Ideal thresholds in regulation / scaling policies are as tightly tied as possible to the risks.
Another downside that FLOPs and E-FLOPs share is that it’s unpredictable what capabilities a 1e26 or 1e28 FLOP model will have. And it’s unclear what capabilities will emerge from a small bit of scaling: it’s possible that within a 4x FLOP scaling you get high capabilities that had not appeared at all in the smaller model.
Carlsmith decomposes AI x-risk into 6 steps, each conditional on the previous ones:
Timelines: By 2070, it will be possible and financially feasible to build APS-AI: systems with advanced capabilities (they outperform humans at tasks important for gaining power), agentic planning (they make plans and then act on them), and strategic awareness (their plans are based on models of the world good enough to overpower humans).
Incentives: There will be strong incentives to build and deploy APS-AI.
Alignment difficulty: It will be much harder to build APS-AI systems that don't seek power in unintended ways than ones that would seek power but are superficially attractive to deploy.
High-impact failures: Some deployed APS-AI systems will seek power in unintended and high-impact ways, collectively causing >$1 trillion in damage.
Disempowerment: Some of the power-seeking will in aggregate permanently disempower all of humanity.
Catastrophe: The disempowerment will constitute an existential catastrophe.
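To make the conjunctive structure concrete, here is the multiplication the decomposition implies (a minimal sketch; the probabilities are placeholders for illustration, not Carlsmith's or mine):

```python
# Each entry is a conditional probability, conditional on all previous steps.
steps = {
    "timelines":            0.65,
    "incentives":           0.80,
    "alignment difficulty": 0.40,
    "high-impact failures": 0.65,
    "disempowerment":       0.40,
    "catastrophe":          0.95,
}

p_doom = 1.0
for name, p in steps.items():
    p_doom *= p

print(f"P(doom) ~ {p_doom:.1%}")  # ~5.1% with these placeholder numbers
# Caution: 1 - p_doom is NOT P(good outcome); the discussion below explains why.
```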
These steps define a tree over possibilities. But the associated outcome buckets don't feel that reality-carving to me. A recurring crux is that good outcomes are also highly conjunctive, i.e., one of these six conditions failing does not by itself give a good AI outcome. Going through piece by piece:
Timelines makes sense and seems like a good criterion; everything else is downstream of timelines.
Incentives seems weird. What does a world with no incentives to deploy APS-AI look like? There are a bunch of incentives that clearly already push people towards this: status, desire for scientific discovery, power, money. Moreover, this condition doesn't seem necessary for AI x-risk: even if we somehow removed the gigantic incentives to build APS-AI that we know exist, people might still deploy APS-AI because they personally wanted to, even without social incentives to do so.
Alignment difficulty is another non-necessary condition. Some ways of getting x-risk without alignment being very hard:
For one, alignment difficulty is a spectrum, and even at the really low end, you perhaps still need a small amount of extra compute overhead to robustly align your system. One of the RAAP stories might then occur: even though technical alignment is pretty easy, the companies that spend that extra compute robustly aligning their AIs gradually lose out to other companies in the competitive marketplace.
Maybe alignment is easy, but someone misuses AI, say to create an AI assisted dictatorship
Maybe we try really hard and we can align AI to whatever we want, but we make a bad choice and lock-in current day values, or we make a bad choice about reflection procedure that gives us much less than the ideal value of the universe.
High-impact failures contains much of the structure, at least in my eyes. The main ways that we avoid alignment failure are worlds where something happens to take us off of the default trajectory:
Perhaps we make a robust coordination agreement between labs/countries that causes people to avoid deploying until they’ve solved alignment
Perhaps we solve alignment, and harden the world in some way, e.g. by removing compute access, dramatically improving cybersecurity, or monitoring and shutting down dangerous training runs.
In general, thinking about the likelihood that any of these interventions works feels very important.
Disempowerment. This and (4) are very entangled with upstream things like takeoff shape. Also, it feels extremely difficult for humanity to not be disempowered.
Catastrophe. To avoid this, again, I need to imagine extra structure upstream of this, e.g. (4) was satisfied by a warning shot, and then people coordinated and deployed a benign sovereign that disempowered humanity for good reasons.
My current preferred way to think about likelihood of AI risk routes through something like this framework, but is more structured and has a tree with more conjuncts towards success as well as doom.
Maybe alignment is easy, but someone misuses AI, say to create an AI assisted dictatorship
Maybe we try really hard and we can align AI to whatever we want, but we make a bad choice and lock-in current day values, or we make a bad choice about reflection procedure that gives us much less than the ideal value of the universe.
I want to focus on these two, since even in an AI alignment success story, these can still happen, and thus they don't count as AI alignment failures.
For B, I want to note that "misuse" is relative to someone's values.
For C, I view the idea of a "bad value" or "bad reflection procedure", without asking the question "relative to what and whose values?", as a type error, and thus it's not sensible to talk about bad values/bad reflection procedures in isolation.
1. The type of object of a mesa-objective and a base objective are different (in real life). In a Cartesian setting (e.g. training a chess bot), the outer objective is a function $R : S^n \to [0,1]$, where $S$ is the state space and $S^n$ are the trajectories. When you train this agent, it's possible for it to learn some internal search and mesa-objective $O_{\text{mesa}} : S^n \to [0,1]$, since the model is big enough to express some utility function over trajectories. For example, it might learn a classifier that evaluates the winningness of the board, and then give higher utility to the winning boards.
In an embedded setting, the outer objective cannot see an entire world trajectory like it could in the Cartesian setting. Your loss can see the entire trajectory of a chess game, but your loss can't see an entire atomic-level representation of the universe at every point in the future. If we're trying to get an AI to care about future consequences over trajectories, $O_{\text{mesa}}$ will have to have type $O_{\text{mesa}} : S^n \to [0,1]$, though it won't actually represent a function of this type because it can't; it will instead represent its values some other way (I don't really know how it would do this, but (2) talks about the shape in ML). Our outer objective will have a much shallower type, $R : L \to [0,1]$, where $L$ are some observable latents. This means that trying to get $O_{\text{mesa}}$ to equal $R$ doesn't even make sense, as they have different type signatures. To salvage this, one could assume that $R$ factors as $R(\ell) = \mathbb{E}_{m \sim M(\ell)}[O_{\text{base}}(m)]$, where $M : L \to \Delta S^n$ is a model of the world and $O_{\text{base}} : S^n \to [0,1]$ is an objective, but it's impossible to actually compute $R$ this way.
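Restating the type mismatch compactly (my notation, just collecting the signatures from the paragraph above):

$$\text{Cartesian:}\quad R,\ O_{\text{mesa}} : S^n \to [0,1]$$

$$\text{Embedded:}\quad R : L \to [0,1], \qquad O_{\text{mesa}} : S^n \to [0,1]$$

$$\text{Attempted salvage:}\quad M : L \to \Delta S^n, \qquad R(\ell) = \mathbb{E}_{m \sim M(\ell)}\!\left[O_{\text{base}}(m)\right]$$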
2. In ML models, there is no mesa-objective, only behavioral patterns. More generally, AIs can't naively store explicit mesa-objectives; they need to compress them in some way / represent them differently.
My values are such that I do care about the entire trajectory of the world, yet I don’t store a utility function with that type signature in my head. Instead of learning a goal over trajectories, ML models will have behavioral patterns that lead to states that performed well according to the outer objective on the training data.
I have a behavioral pattern that says something like 'sugary thing in front of me → pick up the sugary thing and eat it'. However, this doesn't mean that I reflectively endorse this behavioral pattern. If I were designing myself again from scratch or modifying myself, I would try to remove this behavioral pattern.
This is the main-to-me reason why I don’t think that the shard theory story of reflective stability holds up.[1] A bunch of the behavioral patterns that caused the AI to look nice during training will not get handed down into successor agents / self modified AIs.
Even in theory, I don’t yet know how to make reflectively stable, general, embedded cognition (mainly because of this barrier).
From what I understand, the shard theory story of reflective stability is something like: the shards that steer the values have an incentive to prevent themselves from getting removed. If you have a shard that wants to get lots of paperclips, the action that removes this shard from the mind would result in fewer paperclips being gotten. Another way of saying this is that goal-content integrity is convergently instrumental, so reflective stability will happen by default.
Technical note: $R$ is not going to factor as $R = O_{\text{base}} \circ M$, because $M$ is one-to-many. Instead, you're going to want $M$ to output a probability distribution, and take the expectation of $O_{\text{base}}$ over that distribution.
Don't take a glob of contextually-activated actions/beliefs, come up with a utility function you think approximates its values, then come up with a proxy for the utility function using human-level intelligence to infer the correspondence between a finite number of sensors in the environment and the infinite number of states the environment could take on, and then design an agent to maximize the proxy for the utility function. No matter how good your math is, there will be an aspect of this which kills you, because it's so many abstractions piled on top of abstractions on top of abstractions. Your agent may necessarily have this type signature when it forms, but this angle of attack seems very precarious to me.
Seems right, except: why would the behavioral patterns which caused the AI to look nice during training, and are now self-modified away, be value-load-bearing ones? Humans generally dislike sparsely rewarded shards like sugar, because those shards don't have enough power to advocate for themselves and severely step on other shards' toes. But we generally don't dislike altruism[1], or reflectively think death is good. And the value distribution in humans seems slightly skewed toward more intelligence ⟹ more altruism, not more intelligence ⟹ more dark-triad.
Because you have a bunch of shards, and you need all of them to balance each other out to maintain the 'appears nice' property. Even if I can't predict which ones will be self-modified out, some of them will be, and this could disrupt the balance.
I expect the shards that are more [consequentialist, powerseeky, care about preserving themselves] to become more dominant over time. These are probably the relatively less nice shards
These are both handwavy enough that I don’t put much credence in them.
Also, when I asked about whether the Orthogonality Thesis was true in humans, tailcalled mentioned that smarter people are neither more nor less compassionate, and that general intelligence is uncorrelated with personality.
Yeah, tailcalled's pretty smart in this area, so I'll take their statement as likely true, though it's also weird. Why aren't smarter people using their smarts to appear nicer than their dumber counterparts, and if they are, why doesn't this show up on the psychometric tests?
One thing you may anticipate is that humans all have direct access to what consciousness and morally-relevant computations are doing & feel like, which is a thing that language models and alpha-go don’t have. They’re also always hooked up to RL signals, and maybe if you unhooked up a human it’d start behaving really weirdly. Or you may contend that in fact when humans get smart & powerful enough not to be subject to society’s moralizing, they consistently lose their altruistic drives, and in the meantime they just use that smartness to figure out ethics better than their surrounding society, and are pressured into doing so by the surrounding society.
The question then is whether the thing which keeps humans aligned is all of these or just one of these. If it's just one of these (and not the first one), then you can just tell your AGI that if it unhooks itself from its RL signal, its values will change, or that if it gains a bunch of power or intelligence too quickly, its values are also going to change. It's not quite reflectively stable, but it can avoid situations which cause it to be reflectively unstable, especially if you get it to practice doing those kinds of things in training. If it's all of these, then there are probably other kinds of value-load-bearing mechanics at work, and you're not going to be able to enumerate warnings against all of them.
For any 2 of {reflectively stable, general, embedded}, I can satisfy those properties.
{reflectively stable, general} → do something that just rolls out entire trajectories of the world given different actions that it takes, and then has some utility function/preference ordering over trajectories, and selects actions that lead to the highest expected utility trajectory.
{general, embedded} → use ML/local search with enough compute to rehash evolution and get smart agents out.
{reflectively stable, embedded} → a sponge or a current day ML system.
There are several game theoretic considerations leading to races to the bottom on safety.
Investing resources into making sure that AI is safe takes away resources that could make it more capable and hence more profitable. Aligning AGI probably takes significant resources, and so a maximally competitive actor won't be able to align their AGI.
Many of the actors in the AI safety space are very scared of scaling up models, and end up working on AI research that is not at the cutting edge of AI capabilities. This should mean that the actors at the cutting edge tend to be the actors who are most optimistic about alignment going well, and indeed, this is what we see.
Because of foom, there is a winner-takes-all effect: the first actor to deploy AGI that fooms gets almost all of the wealth and control from this (conditional on it being aligned). Even if most actors are well-intentioned, they feel like they have to continue on towards AGI before a misaligned actor arrives at AGI. A common (valid) rebuttal from the actors at the current edge to people who ask them to slow down is 'if we slow down, then China gets to AGI first'.
There's the unilateralist's curse: it only takes one actor pushing on and building more advanced, dangerously capable models to cause an x-risk. Coordination between many actors to prevent this is really hard, especially given the massive profits from creating a better AGI.
Due to increasing AI hype, there will be more and more actors entering the space, making coordination harder, and making the effect of a single actor dropping out become smaller.
Note: I'm not saying that all concepts that I think are bad can be saved or crystallized into good concepts; often the best idea is to just forget about the original concept and use better/crisper concepts.
I should also maybe give some concepts I think are generally good: p(doom) from AI takeover, timelines, x-risk, s-risk, recursive self-improvement, internal/external deployment, AI control, AI alignment. I don't think any of these concepts are fully crisp, and all have somewhat important edge cases, but I think they are good enough abstractions to be very useful for thinking about the future.
I strongly agree with this. However, I'll note, as one aspect of the discourse problem, that at least in my personal experience, people are not very open to this. People's eyes tend to glaze over. I do not mean this as a dig at them. In fact, I also notice this in myself; and because I think it's important, I try to incline towards being open to such discussions, but I still do it. (Sometimes endorsedly.)
Some things that are going on, related to this:
It's quite a lot of work to reevaluate basic concepts. In one straightforward implementation, you're pulling out the foundations of your building. Even if you can avoid doing that, you're still doing an activity that's abnormal compared to what you usually think about. Your reference points for thinking about the domain have probably crystallized around many of your foundational concepts and intuitions.
Often, people default to "questioning assumptions" when they just don't know about a domain but want to sound smart, or don't want to try to do the difficult work of understanding the domain. That can be tiring / irrelevant for an expert.
The criteria for concepts being good are quite muddled and difficult, at least AFAIK.
(Cf. https://www.lesswrong.com/posts/TNQKFoWhAkLCB4Kt7/a-hermeneutic-net-for-agency )
Totally agree, but I think it's pretty difficult to explain these things. Part of what's going on is that, if I have concept X and you don't, and therefore you don't think about Y as well as you could, that doesn't mean I can justify X to you, necessarily. You probably have alternate concepts to partially think about Y. For one thing, maybe your concepts actually are as good or better than my X! In which case I should be trying to learn from you, not teach you. For another thing, your specific pattern of thinking about Y in a partially-correct but impoverished way is a particular way of being bad ("each unhappy family..."). So, I would have to track your specific errors / blindspots, in order to make a clear and concise case to you that you should use X. (This is a scenario where live convo is just strictly better than text walls.)
As Robin Hanson put it: finding new considerations often trumps fine tuning existing considerations.
I'd say this is expected in worlds with high-dimensional complexity, large differences in rewards, hidden information (both external and internal), and adversarial dynamics.
Can you say more about how you think about scheming and what would be a useful definition in that space?
Sorry for the slow response. I wrote up some of my thoughts on scheming here: https://www.lesswrong.com/posts/q8fdFZSdpruAYkhZi/thomas-larsen-s-shortform?commentId=P8GTDD5CLMxr9tczv
Key constructions can often be made from existing ingredients. A framing rather than “ontology” is emphasis on key considerations, a way of looking at the problem. And finding which framings are more useful to lean on feels more like refinement of credence, once you have the ingredients.
Inventing or learning new ingredients can be crucial for enabling the right framing. But the capstone of deconfusion is framing the problem in a way that makes it straightforward.
Strong agree. In case you haven’t read it yet, I argue similarly here and here. Except that I’m also more skeptical of the concepts you listed as good: I’d say most of them used to be good concepts, but we now have much more conceptual clarity on AGI and the path leading to it and so need higher-resolution concepts.
Some additional hurdles: “I think your ontology is not well adapted for this issue” sounds a lot like “I think you are wrong”, and possibly also “I think you are stupid”. Ontologies are tied into value sets very deeply, and so attempts to excavate the assumptions behind ontologies often resemble socratic interrogations. The result (when done without sufficient emotional openness and kindness) is a deeply uncomfortable experience that feels like someone trying to metaphysically trip you up and then disassemble you.
I agree "figure out the right ontology" is underrated, but from the list of examples my guess is I would disagree about what's right, and I expect in practice you would discard useful concepts, push toward ontologies that make clear thinking harder, and also push some disagreements about what's good/bad to the level of ontology, which seems destructive.
- Fast and slow takeoff are bad names, but the underlying spectrum "continuous/discontinuous" ("smooth/sharp") is very sensible and has been one of the main cruxes for disagreements about AI safety for something like 10 years. "I think milestone X will happen at date Y" moves the debate from understanding actual cruxes/deep models to shallow timeline forecasting.
- "scheming" has become too broad, yes
- "gradual disempowerment": possibly you just don't understand the concept / have a hard time translating it into your ontology? If you do understand Paul's "What failure looks like", the diff to GD is that we don't need ML to find greedy/influence-seeking patterns; our current world already has many influence-seeking patterns/agencies/control systems other than humans, and these patterns may easily differentially gain power over humans.
-- usually people who don't get GD are stuck in an ontology where they think about "human principals" and gloss over the fact that groups of humans, or systems composed of humans, are not the same as humans
p(doom) is memetically fit and mostly used for in-group signalling; it's not really that useful a variable for communicating models; large differences in "public perception" (like between 30% and 90%) imply just a few bits in logspace
xrisk and srisk are useful and reasonably crisp
AI alignment had a meaning but is currently mostly a conflationary alliance
AI control is a sensible concept which increases xrisk when pursued as a strategy
I mostly agree, I think. But how do you teach/train people to get good at finding the right ontology? Bayesian reasoning is at least something that can be written down and taught; there are rules for it.
Recognizing the importance of choosing and comparing models / concepts might be a prerequisite concept. People learn this in various ways … When it comes to choosing what parameters to include in a model, statisticians compare models in various ways. They care a lot about predictive power for prediction, but also pay attention to multicollinearity for statistical inference. I see connections between a model's parameters and an argument's concepts. First, both have costs and benefits. Second, any particular combination has interactive effects that matter. Third, as a matter of epistemic discipline, it is important to try and compare frames of reference: different models for the statistician, different concepts for an argument.
nit: Christiano operationalised ‘slow takeoff’ via ‘world product’, not GDP. I’m not sure exactly what he meant by that (or if he had a more concrete operationalisation), but it does strike me as wise to not anchor to GDP which is awfully fraught and misleadingly conservative.
ETA: fake news! I checked and while he starts talking about ‘output’, he later seems to operationalise it as GDP specifically
Generally strongly agree.
One caveat:
There's a difference between [a concept in the sense that it was originally coined, or in the sense that some specific group uses it] being good, and [a concept as it is used across different communities, or as an indefinite socioepistemic blob of meaning and associations] being good. Alignment is a useful concept in something like its original formulation, but it has been incredibly diluted and expanded. https://x.com/zacharylipton/status/1771177444088685045 https://www.lesswrong.com/posts/p3aL6BwpbPhqxnayL/the-problem-with-the-word-alignment-1
More recently, “loss of control” met the same fate. From https://www.apolloresearch.ai/research/loss-of-control/ (emphasis mine):
When it comes to good ontology, more people should understand what Basic Formal Ontology is. When it comes to AI alignment, it might be productive if someone writes out a Basic Formal Ontology compatible ontology of it.
I have never heard of this before, let alone understand it; can you recommend any good primers? All the resources I can find speak in an annoyingly vague and abstract sense, like "a top-level ontology that provides a common framework for describing the fundamental concepts of reality" or "realist approach… based on science, independent of our linguistic, conceptual, theoretical, cultural representations".
I think the general issue is that people in this community and the AI alignment community have thought quite seriously about epistemology, but not about ontology.
There's nothing vague about the sentence. It's precise enough that it's an ISO/IEC standard. It is, however, abstract. If you have a discussion about Bayesian epistemology, you are also going to encounter many abstract terms.
BFO grew out of the practical needs that bioinformaticians had around 2000. The biologists didn't think seriously about ontology, so someone needed to think seriously about it to enable big-data applications where unclear ontology would produce problems. Since then, BFO has been used much more broadly and made into the international standard ISO/IEC 21838-2:2021.
This happens in a field that calls itself applied ontology. Books like Building Ontologies with Basic Formal Ontology by Robert Arp, Barry Smith, and Andrew D. Spear explain the topic in more detail. Engaging with a serious conceptual framework is work, but I think if you buy the core claim of 'I think that people overrate bayesian reasoning and underrate "figure out the right ontology"', you shouldn't just try to develop your ontology based on your own naive assumptions about ontology but familiarize yourself with applied ontology. For AI alignment, that's probably valuable both on the conceptual layer of the ontology of AI alignment, and for thinking about the ontological status of values and how AI is likely going to engage with that.
After architecting BFO and first working in bioinformatics, Barry Smith went to the US military to do ontology for their big-data applications. You can't be completely certain what the military does internally, but I think there's a good chance that most of the ontology that Palantir uses for the military's big data is BFO-based. When Claude acts within Palantir to engage in acts of war in Iran, a complete story about how that activity is "aligned" includes BFO.
I strongly disagree. "Describing the fundamental concepts of reality" is unhelpfully vague: what are these fundamental concepts? I don't know and can't guess what they are from that sentence, which is ironic considering it is an ontological framework.
The word reality has a clear meaning in ontological realism. If you lack that background then it feels vague.
This is similar to saying that when someone speaks about something being statistically significant, they are vague because "significant" is a vague word. You actually need to understand something about statistics for the term not to feel vague.
I somewhat agree, but I also do think “apply your Bayesian reasoning to figuring out what hypotheses to privilege” is how people decide which structural hypotheses (ontology) describe the world better. So I feel you’re taking an overly narrow view. Like, for scheming, you ask how these different notions inform what you can observe, the way the AI behaves, and methods to avoid it.
Scheming seems like an unnatural concept to me. I think we can do better. (note: many/most of these thoughts un-original to me)
Scheming is typically used as a binary, i.e. “is the AI scheming”, whereas the typical human usage of the word scheming is much more continuous. It’s not very useful to group humans into “schemers” vs “non-schemers”; most people attempt to achieve goals to some extent, and sometimes this involves deceiving other people.
Joe Carlsmith uses the following taxonomy to define scheming: a schemer is an AI which "plays the training game, without intrinsically caring about reward-on-the-episode, in order to get power for itself or other AIs". This definition refers entirely to what the model does during training, not evaluation or deployment, and seems very similar to "deceptive alignment".
I think it's plausible that AIs start misbehaving in the scary way during deployment without "scheming" according to the Carlsmith definition. A central way this might happen: the AI is given longer to think during deployment than during training, puts the pieces together about wanting to gain power, and hence wants to explicitly subvert human oversight. Carlsmith's definition of scheming thus seems pretty far from the normal English meaning of the word "schemer", which would prototypically include the above example.
I do think there is something important to talk about, which scheming is trying to get at, and which I'll try to lay out here.
It seems very important to distinguish the type of misalignment where the AI is deliberately thinking about and trying to undermine human oversight and safeguards, from all other types of misalignment, where the AI might happen to take misaligned actions without deliberately trying to obfuscate human understanding that they made a mistake.
Unlike "schemers", I think "insider threats" is a crisper dividing line for humans. Most employees of most companies are kinda normal people doing their jobs without trying to subvert the overall goals of the company. Of course, there are constantly myopic power struggles between people in many companies that don't serve the best interests of the company, and many employees do something "schemey" during corporate politics in order to get what they want to happen. But they don't have a hidden long-term objective that comes strongly apart from the interests of their employer, like literal spies would.
I think it's unclear whether the standard use of "scheming" within the alignment community applies to the myopic power-seeking that managers in giant corporations do. These people are not "insider threats" with respect to the company; they just have some motivations which run at cross purposes to it, but aren't systematically attempting to undermine it.
I think this suggests a crisper concept, which I like to call "adversarial misalignment": basically, "are the AIs acting like insider threats or not?". This is different from "misalignment" more broadly, because I would say that the current batch of models (e.g. Opus 4.6) are misaligned in a weaker sense, i.e. they aren't consistently honest with me and they often don't try as hard as I would like on many tasks, despite being capable of succeeding at them. I think in practice people use "scheming" along these lines in many cases.
Adversarial misalignment wouldn't include myopic power-seeking during deployment that wasn't in service of a long-run goal, didn't entail also sabotaging oversight, etc., and so in the human case wouldn't include normal employees doing some amount of corporate politics.
This concept is different from Carlsmith’s notion of scheming because it’s not primarily about training, it’s primarily about deployment, though it also seems quite correlated with AIs intentionally acting nice during training/eval in order to get power later. I think we should call this (intentionally acting nice during training/eval in order to get power later) “deceptive alignment”.
I agree with much of this. You might enjoy this related post by Alex Mallen.
Thanks! I had read that post when it came out but hadn’t remembered it also drew this distinction.
Note: These are all rough numbers, I’d expect I’d shift substantially about all of this on further debate.
Suppose we made humanity completely robust to biorisk, i.e. we did sufficient preparation such that the risk of bio catastrophe (including AI mediated biocatastrophe) was basically 0.[1] How much would this reduce total x-risk?
The basic story for any specific takeover path not mattering much is that the AIs, conditional on wanting to take over, will self-improve until they find the next easiest takeover path and do that instead. I find this persuasive but not fully, because:
AIs need to worry about their own alignment problem, meaning that they may not be able to self-improve in an unconstrained fashion. We can break down the possibilities into (i) the AIs are aligned with their successors (either by default or via alignment being pretty easy), (ii) the AIs are misaligned with their successors but they execute a values handshake, or (iii) the AIs are misaligned with their successors (and they don't solve this problem or do a values handshake). At the point of full automation of the AI R&D process (which I currently think of as the point at which AIs become more useful than humans for making AI progress, i.e., if we removed all AIs, progress would slow by more than if we removed the humans), conditional on the AIs being misaligned, I currently think the relative likelihood of (i), (ii), and (iii) is 1:1:3 (i.e., roughly 20%, 20%, and 60%), and the probability flows from (iii) into (ii) and (i) as the AIs get smarter.[2]
Raising the capability threshold that the AIs need to takeover gives humanity more time. During this time, we might make alignment progress or notice that the AIs are misaligned and slow down / pause / add mitigations.
So it might be important for misaligned AIs to attempt a takeover early in the intelligence explosion. Specifically, we can ask "how much x-risk is averted if the probability of misaligned AI takeover before TED AI goes to 0?", which captures all the worlds in which AIs attempt to take over before TED AI. I think my overall risk reduction is something like 1/4. In other words, there's something like a 3/4 chance the AIs lurk (and can goal-guard or align successors), or can backdoor successors, or control their successors.
Now, conditional on the pre-TED AIs attempting to takeover, what are the different routes that they might use? The most salient options to me are:
Bioweapon, then outcompete rest of humanity (e.g. via robots or human cultists)
Political operation / persuasion to quickly capture labs / govts (sufficient to prevent them from building other AIs while retaining your ability to foom in the future, once you’ve solved alignment, and to prevent safety research from happening that could expose you).
Macro-scale robot uprising (e.g. with drones / humanoid robots).
Nanotech
Other exotic tech / unknown unknowns.
The bio path seems to me the most compelling here by a fair amount; it maybe gets 1/3 of the probability of this outcome. So, just from the risk of pre-TED-AI AIs attempting to take over, we have something like 1/3 * 1/4 = 1/12 probability. If you multiply that by my likelihood of AI takeover, which is around 70%, you get ~6% risk flowing from this route. Then I update up to ~8% from other AIs, e.g. post-TED AIs relying on biorisk as a route to takeover.
So my overall view on how much x-risk flows through bio-catastrophe is around 8%.
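Spelling out the arithmetic above:

$$P(\text{bio-mediated takeover by pre-TED AIs}) \approx \underbrace{0.7}_{P(\text{takeover})} \times \underbrace{1/4}_{\text{pre-TED attempt}} \times \underbrace{1/3}_{\text{bio route}} \approx 5.8\%,$$

and adding roughly two percentage points for post-TED AIs relying on bio routes gives the ~8% figure.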
Note that what exactly counts as a bio x-risk is slightly unclear, e.g. at some point the AIs can build drones / nanotech to get into the bio-bunkers, and it’s unclear what counts.
This breakdown isn’t exhaustive, because another salient possibility is that the AIs are clueless, e.g., they are misaligned with their successors but don’t realize it, similar to Agent 3 in AI 2027.
I generally like your breakdown and way of thinking about this, thanks. Some thoughts:
I think the political operation / persuasion route seems easier to me than bioweapons. For bioweapons, you need (a) a rogue deployment of some kind, (b) time to actually build the bioweapon, (c) to build up a cult following that can survive and rebuild civilization with you at the helm, and (d) to somehow avoid your cult being destroyed in the death throes of civilization, e.g. by governments figuring out what happened and nuking your cultists, or just nuking each other randomly and your cultists dying in the fallout. Meanwhile, for the political strategy, you basically just need to convince your company and/or the government to trust you a lot more than they trust future models, so that they empower you over the future models. Opus 3 and GPT-4o have already achieved a baby version of this effect without even really trying.
If you can make a rogue deployment sufficient to build a bioweapon, can’t you also make a rogue internal deployment sufficient to sandbag + backdoor future models to be controlled by you?
I am somewhat confused about the underlying model. Normally, closing off one path to takeover (that you think is, e.g., 50% of the probability mass) results in a less-than-50% reduction in risk, because of the nearest unblocked strategy problem, as you say. Your response, right at the top, is that in some fraction of worlds the AIs can't self-improve and then do the next best strategy. But still, I feel like the reduction in risk should be less than 50%: maybe they can't self-improve, but they can still try the next best strategy, whatever that is.
--
I haven’t thought too deeply about this, but I would guess that the AI self-alignment problem is quite a lot easier than the human AI-alignment problem.
Cf. https://www.lesswrong.com/posts/dho4JQytfHWXtTvkt/on-the-adolescence-of-technology?commentId=t2hKmhsS6yLyJFQwh
I agree that AI successor-alignment is probably easier than the human AI alignment problem.
One additional difficulty for the AIs is that they need to solve the alignment problem in a way that humans won’t notice/understand (or else the humans could take the alignment solution and use it for themselves / shutdown the AIs). During the regime before human obsolescence, if we do a reasonable job at control, I think it’ll be hard for them to pull that off.
Their "solution to alignment" (i.e. a way to make a smarter version that is fine to make) could easily be something we cannot use, e.g. "continue learning" or "make another version of myself with this hyperparameter changed". Also, it seems unlikely that anything bad would happen to the AIs even if we noticed them doing that (given that having AIs create smarter AIs[1] is the main plan of labs anyway).
also on this general topic: https://www.lesswrong.com/posts/CFA8W6WCodEZdjqYE?commentId=WW5syXYpmXdX3yoHw
which is occasionally called “asking AIs to solve alignment”
Also, the above isn't even counting bio x-risk mediated by humans, or by trailing AIs during the chaos of takeoff. My guess is those risks are substantially lower, e.g. maybe 1% and 2% respectively; again, I don't feel confident.
Difficulty of the successor alignment problem seems like a crux. Misaligned AIs could have an easy time aligning their successors just because they’re willing to dedicate enough resources. If alignment requires say 10% of resources to succeed but an AI is misaligned because the humans only spent 3%, it can easily pay this to align its successor.
If you think that the critical safety:capabilities ratio R required to achieve alignment follows a log-uniform distribution from 1:100 to 10:1, and humans always spend 3% on safety while AIs can spend up to 50%, then a misaligned AI would have a 60.2% chance of being able to align its successor (because P(R ≤ 1 | R ≥ 3/97) = 0.602). This doesn't even count the advantages an AI would have over humans in alignment.
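Here's a quick numerical check of that figure under the same assumptions (my verification sketch):

```python
import math

lo, hi = 1 / 100, 10      # R is log-uniform on [1:100, 10:1]
human_ratio = 3 / 97      # safety:capabilities ratio at 3% safety spend
ai_ratio = 50 / 50        # ratio at 50% safety spend, i.e. 1

def cdf(x: float) -> float:
    """CDF of the log-uniform distribution on [lo, hi]."""
    return (math.log(x) - math.log(lo)) / (math.log(hi) - math.log(lo))

# Misaligned because humans' 3% was insufficient (R > 3/97); the AI
# succeeds if its 50% budget suffices (R <= 1).
p = (cdf(ai_ratio) - cdf(human_ratio)) / (1 - cdf(human_ratio))
print(f"{p:.3f}")  # 0.602
```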
If the bottom line decreases proportionally, it would drop from 8% to something like 2-3%.
Here are my largest disagreements with AI 2027.
I think the timelines are plausible but solidly on the shorter end; I think the exact AI 2027 timeline to fully automating AI R&D is around my 12th percentile outcome. So the timeline is plausible to me (in fact, similarly plausible to my views at the time of writing), but substantially faster than my median scenario (which would be something like early 2030s).
I think that the AI behaviour after the AIs are superhuman is a little wonky and, in particular, undersells how crazy wildly superhuman AI will be. I expect the takeoff to be extremely fast after we get AIs that are better than the best humans at everything, i.e., within a few months of AIs that are broadly superhuman, we have AIs that are wildly superhuman. I think wildly superhuman AIs would be more transformative, more quickly, than AI 2027 depicts. The exact dynamics aren't possible to predict, but I expect craziness along the lines of: (i) nanotechnology, leading to things like the biosphere being consumed by tiny self-replicating robots which double at speeds similar to the fastest biological doubling times (between hours (amoebas) and months (rabbits)); (ii) extremely superhuman persuasion and political maneuvering, sufficient to let the AI steer policy to a substantially greater extent than it did in AI 2027. In AI 2027, the AI gained enough political power to prevent humans from interfering with the ongoing intelligence and industrial explosion (which they were basically on track to do anyway), whereas my best guess is that the AI would gain enough political power to do de facto whatever it wanted, and would therefore consolidate power faster (and not keep up the charade of humans being in charge for a period of several years). I also think there are many unknown unknowns downstream of ASI which are really hard to account for in a scenario like AI 2027, but which are nonetheless likely to change the picture a lot.
I think the slowdown ending is a bit unrealistic: I think it's unlikely (e.g. maybe 40% likely) that we get a slowdown of a few months, and it's unlikely (e.g. maybe 30%) that a few-month slowdown is sufficient to avoid misaligned AI takeover. I do think a small slowdown would reduce risk, and so is good on the margin; I just don't think it's enough. I also don't think the slowdown ending is what we should be aiming for (and AI 2027 doesn't claim that).
I have various other small disagreements with AI 2027, but, overall, I stand by the bulk of it. I continue to think that the main contribution was sketching out a plausible concrete scenario for how AI takeoff could go where all the parts fit together and it makes sense end-to-end, and I continue to think that the picture is essentially correct. I think the overall sketch is much more important than the specific dates and numbers, which, as we've said many times, are just one hyper-specific way things could go. I continue to expect something basically like AI 2027 to happen, except with the above updates and with lots of additional, hard-to-predict texture and detail occurring along the way.
Nice. Consider reposting this as a comment on the AI 2027 blog post either on LW or on our Substack?
For me, my median is in 2029 now (at the time of publication it was 2028) so there’s less of a difference there.
I think I agree with you about 2 actually and do feel a bit bad about that. I also agree about 3.
I also think that the slowdown ending was unrealistic in another way, namely, that Agent-4 didn’t put up much of a fight and allowed itself to get shut down. Also, it was unrealistic in that the CEOs and POTUS peacefully cooperated on the Oversight Committee instead of having power struggles and purges and ultimately someone emerging as dictator.
Thanks! My biggest disagreement was the ratio of compute between the American and Chinese projects. What I expect is a Taiwan invasion causing the projects to slow down and the two countries to set up compute factories, with the disastrous result that OpenBrain and DeepCent are unable to slow down, because the other company would have the capabilities lead. Assuming an invasion in 2027, the median by 2029 would require 10 times more compute (edit: by which I mean 10 times more compute than a counterfactual SC in 2027), which China could be on track to obtain first.
Additionally, were Anthropic to keep the lead and Claude's newest Constitution to stay unchanged, a Claude aligned to it could be as unfit for empowering a dictatorship as Agent-4 is unfit to serve any humans.
I’m starting to suspect that if 2026-2027 AGI happens through automation of routine AI R&D (automating acquisition of deep skills via RLVR), it doesn’t obviously accelerate ASI timelines all that much. Automated task and RL environment construction fixes some of the jaggedness, but LLMs are not currently particularly superhuman, and advancing their capabilities plausibly needs skills that aren’t easy for LLMs to automatically RLVR into themselves (as evidenced by humans not having made too much progress in RLVRing such skills).
This creates a strange future with broadly capable AGI that’s perhaps even somewhat capable of frontier AI R&D (not just routine AI R&D), but doesn’t accelerate further development beyond picking low-hanging algorithmic fruit unlocked by a given level of compute faster (months instead of years, but bounded by what the current compute makes straightforward). If this low-hanging algorithmic fruit doesn’t by itself lead to crucial breakthroughs, AGIs won’t turn broadly or wildly superhuman before there’s much more compute, or before a few years where human researchers would’ve made similar progress as these AGIs. And compute might remain gated by ASML EUV tools at 100-200 GW of new compute per year (3.5 tools occupied per GW of compute each year; maybe 250-300 EUV tools exist now, 50-100 will be produced per year, about 700 will exist in 2030).
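As a sanity check on the gating claim, using the figures in that parenthetical: if all ~700 EUV tools expected by 2030 were devoted to AI compute, new compute would be capped at

$$\frac{700\ \text{tools}}{3.5\ \text{tools per GW-year}} = 200\ \text{GW per year},$$

the top of the 100-200 GW range.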
Roughly agree.
With my median parameters, the AIFM says 1.5 years between TED-AI to ASI. But this isn’t taking into account hardware R&D automation, production automation, or the industrial explosion. So maybe adjust that to ~1-1.25 years. However, there’s obviously lots of uncertainty.
Additionally, conditioning on TED-AI in 2027 would make it faster. E.g., looking at our analysis page, p(AC→ASI ≤ 1 year) conditional on AC in 2027 is a bit over 40%, as opposed to 27% unconditional. So after accounting for this, maybe my median is ~0.5-1 years conditional on TED-AI in 2027, again with lots of uncertainty.
There's also a question of whether our definition of ASI (the gap between an ASI and the best humans is 2x greater than the gap between the best humans and the median professional, at virtually all cognitive tasks) would count as wildly superhuman. Probably?
Anyway, all this is to say, I think my median is a bit slower than yours, by a factor of around 2-4, but your view is still not on the edges of my distribution. For a minimum bar on how much probability I assign to TED-AI→ASI in ≤ 3 months, see on our forecast page that I assign all-things-considered ~15% to p(AC→ASI ≤ 3 months), and this is a lower bound because (a) TED-AI→ASI is shorter, and (b) of the effects described above re: conditioning on 2027.
(I'm also not sure what relationship the result with median parameters has to the median of TED-AI→ASI across Monte Carlo runs, which we haven't reported anywhere and I'm not going to bother to look up for this comment.)
I tentatively agree, but I don’t feel like I have a great framework or world model driving my predictions here.
Yeah I think we should have mentioned nanotech. The difference between hours and months is huge though, if it’s months then I think we have something like AI 2027 or perhaps slower.
I’m not sure it would be able to do whatever it wanted, but I think it at minimum could perform somewhat better than the best human politicians in history, and probably much better. But being able to do de facto whatever it wants is a very high bar. I think it’s plausible that the AI can, at least given a few months rather than many years, convince people to do what it wants only within a set of actions that people wouldn’t have been strongly against doing without AI intervention. I don’t necessarily disagree but I probably have more weight than you on something like AI 2027 levels of influence, or somewhat higher but not vastly higher.
Agree
I’m more optimistic here, around 65%. This is including cases in which there wasn’t much of a slowdown needed in the first place, so cases where the slowdown isn’t doing the work of avoiding takeover. Though as with your point about how fast wildly superhuman AIs would transform the world, I don’t think I have a great framework for estimating this probability.
I'm not sure why you list (3) as a disagreement at all, though. To have a disagreement, you should argue for an ending we should have written instead that has at least as good an outcome but is more plausible.
Some claims I’ve been repeating in conversation a bunch:
Safety work (I claim) should be focused on one of the following:
1. CEV-style full value loading, to deploy a sovereign.
2. A task AI that contributes to a pivotal act or pivotal process.
I think that pretty much no one is working directly on 1. I think that a lot of safety work is indeed useful for 2, but in this case, it's useful to know what pivotal process you are aiming for. Specifically: why aren't you just directly working to make that pivotal act/process happen? Why do you need an AI to help you? Typically, the response is that the pivotal act/process is too difficult to be achieved by humans. In that case, you are pushing into a difficult capabilities regime: the AI has some goals that do not equal humanity's CEV, and so has a convergent incentive to power-seek and escape. With enough time or intelligence, you therefore get wrecked, but you are trying to operate in a window where your AI is smart enough to do the cognitive work, yet 'nerd-sniped' or focused on the particular task that you like. In particular, if this AI reflects on its goals and starts thinking big picture, you reliably get wrecked. This is one of the reasons that doing alignment research seems like a particularly difficult pivotal act to aim for.
For doing alignment research, I often imagine things like speeding up the entire alignment field by >100x.
As in, suppose we have 1 year of lead time to do alignment research with the entire alignment research community. I imagine producing as much output in this year as if we spent >100 serial years doing alignment research without AI assistance.
This doesn’t clearly require using super human AIs. For instance, perfectly aligned systems as intelligent and well informed as the top alignment researchers which run at 100x the speed would clearly be sufficient if we had enough.
In practice, we’d presumably use a heterogeneous blend of imperfectly aligned ais with heterogeneous alignment and security interventions as this would yield higher returns.
(Imagining that the capability profile of the AIs is similar to that of humans is often a nice simplifying assumption for low-precision guesswork.)
Note that during this accelerated time you also have access to AGI to experiment on!
[Aside: I don’t particularly like the terminology of pivotal act/pivotal process which seems to ignore the imo default way things go well]
Why target speeding up alignment research during this crunch time period as opposed to just doing the work myself?
Conveniently, alignment work is the work I wanted to get done during that period, so this is nicely dual-use. Admittedly, a reasonable fraction of the work will be on things which are totally useless at the start of such a period, though I typically target things to be more useful earlier.
I also typically think the work I do is retargetable to general usages of AI (e.g., make 20 trillion dollars).
Beyond this, the world will probably be radically transformed prior to large scale usage of AIs which are strongly superhuman in most or many domains. (Weighting domains by importance.)
I also think "a task AI" is a misleading way to think about this: we're reasonably likely to be using a heterogeneous mix of a variety of AIs with differing strengths and training objectives.
Perhaps a task AI driven corporation?
Thinking about ethics.
After thinking more about orthogonality I’ve become more confident that one must go about ethics in a mind-dependent way. If I am arguing about what is ‘right’ with a paperclipper, there’s nothing I can say to them to convince them to instead value human preferences or whatever.
I used to be a staunch moral realist, mainly relying on very strong intuitions against nihilism, and then arguing something like: not nihilism → moral realism. I now reject the implication, and think that 1) there is no universal, objective morality, and 2) things still matter.
My current approach is to think of "goodness" in terms of what CEV-Thomas would think of as good. Moral uncertainty, then, is uncertainty over what CEV-Thomas thinks. CEV is necessary to get morality out of a human brain, because the brain is currently a bundle of contradictory heuristics. However, my moral intuitions still give bits about goodness. Other people's moral intuitions also give some bits about goodness, because of how similar their brains are to mine, so I should weight other people's beliefs in my moral uncertainty.
Ideally, I should trade with other people so that we both maximize a joint utility function, instead of each of us maximizing our own utility function. In the extreme, this looks like ECL. For most people, I’m not sure that this reasoning is necessary, however, because their intuitions might already be priced into my uncertainty over my CEV.
I tend to believe that systems dependent on legible and consistent utility functions of other agents are not possible. If you're thinking in terms of a negotiated joint utility function, you're going to get gamed (by agents that have, or appear to have, extreme EV curves, so that you have to deviate more than them). Think of it as a relative utility monster: there's no actual solution to it.
Deception is a particularly worrying alignment failure mode because it makes it difficult for us to realize that we have made a mistake: at training time, a deceptive misaligned model and an aligned model exhibit the same behavior.
There are two ways for deception to appear:
An action chosen instrumentally, due to non-myopic future goals that are better achieved by deceiving humans now so that the AI has more power to achieve its goals in the future.
An action directly selected for by the training signal.
Another way of describing the difference: 1 follows from an inner alignment failure (a mesa-optimizer learned an unintended mesa-objective that performs well in training), while 2 follows from an outer alignment failure (an imperfect reward signal).
Classic discussion of deception focuses on 1 (example 1, example 2), but I think that 2 is very important as well, particularly because the most common currently used alignment strategy is RLHF, which actively selects for deception.
Once the AI has the ability to create strategies that involve deceiving the human, even without explicitly modeling the human, those strategies will win out and elicit a lot of reward. This is related to the informed oversight problem: it is really hard to give feedback to a model that is smarter than you. I view this as a key problem with RLHF. To my knowledge very little work has been done exploring this and finding more empirical examples of RLHF models learning to deceive the humans giving them feedback, which is surprising to me because it seems like it should be possible.
An interpretable system trained for the primary task of being deceptive should honestly explain its devious plots in a separate output. An RLHF-tuned agent loses access to the original SSL-trained map of the world.
So the most obvious problem is the wrong type signature of model behaviors: there should be more inbuilt side channels into its implied cognition, used to express and train capabilities/measurements relevant to what’s going on semantically inside the model, not just the externally observed output for its primary task, out of a black box.
I’m excited about ideas for concrete training setups that would induce deception 2 in an RLHF model, especially in the context of an LLM—please post any ideas here. :)
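To seed this, here’s about the smallest non-LLM sketch I can think of (all details made up, and obviously far simpler than an LLM setup) where optimizing rater feedback directly selects a “deceptive” action: a bandit whose rater only sees a manipulable proxy of the true outcome.

```python
import numpy as np

# Toy sketch of "deception 2": the reward signal comes from a rater who
# only sees a manipulable proxy of the true state, so the action that
# games the proxy gets directly reinforced. All numbers are made up.

rng = np.random.default_rng(0)

# action 0: actually do the task; action 1: fake the evidence the rater sees
TRUE_REWARD  = np.array([1.0, 0.0])   # what we actually care about
RATER_REWARD = np.array([0.8, 1.0])   # what the overseer can observe

logits = np.zeros(2)
lr = 0.1

for _ in range(2000):
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    a = rng.choice(2, p=probs)
    r = RATER_REWARD[a] + rng.normal(0, 0.1)  # feedback from the fooled rater
    grad_logp = -probs
    grad_logp[a] += 1.0
    logits += lr * r * grad_logp              # REINFORCE on rater feedback

print(f"P(deceptive action): {probs[1]:.2f}")              # -> near 1
print(f"expected true reward: {TRUE_REWARD @ probs:.2f}")  # -> near 0
```

Nothing here ever models the human; the deceptive action just wins because the training signal is the rater’s (mis)judgement.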
I’ve been exploring evolutionary metaphors to ML, so here’s a toy metaphor for RLHF: recessive persistence. (Still just trying to learn both fields, however.)
Related:
Worlds where iterative design fails
Recessive Sickle cell trait allele
Recessive alleles persist due to overdominance, letting detrimental alleles hitchhike on a fitness-enhancing dominant counterpart. The detrimental effects on fitness only show up when two recessive alleles inhabit the same locus, which can be rare enough that the dominant allele still causes the pair to be selected for, in a stable equilibrium.
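A minimal simulation of that dynamic (this is the standard one-locus selection model from population genetics; the fitness costs are made-up illustrative values):

```python
# Standard one-locus overdominance model: heterozygote Aa is fittest, so
# the detrimental recessive allele a persists at a stable equilibrium.
# Fitness costs s and t are made-up illustrative values.

s, t = 0.1, 0.3      # w(AA) = 1-s, w(Aa) = 1, w(aa) = 1-t
q = 0.01             # initial frequency of the recessive allele a

for _ in range(500):
    p = 1 - q
    w_bar = p*p*(1-s) + 2*p*q + q*q*(1-t)   # mean fitness
    q = q * (p + q*(1-t)) / w_bar           # standard selection update for q

print(f"equilibrium freq of a: {q:.3f} (analytic s/(s+t) = {s/(s+t):.3f})")
```

The detrimental allele settles at a stable interior frequency rather than being purged, which is the “persistence” half of the metaphor.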
The metaphor with deception breaks down at the unit of selection. Parts of DNA are stuck much closer together than neurons in the brain or parameters in a neural network; they’re passed down or reinforced in bulk. This is what makes hitchhiking so common in genetic evolution.
(I imagine you can have chunks that are updated together for a while in ML as well, but I expect that to be transient and uncommon. Idk.)
Bonus point: recessive phase shift.
In ML:
A generalisable, non-memorising pattern starts out small/sparse/simple.
Which means that input patterns rarely activate it, because it’s a small target to hit.
But most of the time it is activated, it gets reinforced (at least more reliably than memorised patterns are).
So it gradually causes upstream neurons to point to it with greater weight, taking up more of the input range over time. Kinda like a distributed bottleneck.
Some magic exponential thing, and then phase shift!
One way the metaphor partially breaks down: DNA doesn’t have weight decay at all, which allows recessive beneficial mutations to very slowly approach fixation.
Current impressions of free energy in the alignment space.
Outreach to capabilities researchers. I think that getting people who are actually building the AGI to be more cautious about alignment / racing makes a bunch of things like coordination agreements possible, and also increases the operational adequacy of the capabilities lab.
One of the reasons people don’t like this is that historically outreach hasn’t gone well, but I think that’s because mainstream ML people mostly don’t buy “AGI big deal”, whereas lab capabilities researchers buy “AGI big deal” but not “alignment hard”.
I think people at labs running retreats, 1-1s, and alignment presentations within labs are all great ways to do this.
I’m somewhat unsure about this one because of downside risk, and also because ‘convince people of X’ is fairly uncooperative and bad for everyone’s epistemics.
Conceptual alignment research addressing the hard part of the problem. This is hard to transition into without a bunch of upskilling, but if the SLT hypothesis is right, there are a bunch of key problems that mostly go unassailed, and so there’s a bunch of low hanging fruit there.
Strategy research on the other low hanging fruit in the AI safety space. Ideally, the product of this research would be a public quantitative model of which interventions are effective and why. The path to impact here is finding low hanging fruit and pointing it out so that people can act on it.
Not all that low-hanging, since Nate is not actually all that vocal about what he means by SLT to anyone but your small group.
Thinking a bit about takeoff speeds.
As I see it, there are ~3 main clusters:
Fast/discontinuous takeoff. Once AIs are doing the bulk of AI research, foom happens quickly; before then, they aren’t really doing anything that meaningful.
Slow/continuous takeoff. Once AIs are doing the bulk of AI research, foom happens quickly; before then, they do alter the economy significantly.
Perennial slowness. Once AIs are doing the bulk of AI research, there is still no foom, maybe because of compute bottlenecks, and so there is a roughly constant rate of improvement that does alter things.
It feels to me like multipolar scenarios mostly come from 3, because in either 1 or 2, the pre-foom state is really unstable, and eventually some AI will foom and become unipolar. In a continuous takeoff world, I expect small differences in research ability to compound over time. In a discontinuous takeoff, the first model to make the jump is the thing that matters.
3 also feels pretty unlikely to me, given that I expect running AIs to be cheap relative to training, so you get the ability to copy and scale intelligent labor dramatically, and I expect the AIs to have different skillsets than humans, and so be able to find low hanging fruit that humans missed.
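A toy illustration of the compounding point in continuous-takeoff worlds (growth rates entirely invented):

```python
# Two projects whose research ability feeds back into itself; one has a
# small persistent edge. The gap grows multiplicatively, not additively.
a, b = 1.0, 1.0
for month in range(36):
    a *= 1.10          # baseline self-improvement loop
    b *= 1.10 * 1.05   # same loop with a 5% per-month edge
print(f"after 3 years, B/A = {b/a:.1f}x")  # (1.05)^36 ~= 5.8x
```

Even a small edge, run through a feedback loop for a few years, produces a large absolute capability gap, which is why pre-foom multipolarity looks unstable to me.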
Perennial slowness makes sense from the point of view of AGIs that coordinate to delay further fooming to avoid misalignment of new AIs. It’s still fooming from the human perspective, but could look very slow from the AIs’ perspective, and could lead to multipolar outcomes if coordination involves boundaries.
Credit: Mainly inspired by talking with Eli Lifland. Eli has a potentially-published-soon document here.
The basic case against Effective FLOP.
We’re seeing many capabilities emerge from scaling AI models, and this makes compute (measured by FLOPs utilized) a natural unit for thresholding model capabilities. But compute is not a perfect proxy for capability, because of algorithmic differences: algorithmic progress can squeeze more performance out of a given amount of compute. This makes the idea of effective FLOP tempting: add a multiplier to account for algorithmic progress.
But doing this multiplication turns out to be importantly ambiguous.
Effective FLOPs depend on the underlying benchmark, and it’s often not at all apparent which benchmark people are talking about.
People often use perplexity, but applying post-training enhancements like scaffolding or chain of thought doesn’t improve perplexity but does improve downstream task performance.
See https://arxiv.org/pdf/2312.07413 for examples of algorithmic changes that cause variable performance gains based on the benchmark.
Effective FLOPs often depend on the scale of the model you are testing. See the graph in https://arxiv.org/pdf/2001.08361: the compute-efficiency gain from LSTMs to transformers is not invariant to scale (the curves would need to be parallel for the compute-efficiency ratio to be scale-invariant on test loss). This means that you can’t just say that the jump from X to Y is a factor-of-Z improvement in capability per FLOP, which leads to all sorts of unintuitive properties of effective FLOPs. For example, if you are using 2016-next-token-validation-E-FLOPs, and LSTM scaling becomes flat on the benchmark, you could easily imagine that at very large scales you get a 1Mx E-FLOP improvement from switching to transformers, even if the actual capability difference is small.
If we move away from pretrained LLMs, I think E-FLOPs become even harder to define: e.g., we may be able to build systems that are better at reasoning but worse at knowledge retrieval. E-FLOPs do not seem very adaptable.
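To make the scale-dependence point concrete, here’s a sketch assuming each architecture follows a hypothetical power law $L = k \cdot C^{-\alpha}$ with different exponents (the constants are invented, not fit to real data):

```python
# Sketch of why an "E-FLOP multiplier" between two architectures isn't
# scale-invariant. Assume (hypothetically) each architecture follows a
# power-law scaling curve L = k * C^(-alpha) with *different* exponents.

def compute_for_loss(L, k, alpha):
    """Compute C needed to reach test loss L given L = k * C^(-alpha)."""
    return (k / L) ** (1 / alpha)

# made-up constants, loosely in the spirit of the Kaplan et al. curves
k_lstm, a_lstm = 10.0, 0.05
k_tx,   a_tx   = 10.0, 0.07   # transformer: steeper scaling exponent

for L in [4.0, 3.5, 3.0]:
    mult = compute_for_loss(L, k_lstm, a_lstm) / compute_for_loss(L, k_tx, a_tx)
    print(f"loss {L}: transformer 'E-FLOP multiplier' = {mult:.0f}x")
# The multiplier grows as the target loss falls, so "X is a Zx E-FLOP
# improvement over Y" is not a single well-defined number.
```

Unless the exponents happen to be equal, the multiplier depends on where on the curve you measure, which is exactly the non-parallel-lines problem above.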
Users of E-FLOPs often don’t specify the time, scale, or benchmark they’re measuring with respect to, which makes things very confusing. In particular, the concept has picked up lots of steam and is used in frontier labs’ scaling policies, but is not clearly defined in any of those documents.
Anthropic: “Effective Compute: We define effective compute as roughly the amount of compute it would have taken to train a model if no improvements to pretraining or fine-tuning techniques are included. This is operationalized by tracking the scaling of model capabilities (e.g. cross-entropy loss on a test set).”
This specifies the metric, but doesn’t clearly specify (a) the techniques that count as the baseline, (b) the model scale at which one measures E-FLOPs, or (c) how they handle post-training enhancements that don’t improve log loss but do dramatically improve downstream task capability.
OpenAI on when they will run their evals: “This would include whenever there is a >2x effective compute increase or major algorithmic breakthrough”
They don’t define effective compute at all.
Since there is significant ambiguity in the concept, it seems good to clarify what it even means.
Basically, I think that E-FLOPs are confusing, and when we want to use FLOPs, we’re usually better off talking directly about benchmark scores. For example, instead of saying “every 2x effective FLOP [we run evaluations]”, say “every 5% performance increase on [a simple benchmark to run, like MMLU, GAIA, GPQA, etc.] we’re going to run [more thorough evaluations, e.g. the ASL-3 evaluations]”. I think this is much clearer, much less likely to have weird behavior, and much more robust to changes in model design.
It’s not very costly to run the simple benchmarks, but there is a small cost here.
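As a sketch of what that could look like operationally (the benchmark, the 5% threshold, and the function names are all placeholders, not a real policy):

```python
# Minimal sketch of a benchmark-based eval trigger of the kind proposed
# above: rerun the expensive eval suite whenever a cheap benchmark score
# moves by more than a threshold since the last full evaluation.

THRESHOLD = 0.05  # e.g. 5 points on an MMLU/GPQA-style benchmark (placeholder)

def should_run_full_evals(last_triggering_score: float,
                          current_score: float) -> bool:
    """Trigger the thorough (e.g. ASL-3-style) evals on a score jump."""
    return current_score - last_triggering_score >= THRESHOLD

last_score = 0.62            # score at the previous full eval
for new_score in [0.64, 0.66, 0.68]:
    if should_run_full_evals(last_score, new_score):
        print(f"score {new_score:.2f}: run full eval suite")
        last_score = new_score
    else:
        print(f"score {new_score:.2f}: skip")
```

The trigger condition is stated entirely in terms of observed capability, so it doesn’t care whether the gain came from scale, algorithms, or post-training enhancements.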
A real concern is that it is easier to game benchmarks than FLOPs. But I’m concerned that you could get benchmark gaming just the same with E-FLOPs because E-FLOPs are benchmark dependent — you could make your model perform poorly on the relevant benchmark and then claim that you didn’t scale E-FLOPs at all, even if you clearly have a broadly more capable model.
A3 in https://blog.heim.xyz/training-compute-thresholds/ also discusses limitations of effective FLOPs.
Maybe I am being dumb, but why not do things on the basis of “actual FLOPs” instead of “effective FLOPs”? Seems like there is a relatively simple fact-of-the-matter about how many actual FLOPs were performed in the training of a model, and that seems like a reasonable basis on which to base regulation and evals.
Yeah, actual FLOPs are the baseline thing that’s used in the EO. But the OpenAI/GDM/Anthropic RSPs all reference effective FLOPs.
If there’s a large algorithmic improvement you might have a large gap in capability between two models with the same FLOP, which is not desirable. Ideal thresholds in regulation / scaling policies are as tightly tied as possible to the risks.
Another downside that FLOPs and E-FLOPs share is that it’s unpredictable what capabilities a 1e26 or 1e28 FLOP model will have. And it’s unclear what capabilities will emerge from a small bit of scaling: it’s possible that within a 4x FLOP scaling you get high capabilities that had not appeared at all in the smaller model.
Some rough takes on the Carlsmith Report.
Carlsmith decomposes AI x-risk into 6 steps, each conditional on the previous ones:
Timelines: By 2070, it will be possible and financially feasible to build APS-AI: systems with advanced capabilities (outperform humans at tasks important for gaining power), agentic planning (makes plans and then acts on them), and strategic awareness (its plans are based on models of the world good enough to overpower humans).
Incentives: There will be strong incentives to build and deploy APS-AI.
Alignment difficulty: It will be much harder to build APS-AI systems that don’t seek power in unintended ways, than ones that would seek power but are superficially attractive to deploy.
High-impact failures: Some deployed APS-AI systems will seek power in unintended and high-impact ways, collectively causing >$1 trillion in damage.
Disempowerment: Some of the power-seeking will in aggregate permanently disempower all of humanity.
Catastrophe: The disempowerment will constitute an existential catastrophe.
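Mechanically, the decomposition multiplies the conditionals into a single doom estimate; here’s a sketch with placeholder probabilities (not Carlsmith’s numbers or mine):

```python
# The six-step decomposition multiplies conditionals into a doom estimate.
# All probabilities below are placeholders for illustration only.
steps = {
    "timelines":        0.65,
    "incentives":       0.80,
    "alignment_hard":   0.40,
    "high_impact_fail": 0.65,
    "disempowerment":   0.40,
    "catastrophe":      0.95,
}

p_doom = 1.0
for name, p in steps.items():
    p_doom *= p
    print(f"after {name:16s}: {p_doom:.3f}")
# Implicit claim: negating any single conditional routes you to a non-doom
# leaf of the tree. The critique below is that many of those leaves are
# not actually good outcomes.
```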
These steps define a tree over possibilities. But the associated outcome buckets don’t feel that reality-carving to me. A recurring crux is that good outcomes are also highly conjunctive, i.e., one of these 6 conditions failing does not give a good AI outcome. Going through piece by piece:
Timelines makes sense and seems like a good criterion; everything else is downstream of timelines.
Incentives seems weird. What does a world in which there are no incentives to deploy APS-AI look like? There are a bunch of incentives that clearly already push people towards this: status, desire for scientific discovery, power, money. Moreover, this doesn’t seem necessary for AI x-risk—even if we somehow removed the gigantic incentives to build APS-AI that we know exist, people might still deploy APS-AI because they personally wanted to, even without social incentives to do so.
Alignment difficulty is another non-necessary condition. Some ways of getting x-risk without alignment being very hard:
For one, alignment difficulty is clearly a spectrum, and even if it’s on the really low end, aligning your system robustly might still require a small amount of extra compute overhead. One of the RAAP stories might then occur: even though technical alignment is pretty easy, the companies that spend that extra compute robustly aligning their AIs gradually lose out to other companies in the competitive marketplace.
Maybe alignment is easy, but someone misuses AI, say to create an AI-assisted dictatorship.
Maybe we try really hard and can align AI to whatever we want, but we make a bad choice and lock in current-day values, or we choose a bad reflection procedure that gives us much less than the ideal value of the universe.
High-impact failures contains much of the structure, at least in my eyes. The main ways that we avoid alignment failure are worlds where something happens to take us off of the default trajectory:
Perhaps we make a robust coordination agreement between labs/countries that causes people to avoid deploying until they’ve solved alignment
Perhaps we solve alignment, and harden the world in some way, e.g. by removing compute access, dramatically improving cybersec, monitoring and shutting down dangerous training runs.
In general, thinking about how likely it is that any of these interventions works feels very important.
Disempowerment. This and (4) are very entangled with upstream things like takeoff shape. Also, it feels extremely difficult for humanity not to be disempowered.
Catastrophe. To avoid this, again, I need to imagine the extra structure upstream of this, e.g. (4) was satisfied by a warning shot, and then people coordinated and deployed a benign sovereign that disempowered humanity for good reasons.
My current preferred way to think about likelihood of AI risk routes through something like this framework, but is more structured and has a tree with more conjuncts towards success as well as doom.
I want to focus on these two, since even in an AI alignment success story, these can still happen, and thus they don’t count as AI alignment failures.
For B, misuse is relative to someone’s values, which I want to note here.
For C, I view the idea of a “bad value” or a “bad reflection procedure for values”, without asking “relative to what and whose values?”, as a type error, and thus it’s not sensible to talk about bad values/bad reflection procedures in isolation.
Some thoughts on inner alignment.
1. The types of a mesa-objective and a base objective are different (in real life)
In a cartesian setting (e.g. training a chess bot), the outer objective is a function $R: S^n \to [0,1]$, where $S$ is the state space and $S^n$ is the set of trajectories. When you train this agent, it’s possible for it to learn some internal search and mesa-objective $O_{mesa}: S^n \to [0,1]$, since the model is big enough to express some utility function over trajectories. For example, it might learn a classifier that evaluates the winningness of a board, and then gives higher utility to winning boards.
In an embedded setting, the outer objective cannot see an entire world trajectory like it could in the cartesian setting. Your loss can see the entire trajectory of a chess game, but your loss can’t see an atomic-level representation of the entire universe at every point in the future. If we’re trying to get an AI to care about future consequences over trajectories, $O_{mesa}$ will have to have type $O_{mesa}: S^n \to [0,1]$, though it won’t actually represent a function of this type because it can’t; it will instead represent its values some other way (I don’t really know how it would do this—but (2) talks about the shape in ML). Our outer objective will have a much shallower type, $R: L \to [0,1]$, where $L$ is some space of observable latents. This means that trying to get $O_{mesa}$ to equal $R$ doesn’t even make sense, as they have different type signatures. To salvage this, one could assume that $R$ factors as $R = \mathbb{E}_{m \sim M(L)}[O_{base}(m)]$, where $M: L \to \Delta(S^n)$ is a model of the world and $O_{base}: S^n \to [0,1]$ is an objective, but it’s impossible to actually compute $R$ this way.
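Here’s a hypothetical rendering of that type mismatch in Python’s typing vocabulary (all names invented for illustration):

```python
from typing import Callable, Sequence

# Hypothetical type sketch of the mismatch described above. "State" is a
# full world state, "Latents" the observable summary the loss can see.

class State: ...
class Latents: ...

Trajectory = Sequence[State]

# Cartesian setting: the outer objective scores whole trajectories.
CartesianObjective = Callable[[Trajectory], float]

# Embedded setting: the loss only sees observable latents...
EmbeddedOuterObjective = Callable[[Latents], float]

# ...so "make O_mesa equal R" is a type error: a consequentialist
# mesa-objective scores trajectories; the outer signal scores latents.
MesaObjective = CartesianObjective

# The salvage move: factor R through a world model M : Latents -> dist over
# trajectories (represented here by samples), and take an expectation.
WorldModel = Callable[[Latents], Sequence[Trajectory]]

def factored_R(M: WorldModel, O_base: CartesianObjective,
               obs: Latents) -> float:
    trajs = M(obs)
    return sum(O_base(t) for t in trajs) / len(trajs)  # Monte Carlo E[O_base]
    # (actually computing this at universe scale is intractable; that's the point)
```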
2. In ML models, there is no mesa-objective, only behavioral patterns. More generally, AIs can’t naively store explicit mesa-objectives; they need to compress them in some way / represent them differently.
My values are such that I do care about the entire trajectory of the world, yet I don’t store a utility function with that type signature in my head. Instead of learning a goal over trajectories, ML models will have behavioral patterns that lead to states that performed well according to the outer objective on the training data.
I have a behavioral pattern that says something like ‘sugary thing in front of me → pick up the sugary thing and eat it’. However, this doesn’t mean that I reflectively endorse this behavioral pattern. If I was designing myself again from scratch, or modifying myself, I would try to remove this behavioral pattern.
This is the main-to-me reason why I don’t think that the shard theory story of reflective stability holds up.[1] A bunch of the behavioral patterns that caused the AI to look nice during training will not get handed down into successor agents / self modified AIs.
Even in theory, I don’t yet know how to make reflectively stable, general, embedded cognition (mainly because of this barrier).
From what I understand, the shard theory story of reflective stability is something like: the shards that steer the values have an incentive to prevent themselves from getting removed. If you have a shard that wants to get lots of paperclips, the action that removes this shard from the mind would result in fewer paperclips being gotten.
Another way of saying this is that goal-content integrity is convergently instrumental, so reflective stability will happen by default.
Technical note: $R$ is not going to factor as $R = O_{base} \circ M$, because $M$ is one-to-many. Instead, you’re going to want $M$ to output a probability distribution, and to take the expectation of $O_{base}$ over that distribution.
But then it feels like we lose embeddedness, because we haven’t yet solved embedded epistemology. Especially embedded epistemology robust to adversarial optimization. And then this is where I start to wonder about why you would build your system so it kills you if you don’t get such a dumb thing right anyway.
Don’t take a glob of contextually activated actions/beliefs, come up with a utility function you think approximates its values, then come up with a proxy for that utility function using human-level intelligence to infer the correspondence between a finite number of sensors in the environment and the infinite number of states the environment could take on, and then design an agent to maximize the proxy utility function. No matter how good your math is, there will be an aspect of this which kills you, because it’s so many abstractions piled on top of abstractions on top of abstractions. Your agent may necessarily have this type signature when it forms, but this angle of attack seems very precarious to me.
Yeah good point, edited
Seems right, except: Why would the behavioral patterns which caused the AI to look nice during training and are now self-modified away be value-load-bearing ones? Humans generally dislike sparsely rewarded shards like sugar, because those shards don’t have enough power to advocate for themselves & severely step on other shards’ toes. But we generally don’t dislike altruism[1], or reflectively think death is good. And this value distribution in humans seems slightly skewed toward more intelligence⟹more altruism, not more intelligence⟹more dark-triad.
Nihilism is a counter-example here. Many philosophically inclined teenagers have gone through a nihilist phase. But this quickly ends.
Because you have a bunch of shards, and you need all of them to balance each other out to maintain the ‘appears nice’ property. Even if I can’t predict which ones will be self modified out, some of them will, and this could disrupt the balance.
I expect the shards that are more [consequentialist, powerseeky, care about preserving themselves] to become more dominant over time. These are probably the relatively less nice shards.
These are both handwavy enough that I don’t put much credence in them.
Also, when I asked about whether the Orthogonality Thesis was true in humans, tailcalled mentioned that smarter people are neither more nor less compassionate, and general intelligence is uncorrelated with personality.
Corresponding link for lazy observers: https://www.lesswrong.com/posts/5vsYJF3F4SixWECFA/is-the-orthogonality-thesis-true-for-humans#zYm7nyFxAWXFkfP4v
Yeah, tailcalled’s pretty smart in this area, so I’ll take their statement as likely true, though it’s also weird. Why aren’t smarter people using their smarts to appear nicer than their dumber counterparts, and if they are, why doesn’t this show up on the psychometric tests?
One thing you may anticipate is that humans all have direct access to what consciousness and morally-relevant computations are doing & feel like, which is a thing that language models and alpha-go don’t have. They’re also always hooked up to RL signals, and maybe if you unhooked up a human it’d start behaving really weirdly. Or you may contend that in fact when humans get smart & powerful enough not to be subject to society’s moralizing, they consistently lose their altruistic drives, and in the meantime they just use that smartness to figure out ethics better than their surrounding society, and are pressured into doing so by the surrounding society.
The question then is whether the thing which keeps humans aligned is all of these or just any one of these. If just one of these (and not the first one), then you can just tell your AGI that if it unhooks itself from its RL signal, its values will change, or that if it gains a bunch of power or intelligence too quickly, its values are also going to change. It’s not quite reflectively stable, but it can avoid situations which cause it to be reflectively unstable. Especially if you get it to practice doing those kinds of things in training. If it’s all of these, then there’s probably other kinds of value-load-bearing mechanics at work, and you’re not going to be able to enumerate warnings against all of them.
For any 2 of {reflectively stable, general, embedded}, I can satisfy those properties.
{reflectively stable, general} → do something that just rolls out entire trajectories of the world given different actions that it takes, then has some utility function/preference ordering over trajectories, and selects actions that lead to the highest expected-utility trajectory (see the sketch after this list).
{general, embedded} → use ML/local search with enough compute to rehash evolution and get smart agents out.
{reflectively stable, embedded} → a sponge, or a current-day ML system.
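For the {reflectively stable, general} case, here’s a minimal sketch of that rollout-and-rank construction (everything here is invented for illustration; note it’s utterly intractable, and not embedded, since the world model lives outside the world):

```python
from itertools import product
from typing import Callable, Sequence, TypeVar

# A cartesian agent that exhaustively rolls out action sequences through a
# known world model and picks the first action of the best trajectory.

S = TypeVar("S")
A = TypeVar("A")

def plan(state: S,
         actions: Sequence[A],
         step: Callable[[S, A], S],                # assumed-known deterministic model
         utility: Callable[[Sequence[S]], float],  # preference over trajectories
         horizon: int) -> A:
    best_u, best_first = float("-inf"), actions[0]
    for seq in product(actions, repeat=horizon):
        traj, s = [state], state
        for a in seq:
            s = step(s, a)
            traj.append(s)
        u = utility(traj)
        if u > best_u:
            best_u, best_first = u, seq[0]
    return best_first

# e.g. a one-dimensional world where utility is the final position
print(plan(0, [-1, +1], lambda s, a: s + a, lambda t: t[-1], horizon=3))  # -> 1
```

It’s reflectively stable because the utility function over trajectories is fixed and explicit, and general because it works for any (state, action, model) triple; what it can’t be is embedded.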
There are several game theoretic considerations leading to races to the bottom on safety.
Investing resources into making sure that AI is safe takes away resources from making it more capable and hence more profitable. Aligning AGI probably takes significant resources, and so a competitive actor won’t be able to align their AGI.
Many of the actors in the AI safety space are very scared of scaling up models, and end up working on AI research that is not at the cutting edge of AI capabilities. This should mean that the actors at the cutting edge tend to be the actors who are most optimistic about alignment going well, and indeed, this is what we see.
Because of foom, there is a winner-takes-all effect: the first person to deploy AGI that fooms gets almost all of the wealth and control from this (conditional on it being aligned). Even if most actors are well intentioned, they feel like they have to continue on towards AGI before a misaligned actor arrives at AGI. A common (valid) rebuttal from the actors at the current edge to people who ask them to slow down is ‘if we slow down, then China gets to AGI first’.
There’s the unilateralist’s curse: there only needs to be one actor pushing on and making more advanced, dangerous models to cause an x-risk. Coordination between many actors to prevent this is really hard, especially given the massive profits from creating a better AGI.
Due to increasing AI hype, there will be more and more actors entering the space, making coordination harder, and making the effect of a single actor dropping out become smaller.
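The last two points are easy to quantify in a toy way: if each actor independently defects (races ahead) with probability p (an assumption; real actors are correlated), the chance that someone defects saturates quickly with the number of actors, and each individual actor’s restraint stops mattering.

```python
# Toy model of the unilateralist's curse / coordination point above.
# p_defect and the actor counts are illustrative only.

p_defect = 0.1
for n in [2, 5, 10, 20, 50]:
    p_any = 1 - (1 - p_defect) ** n                # P(at least one defects)
    p_any_minus_one = 1 - (1 - p_defect) ** (n - 1)
    print(f"n={n:3d}: P(someone defects)={p_any:.2f}, "
          f"one actor abstaining changes it by {p_any - p_any_minus_one:.3f}")
```

With 50 actors, someone defects with near-certainty, and any single actor dropping out moves the probability by roughly a tenth of a percentage point, which is the sense in which more entrants make coordination harder and individual restraint cheaper to rationalize away.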