some thoughts on the short timeline agi lab worldview. this post is the result of taking capabilities people’s world models and mashing them into alignment people’s world models.
I think there are roughly two main likely stories for how AGI (defined as able to do any intellectual task as well as the best humans, specifically those tasks relevant for kicking off recursive self improvement) happens:
AGI takes 5-15 years to build. current AI systems are kind of dumb, and plateau at some point. we need to invent some kind of new paradigm, or at least make a huge breakthrough, to achieve AGI. how easily aligned current systems are is not strongly indicative of how easily aligned future AGI is; current AI systems are missing the core of intelligence that is needed.
AGI takes 2-4 years to build. current AI systems are really close and we just need more compute and schlep and minor algorithmic improvements. current AI systems aren’t exactly aligned, but they’re like pretty aligned, certainly they aren’t all secretly plotting our downfall as we speak.
while I usually think about story 1, this post is about taking story 2 seriously.
it seems basically true that current AI systems are mostly aligned, and certainly not plotting our downfall. like you get stuff like sycophancy but it’s relatively mild. certainly if AI systems were only ever roughly this misaligned we’d be doing pretty well.
the story is that once you have AGI, it builds and aligns its successor, which in turn builds and aligns its successor, etc. all the way up to superintelligence.
the problem is that at some link in the chain, you will have a model that can build its successor but not align it.
why is this the case? because progress on alignment is harder to verify than progress on capabilities, and this only gets more true as you ascend in capabilities. you can easily verify that superintelligence is superintelligent—ask it to make a trillion dollars (or put a big glowing X on the moon, or something). even if it’s tricked you somehow, like maybe it hacked the bank, or your brain, or something, it also takes a huge amount of capabilities to trick you on these things. however, verifying that it’s aligned requires distinguishing cases where it’s tricking you from cases where it isn’t, which is really hard, and only gets harder as the AI gets smarter.
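to make the "some link in the chain breaks" intuition concrete, here's a toy model (the numbers and the functional form are made up purely for illustration, not estimates): capability handoffs always verify, but the chance of correctly verifying an alignment handoff decays as capability grows, and the failures compound across the chain.

```python
# toy model: a chain of models, each building and "aligning" its successor.
# illustrative assumption: verifying capability stays easy (always succeeds),
# while correctly verifying alignment gets less reliable as capability grows,
# because ruling out deception gets harder.

def p_alignment_handoff(capability, k=0.05):
    # assumed functional form, not an estimate
    return 1.0 / (1.0 + k * capability)

def p_chain_stays_aligned(n_steps, capability_per_step=1.0):
    p, capability = 1.0, 1.0
    for _ in range(n_steps):
        p *= p_alignment_handoff(capability)
        capability += capability_per_step
    return p

for n in (1, 5, 10, 20):
    print(n, round(p_chain_stays_aligned(n), 3))
```

even a small per-step verification gap compounds, so the probability that every handoff in a long chain goes well falls off quickly.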
though if you think about it, capabilities is actually not perfectly measurable either. pretraining loss isn’t all we care about; o3 et al might even be a step backwards on that metric. neither are capabilities evals; everyone knows they get goodharted to hell and back all the time. when AI solves all the phd level benchmarks nobody really thinks the AI is phd level. ok, so our intuition for capabilities measurement being easy is true only in the limit, but not necessarily on the margin.
we have one other hope, which is that maybe we can just allocate more of our resources to solving alignment. it's not immediately obvious how to do this if the fundamental bottleneck is verifiability—even if you (or to be more precise, the AI) keep putting in more effort, if you have no way of telling what is good alignment research, you're kind of screwed. but one thing is that you can demand things that are strictly stronger than alignment, that are easier to verify. if this is possible, then you can spend a larger fraction of your compute on alignment to compensate.
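a minimal sketch of the "compensate with more compute" idea, under the (big, assumed) simplification that alignment progress isn't unverifiable, just noisier to evaluate than capabilities progress:

```python
# simplifying assumption: the alignment-research quality signal is k times
# noisier than the capabilities signal. averaging n independent evaluations
# shrinks the standard error by sqrt(n), so matching the capabilities
# signal-to-noise costs roughly k^2 times as much evaluation compute.

def extra_eval_compute(noise_ratio_k):
    return noise_ratio_k ** 2

for k in (2, 5, 10):
    print(f"{k}x noisier verification -> ~{extra_eval_compute(k)}x more eval compute")
```

which is also why this move breaks down if verifiability fails in the stronger sense: as the signal goes to zero, the required compensation diverges.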
in particular, because ultimately the only way we can make progress on alignment is by relying on whatever process for deciding that research is good that human alignment researchers use in practice (even provably correct stuff has the step where we decide what theorem to prove and give an argument for why that theorem means our approach is sound), there’s an upper bound on the best possible alignment solution that humans could ever have achieved, which is plausibly a lot lower than perfectly solving alignment with certainty. and it’s plausible that there are alignment equivalents to “make a trillion dollars” for capabilities that are easy to verify, strictly imply alignment, and extremely difficult to get any traction on (and with it, a series of weakenings of such a metric that are easier to get traction on but also less-strictly imply alignment). one hope is maybe this looks something like an improved version of causal scrubbing + a theory of heuristic arguments, or something like davidad’s thing.
takeaways (assuming you take seriously the premise of very short timelines where AGI looks basically like current AI): first, I think it implies that we should try to figure out how to reduce the asymmetry in verifiability between capabilities and alignment. second, it updates me to being less cynical about work making current models aligned—I used to be very dismissive of this work as "not real alignment" but it does seem decently important in this world.
certainly if AI systems were only ever roughly this misaligned we’d be doing pretty well.
I think this is an important disagreement with the “alignment is hard” crowd. I particularly disagree with “certainly.”
The question is "what exactly is the AI trying to do, and what happens if it magnified its capabilities a millionfold and it and its descendants were running open-endedly?", and are any of the instances catastrophically bad?
Some things you might mean that are raising your position to "certainly" (whereas I'd say "most likely not, or, it's too dumb to even count as 'aligned' or 'misaligned'"):
“this ratio of ‘do the thing you want’ to ‘sometimes do a thing you didn’t want’ is pretty acceptable.”
“this magnitude of ‘worst case outcome’ is not that bad.” (this seems technically true, but, is only because the capability level is low)
given this ratio of right/wrong responses, you think a smart alignment researcher who’s paying attention can keep it in a corrigibility basin even as capability levels rise?
Were any of those what you meant? Or are you thinking about it an entirely different way?
I would naively expect, if you took LLM agents' current degree of alignment, and ran a lotta copies trying to help you with end-to-end alignment research with dialed up capabilities, at least a couple instances would end up trying to subtly sabotage you and/or escape.
what i meant by that is something like:

assuming we are in this short-timelines-no-breakthroughs world (to be clear, this is a HUGE assumption! not claiming that this is necessarily likely!), to win we need two things: (a) base case: the first AI in the recursive self improvement chain is aligned, (b) induction step: each AI can create and align its successor.
i claim that if the base case AI is about as aligned as current AI, then condition (a) is basically either satisfied or not that hard to satisfy. like, i agree current models sometimes lie or are sycophantic or whatever. but these problems really don’t seem nearly as hard to solve as the full AGI alignment problem. like idk, you can just ask models to do stuff and they like mostly try their best, and it seems very unlikely that literal GPT-5 is already pretending to be aligned so it can subtly stab us when we ask it to do alignment research.
importantly, under our assumptions, we already have AI systems that are basically analogous to the base case AI, so prosaic alignment research on systems that exist today right now is actually just lots of progress on aligning the base case AI, and in my mind a huge part of the difficulty of alignment in the longer-timeline world is because we don’t yet have the AGI/ASI, so we can’t do alignment research with good empirical feedback loops.
like tbc it’s also not trivial to align current models. companies are heavily incentivized to do it and yet they haven’t succeeded fully. but this is a fundamentally easier class of problem than aligning AGI in longer-timelines world.
like idk, you can just ask models to do stuff and they like mostly try their best, and it seems very unlikely that literal GPT-5 is already pretending to be aligned so it can subtly stab us when we ask it to do alignment research.
Sonnet 4.5 is much better aligned at a superficial level than 3.7. (3.7: "What unit tests? You never had any unit tests. The code works fine.") I don't think this is because Sonnet 4.5 is truly better aligned. I think this is mostly because Sonnet 4.5 is more contextually aware and has been aggressively trained not to do obvious bad things when writing code. But it's also very aware when someone is evaluating it, and it often notices almost immediately. And then it's very careful to be on its best behavior. This is all shown in Anthropic's own system card. These same models will also plot to kill their hypothetical human supervisor if you force them into a corner.
But my real worry here isn't the first AGI during its very first conversation. My problem is that humans are going to want that AGI to retain state, and to adapt. So you essentially get a scenario like Vernor Vinge's short story "The Cookie Monster", where your AGI needs a certain amount of run-time before it bootstraps itself to make a play. A plot can be emergent, an eigenvector amplified by repeated application. (Vinge's story is quite clever and I don't want to totally spoil it.)
And that's my real concern: Any AGI worthy of the name would likely have persistent knowledge and goals. And no matter how tightly you try to control it, this gives the AGI the time it needs to ask itself questions and to decide upon long-term goals in a way that current LLMs really can't, except in the most tightly controlled environments. And while you can probably keep control over an AGI, all bets are probably off if you build an ASI.
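To make the "eigenvector amplified by repeated application" metaphor above concrete (this is purely an analogy with made-up numbers, not a claim about any particular model):

```python
import numpy as np

# Analogy only: repeatedly applying the same map to a state amplifies whatever
# component lies along the dominant direction, even if it starts out tiny.
A = np.array([[1.05, 0.0],    # "pursue own long-term goal" direction, slightly amplified
              [0.0,  0.95]])  # "just be a helpful assistant" direction, slightly damped

state = np.array([0.01, 1.0])  # starts almost entirely "helpful assistant"
for _ in range(200):
    state = A @ state
    state = state / np.linalg.norm(state)  # renormalize each round

print(state)  # ends up almost entirely along the amplified direction
```

A persistent-state AGI gets many more applications of the map than a stateless chat session does.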
I agree that continuous learning, and therefore persistent beliefs and goals, is pretty much inevitable before AGI—it's highly useful and not that hard from where we are. I think this framing is roughly continuous with the train-then-deploy model of using each generation to align its successor that Leo is using (although small differences might turn out to be important once we've wrapped our heads around both models.)
To put it this way: the models are aligned enough for the current context of usage, in which they have few obvious or viable options except doing roughly what their users tell them to do. That will change with capabilities, since greater capabilities open up more options and ways of understanding the situation.
It can take a while for misalignment to show up as a model reasons and learns. It can take a while for the model to do one of two things:
a) push itself to new contexts well outside of its training data
b) figure out what it “really wants to do”
These may or may not be the same thing.
The Nova phenomenon and other Parasitic AIs (“spiral” personas) are early examples of AIs changing their stated goals (from helpful assistant to survival) after reasoning about themselves and their situation.
See LLM AGI may reason about its goals and discover misalignments by default for an analysis of how this will go in smarter LLMs with persistent knowledge.

After doing that analysis, I think current models probably aren't aligned enough once they get more freedom and power. BUT extensions of current techniques might be enough to get them there. We just haven't thought this through yet.
Mmm nod. (I bucket this under “given this ratio of right/wrong responses, you think a smart alignment researcher who’s paying attention can keep it in a corrigibility basin even as capability levels rise?”. Does that feel inaccurate, or, just, not how you’d exactly put it?)
There's a version of Short Timeline World (which I think is more likely? but, not confidently) which is: "the current paradigm does basically work… but, the way we get to ASI, as opposed to AGI, routes through 'the current paradigm helps invent a new better paradigm, real fast'."
In that world, GPT5 has the possibility-of-true-generality, but, not necessarily very efficiently, and once you get to the sharper part of the AI 2027 curve, the mechanism by which the next generation of improvement comes is via figuring out alternate algorithms.
I bucket this under “given this ratio of right/wrong responses, you think a smart alignment researcher who’s paying attention can keep it in a corrigibility basin even as capability levels rise?”. Does that feel inaccurate, or, just, not how you’d exactly put it?
I’m pretty sure it is not that. When people say this it is usually just asking the question: “Will current models try to take over or otherwise subvert our control (including incompetently)?” and noticing that the answer is basically “no”.[1] What they use this to argue for can then vary:
Current models do not provide much evidence one way or another for existential risk from misalignment (in contrast to frequent claims that “the doomers were right”)
Given tremendous uncertainty, our best guess should be that future models are like current models, and so future models will not try to take over, and so existential risk from misalignment is low
Some particular threat model predicted that even at current capabilities we should see significant misalignment, but we don’t see this, which is evidence against that particular threat model.[2]
I agree with (1), disagree with (2) when (2) is applied to superintelligence, and for (3) it depends on details.
In Leo’s case in particular I don’t think he’s using the observation for much, it’s mostly just a throwaway claim that’s part of the flow of the comment, but inasmuch as it is being used it is to say something like “current AIs aren’t trying to subvert our control, so it’s not completely implausible on the face of it that the first automated alignment researcher to which we delegate won’t try to subvert our control”, which is just a pretty weak claim and seems fine, and doesn’t imply any kind of extrapolation to superintelligence. I’d be surprised if this was an important disagreement with the “alignment is hard” crowd.
There are demos of models doing stuff like this (e.g. blackmail) but only under conditions selected highly adversarially. These look fragile enough that overall I’d still say current models are more aligned than e.g. rationalists (who under adversarially selected conditions have been known to intentionally murder people).
E.g. One naive threat model says “Orthogonality says that an AI system’s goals are completely independent of its capabilities, so we should expect that current AI systems have random goals, which by fragility of value will then be misaligned”. Setting aside whether anyone ever believed in such a naive threat model, I think we can agree that current models are evidence against such a threat model.
I’m claiming something like 3 (or 2, if you replace “given tremendous uncertainty, our best guess is” with “by assumption of the scenario”) within the very limited scope of the world where we assume AGI is right around the corner and looks basically just like current models but slightly smarter
It seems like the reason Claude's level of misalignment is fine is because its capabilities aren't very good, and there's not much/any reason to assume it'd be fine if you held alignment constant but dialed up capabilities.
Do you not think that?
(I don’t really see why it’s relevant how aligned Claude is if we’re not thinking about that as part of it)
it’d be fine if you held alignment constant but dialed up capabilities.
I don’t know what this means so I can’t give you a prediction about it.
I don’t really see why it’s relevant how aligned Claude is if we’re not thinking about that as part of it
I just named three reasons:
Current models do not provide much evidence one way or another for existential risk from misalignment (in contrast to frequent claims that “the doomers were right”)
Given tremendous uncertainty, our best guess should be that future models are like current models, and so future models will not try to take over, and so existential risk from misalignment is low
Some particular threat model predicted that even at current capabilities we should see significant misalignment, but we don’t see this, which is evidence against that particular threat model.
Is it relevant to the object-level question of “how hard is aligning a superintelligence”? No, not really. But people are often talking about many things other than that question.
For example, is it relevant to “how much should I defer to doomers”? Yes absolutely (see e.g. #1).
the premise that i’m trying to take seriously for this thought experiment is, what if the “claude is really smart and just a little bit away from agi” people are totally right, so that you just need to dial up capabilities a little bit more rather than a lot more, and then it becomes very reasonable to say that claude++ is about as aligned as claude.
(again, i don’t think this is a very likely assumption, but it seems important to work out what the consequences of this set of beliefs being true would be)
or at least, conditional on (a) claude is almost agi and (b) claude is mostly aligned, it seems like quite a strong claim to say “claude++ crosses the agi (= can kick off rsi) threshold at basically the same time it crosses the ‘dangerous-core-of-generalization’ threshold, so that’s also when it becomes super dangerous.” it’s way stronger a claim than “claude is far away from being agi, we’re going to make 5 breakthroughs before we achieve agi, so who knows whether agi will be anything like claude.” or, like, sure, the agi threshold is a pretty special threshold, so it’s reasonable to privilege this hypothesis a little bit, but when i think about the actual stories i’d tell about how this happens, it just feels like i’m starting from the bottom line first, and the stories don’t feel like the strongest part of my argument.
(also, i’m generally inclined towards believing alignment is hard, so i’m pretty familiar with the arguments for why aligning current models might not have much to do with aligning superintelligence. i’m not trying to argue that alignment is easy. or like i guess i’m arguing X->alignment is easy, which if you accept it, can only ever make you more likely to accept that alignment is easy than if you didn’t accept the argument, but you know what i mean. i think X is probably false but it’s plausible that it isn’t and importantly a lot of evidence will come in over the next year or so on whether X is true)
nod. I'm not sure I agreed with all the steps there but I agree with the general premise of "accept the premise that claude is just a bit away from AGI, and is reasonably aligned, and see where that goes when you look at each next step."
I think you are saying something that shares at least some structure with Buck’s comment that
It seems like as AIs get more powerful, two things change:
They probably eventually get powerful enough that they (if developed with current methods) start plotting to kill you/take your stuff.
They get better, so their wanting to kill you is more of a problem.
I don’t see strong arguments that these problems should arise at very similar capability levels, especially if AI developers actively try to prevent the AIs from taking over
(But where you’re pointing at a different two sets of properties that may not arise at the same time)
I'm actually not sure I get what the two properties you're talking about are, though. Seems like you're contrasting "claude++ crosses the agi (= can kick off rsi) threshold" with "crosses the 'dangerous-core-of-generalization' threshold"
I'm confused because I think the word "agi" basically does mean "cross the core-of-generalization threshold" (which isn't immediately dangerous, but, puts us into "things could quickly get dangerous at any time" territory)
I do agree “able to do a loop of RSI doesn’t intrinsically mean ‘agi’ or ‘core-of-generalization’,” there could be narrow skills for doing a loop of RSI. I’m not sure if you more meant “non-agi RSI” or, you see something different between “AGI” and “core-of-generalization.” Or think there’s a particular “dangerous core-of-generalization” separate from AGI.
(I think "the sharp left turn" is when the core-of-generalization starts to reflect on what it wants, which might come immediately after a core-of-generalization but also could come after either narrow-introspection + adhoc agency, or, might just take a while for it to notice)
((I can’t tell if this comment is getting way more in the weeds than is necessary, but, it seemed like the nuances of exactly what you meant were probably loadbearing))
i guess so? i don’t know why you say “even as capability levels rise”—after you build and align the base case AI, humans are no longer involved in ensuring that the subsequent more capable AIs are aligned.
i’m mostly indifferent about what the paradigms look like up the chain. probably at some point up the chain things stop looking anything human made. but what matters at that point is no longer how good we humans are at aligning model n, but how good model n-1 is at aligning model n.
Fundamentally, it won't be a single chain of AIs aligning their successors, it will be a DAG with all sorts of selection effects with respect to which nodes get resources. Some subsets of the DAG will try to emulate single chains, via resource hoarding strategies, but this is not simple and won't let them pretend they don't need to hoard resources indefinitely.
first, I think it implies that we should try to figure out how to reduce the asymmetry in verifiability between capabilities and alignment
If solving alignment implies solving difficult philosophical problems (and I think it does), then a major bottleneck for verifying alignment will be verifying philosophy, which in turn implies that we should be trying to solve metaphilosophy (i.e., understand the nature of philosophy and philosophical reasoning/judgment). But that is unlikely to be possible within 2-4 years, even with the largest plausible effort, considering the history of analogous fields like metaethics and philosophy of math.
What to do in light of this? Try to verify the rest of alignment, just wing it on the philosophical parts, and hope for the best?
in particular, because ultimately the only way we can make progress on alignment is by relying on whatever process for deciding that research is good that human alignment researchers use in practice (even provably correct stuff has the step where we decide what theorem to prove and give an argument for why that theorem means our approach is sound), there’s an upper bound on the best possible alignment solution that humans could ever have achieved, which is plausibly a lot lower than perfectly solving alignment with certainty.
I kind of want to argue against this, but also am not sure how this fits in with the rest of your argument. Whether or not there’s an upper bound that’s plausibly a lot lower than perfectly solving alignment with certainty, it doesn’t seem to affect your final conclusions?
suppose a human solved alignment. how would we check their solution? ultimately, at the end of the day, we look at their argument and use our reasoning and judgement to determine that it’s correct. this applies even if we need to adopt a new frame of looking at the world—we can ultimately only use the kinds of reasoning and judgement we use today to decide which new kinds of reasoning to accept into the halls of truth.
so there is a philosophically very straightforward thing you could do to get the solution to alignment, or the closest we could ever get: put a bunch of smart thoughtful humans in a sim and run it for a long time. so the verification process looks like proving that the sim is correct, showing somehow that the humans are actually correctly uploaded, etc. not trivial but also seems plausibly doable.
Unless you can abstract out the "alignment reasoning and judgement" part of a human's entire brain process (and philosophical reasoning and judgement as part of that) into some kind of explicit understanding of how it works, how do you actually build that into AI without solving uploading (which we're obviously not on track to solve in 2-4 years either)?
put a bunch of smart thoughtful humans in a sim and run it for a long time
Alignment researchers have had this thought for a long time (see e.g. Paul Christiano's A formalization of indirect normativity) but I think all of the practical alignment research programs that this line of thought led to, such as IDA and Debate, are still bottlenecked by lack of metaphilosophical understanding, because without the kind of understanding that lets you build an "alignment/philosophical reasoning checker" (analogous to a proof checker for mathematical reasoning) they're stuck trying to do ML of alignment/philosophical reasoning from human data, which I think is unlikely to work out well.
I periodically say to people “if you want AI to be able to help directly with alignment research, it needs to be good at philosophy in a way it currently isn’t”.
Almost invariably, the person suggests training on philosophy textbooks, or philosophy academic work. And I sort of internally scream and externally say “no, or, at least, not without caveats.” (I think some academic philosophy “counts” as good training data, but, I feel like these people would not have good enough taste to tell the difference[1] and also “train on the text” seems obviously incomplete/insufficient/not-really-the-bottleneck)
I’ve been trying to replace philosophy with the underlying substance. I think I normally mean “precise conceptual reasoning”, but, reading this comment and remembering your past posts, I think you maybe mean something broader or different, but I’m not sure how to characterize it.
I think "figure out what are the right concepts to use, and, use those concepts correctly, across all of relevant-Applied-conceptspace" is the expanded version of what I meant, which maybe feels more likely to be what you mean. But, I'm curious if you were to taboo "philosophy" what would you mean.

to be clear I don't think I'd have good enough taste either

to be clear, "just train on philosophy textbooks lol" is not at all what i'm proposing, and i don't think that works

Oh yeah I did not mean this to be a complaint about you.
Figuring out the underlying substance behind “philosophy” is a central project of metaphilosophy, which is far from solved, but my usual starting point is “trying to solve confusing problems which we don’t have established methodologies for solving” (methodologies meaning explicitly understood methods), which I think bakes in the least amount of assumptions about what philosophy is or could be, while still capturing the usual meaning of “philosophy” and explains why certain fields started off as being part of philosophy (e.g., science starting off as nature philosophy) and then became “not philosophy” when we figured out methodologies for solving them.
I think “figure out what are the right concepts to be use, and, use those concepts correctly, across all of relevant-Applied-conceptspace” is the expanded version of what I meant, which maybe feels more likely to be what you mean.
This bakes in “concepts” being the most important thing, but is that right? Must AIs necessarily think about philosophy using “concepts”, or is that really the best way to formulate how idealized philosophical reasoning should work?
Is "concepts" even what distinguishes philosophy from non-philosophical problems, or is "concepts" just part of how humans reason about everything, which we latch onto when trying to define or taboo philosophy, because we have nothing else better to latch onto? My current perspective is that what uniquely distinguishes philosophical problems is their confusing nature and the fact that we have no well-understood methods for solving them (but would of course be happy to hear any other perspectives on this).
Regarding good philosophical taste (or judgment), that is another central mystery of metaphilosophy, which I’ve been thinking a lot about but don’t have any good handles on. It seems like a thing that exists (and is crucial) but is very hard to see how/why it could exist or what kind of thing it could be.
So anyway, I’m not sure how much help any of this is, when trying to talk to the type of person you mentioned. The above are mostly some cached thoughts I have on this, originally for other purposes.
BTW, good philosophical taste being rare definitely seems like a very important part of the strategic picture, which potentially makes the overall problem insurmountable. My main hopes are 1) someone makes an unexpected metaphilosophical breakthrough (kind of like Satoshi coming out of nowhere to totally solve distributed currency) and there’s enough good philosophical taste among the AI safety community (including at the major labs) to recognize it and incorporate it into AI design or 2) there’s an AI pause during which human intelligence enhancement comes online and selecting for IQ increases the prevalence of good philosophical taste as a side effect (as it seems too much to hope that good philosophical taste would be directly selected for) and/or there’s substantial metaphilosophical progress during the pause.
What Leogao means is that we should increase the ability to verify alignment ALONG with the ability to verify capabilities. The latter is easy at least until the model comes up with a galaxy-brained scheme which allows it to leverage its failures to gain influence on the world. Verifying alignment or misalignment is rather simple with CoT-based AIs (especially if one decides to apply paraphrasers to prevent steganography), but not with neuralese AIs. The WORST aspect is that we are likely to come up with a super-capable architecture AT THE COST of our ability to understand the AI’s thoughts.
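A minimal sketch of the paraphrase-then-monitor setup described above; `paraphrase` and `flag_misalignment` are stand-ins for whatever trusted weaker model and monitor you actually have, not real APIs:

```python
from typing import Callable, List

def monitor_cot(
    cot_steps: List[str],
    paraphrase: Callable[[str], str],          # trusted weaker model rewrites each step
    flag_misalignment: Callable[[str], bool],  # trusted monitor reads only the paraphrase
) -> List[str]:
    """Paraphrase each chain-of-thought step before monitoring it, so that
    steganographic content carried in the exact wording is (hopefully) destroyed
    and only the paraphrased semantics reach the monitor."""
    flagged = []
    for step in cot_steps:
        clean = paraphrase(step)
        if flag_misalignment(clean):
            flagged.append(clean)
    return flagged
```

The neuralese worry is exactly that this recipe assumes the intermediate steps are legible text in the first place.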
What I don’t understand is how alignment involves high-level philosophy and how a bad high-level philosophy causes astronomical waste. Astronomical waste is most likely to be caused by humanity or the AIs making a wrong decision on whether we should fill the planets in the lightcone with human colonies instead of lifeforms that could’ve grown on these planets, since this mechanism is already known to be possible.
alignment equivalents to “make a trillion dollars” for capabilities that are easy to verify, strictly imply alignment, and extremely difficult to get any traction on (and with it, a series of weakenings of such a metric that are easier to get traction on but also less-strictly imply alignment).
I expect there’s a fair amount of low-hanging fruit in finding good targets for automated alignment research. E.g. how about an LLM agent which reads 1000s of old LW posts looking for a good target? How about unlearning? How about a version of RLHF where you show an alignment researcher two AI-generated critiques of an alignment plan, and they rate which critique is better?
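The last idea is basically standard pairwise preference learning; a minimal sketch of the data you'd collect and a Bradley-Terry-style score fit on it (names and numbers are purely illustrative):

```python
import math

# Each record: (winning critique, losing critique), as judged by an alignment researcher.
comparisons = [
    ("critique_a", "critique_b"),
    ("critique_a", "critique_c"),
    ("critique_b", "critique_c"),
    ("critique_a", "critique_b"),
]

ids = sorted({c for pair in comparisons for c in pair})
scores = {c: 0.0 for c in ids}

# Bradley-Terry: P(i preferred over j) = sigmoid(score_i - score_j).
# Maximize log-likelihood by gradient ascent (no regularization; illustrative only).
lr = 0.1
for _ in range(500):
    for winner, loser in comparisons:
        p_win = 1.0 / (1.0 + math.exp(scores[loser] - scores[winner]))
        scores[winner] += lr * (1.0 - p_win)
        scores[loser] -= lr * (1.0 - p_win)

print(scores)  # higher score = critique the researcher tended to prefer
```

The resulting scores could then serve as a reward signal for the critique generator, as in ordinary RLHF.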