I agree with this sentiment (“having lots of data is useful for deconfusion”) and think this is probably the most promising avenue for alignment research. In particular, I think we should prioritize the kinds of research that give us lots of bits about things that could matter. Though from my perspective, most empirical alignment work actually fails this check, so this isn’t just an “empiricism is good” take.
Since there are basically no alignment plans/directions that I think are very likely to succeed, and adding “of course, this will most likely not solve alignment and then we all die, but it’s still worth trying” to every sentence is low information and also actively bad for motivation, I’ve basically recalibrated my enthusiasm to be centered around “does this at least try to solve a substantial part of the real problem as I see it”. This is the most productive mindset for me to be in, but I’m slightly worried people might confuse it for me having a low P(doom), or being very confident in specific alignment directions, and so on; hence this post that I can point people to.
I think this may also be a useful emotional state for other people who have a similar P(doom) and feel very demotivated by it, which impacts their productivity.
I meant it as an analogy to https://en.m.wikipedia.org/wiki/Denormalization
I like this paper for crisply demonstrating an instance of poor generalization in LMs that is likely representative of a broader class of generalization properties of current LMs.
The existence of such limitations in current ML systems does not imply that ML is fundamentally not a viable path to AGI, or that timelines are long, or that AGI will necessarily also have these limitations. Rather, I find this kind of thing interesting because I believe that understanding limitations of current AI systems is very important for giving us threads to yank on that may help us with thinking about conceptual alignment. Some examples of what I mean:
It’s likely that our conception of the kinds of representations/ontology that current models have is deeply confused. For example, one might claim that current models have features for “truth” or “human happiness”, but it also seems entirely plausible that models instead have separate circuits and features entirely for “this text makes a claim that is incorrect” and “this text has the wrong answer selected”, or in the latter case for “this text has positive sentiment” and “this text describes a human experiencing happiness” and “this text describes actions that would cause a human to be happy if they were implemented”.
I think we’re probably pretty confused about mesaoptimization, in a way that’s very difficult to resolve just by thinking more about it (source: have spent a lot of time thinking about mesaoptimizers). I think this is especially salient to the people trying to make model organisms—which I think is a really exciting avenue—because if you try to make a mesaoptimizer, you immediately collide head-on with things like finding that the “training selects from the set of goals weighted by complexity” hypothesis doesn’t seem to accurately describe current model training. I think it’s appropriate to feel pretty confused about this and carefully examine the reasons why current models don’t exhibit these properties. It’s entirely reasonable for the answer to be “I expect future models to have thing X that current models don’t have”—then, you can try your best to test various X’s before having the future AIs that actually kill everyone.
There are some things that we expect AGI to do that current ML systems do not do. Partly this will be because in fact current ML systems are not analogous to future AGI in some ways—probably if you tell the AGI that A is B, it will also know that B is A. This does not necessarily have to be a property that gradually emerges and can be forecasted with a scaling law; it could emerge in a phase change, or be the result of some future algorithmic innovation. If you believe there is some property X of current ML that causes this failure, and that it will no longer be a failure in the future, then you should also be suspicious of any alignment proposal that depends on this property (and the dependence of the proposal on X may be experimentally testable). For instance, it is probably relatively easy to make an RL-trained NN policy extremely incoherent in a small subset of cases, because the network has denormalized contextual facts that are redundant across many situations (a toy illustration of this analogy follows at the end of this list). I expect this to probably be harder in models which have more unified representations for facts. To the extent I believe a given alignment technique works because it leverages this denormalization, I would be more skeptical of it working in the future.
As a counterpoint, it might also be that we had an inaccurate conception of which capabilities AGI will have that current ML systems lack—I think one important lesson of GPT-* has been that even with these failures, the resulting systems can still be surprisingly useful.
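To make the denormalization analogy concrete, here is the toy illustration mentioned above (my own sketch in Python, with a made-up fact and made-up contexts; it is an analogy about redundant storage, not a claim about how any real network represents things):

```python
# Purely illustrative: a hypothetical fact stored "normalized" vs "denormalized".
# Analogy about redundant storage only, not a claim about real model internals.

# Normalized: one canonical copy of the fact; every context reads from it.
canonical_facts = {("Paris", "capital_of"): "France"}

def answer_normalized(context: str) -> str:
    # All contexts route through the single stored copy, so they stay consistent.
    return canonical_facts[("Paris", "capital_of")]

# Denormalized: each context carries its own copy of the fact.
contextual_facts = {
    "quiz_question": {("Paris", "capital_of"): "France"},
    "translation":   {("Paris", "capital_of"): "France"},
    "trivia_night":  {("Paris", "capital_of"): "Germany"},  # one stale, inconsistent copy
}

def answer_denormalized(context: str) -> str:
    # Each context consults its own copy, so an error in one copy only
    # surfaces in that context: incoherence in a small subset of cases.
    return contextual_facts[context][("Paris", "capital_of")]

print(answer_normalized("trivia_night"))    # France
print(answer_denormalized("trivia_night"))  # Germany
```

If a fact is stored once and reused, fixing or verifying it in one context fixes it everywhere; redundant per-context copies can silently disagree, which is the kind of narrow incoherence I expect to be easier to induce in current policies.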
Awesome work! I like the autoencoder approach a lot.
I don’t think RLHF in particular had a very large counterfactual impact on commercialization or the arms race. The idea of non-RL instruction tuning for taking base models and making them more useful is very obvious for commercialization (there are multiple concurrent works to InstructGPT). PPO is better than just SFT or simpler approaches on top of SFT, but not groundbreakingly more so. You can compare text-davinci-002 (FeedME) and text-davinci-003 (PPO) to see.
The arms race was directly caused by ChatGPT, which took off quite unexpectedly not because of model quality due to RLHF, but because the UI was much more intuitive to users than the Playground (instruction-following GPT-3.5 was already in the API and didn’t take off in the same way). The tech tree from having a powerful base model to having a chatbot is not bottlenecked on RLHF existing at all, either.
To be clear, I happen to also not be very optimistic about the alignment relevance of RLHF work beyond the first few papers—certainly if someone were to publish a paper today making RLHF twice as data efficient or whatever I would consider this basically just a capabilities paper.
Obviously I think it’s worth being careful, but I think in general it’s actually relatively hard to accidentally advance capabilities too much by working specifically on alignment. Some reasons:
Researchers of all fields tend to do this thing where they have really strong conviction in their direction and think everyone should work on their thing. Convincing them that some other direction is better is actually pretty hard even if you’re trying to shove your ideas down their throats.
Often the bottleneck is not that nobody realizes that something is a bottleneck, but rather that nobody knows how to fix it. In these cases, calling attention to the bottleneck doesn’t really speed things up, whereas for thinking about alignment we can reason about what things would look like if it were to be solved.
It’s generally harder to make progress on something by accident than it is to make progress on purpose when you’re trying really hard to do it. I think this is true even if there is a lot of overlap. There’s also an EMH argument one could make here but I won’t spell it out.
I think the alignment community thinking correctly is essential for solving alignment. Especially because we will have very limited empirical evidence before AGI, and that evidence will not be obviously directly applicable without some associated abstract argument, any trustworthy alignment solution has to route through the community reasoning sanely.
Also to be clear I think the “advancing capabilities is actually good because it gives us more information on what AGI will look like” take is very bad and I am not defending it. The arguments I made above don’t apply, because they basically hinge on work on alignment not actually advancing capabilities.
We spend a lot of time on trying to figure out empirical evidence to distinguish hypotheses we have that make very similar predictions, but I think a potentially underrated first step is to make sure they actually fit the data we already have.
Understanding how an abstraction works under the hood is useful because it gives you intuitions for when it’s likely to leak and what to do in those cases.
Ran this on GPT-4-base and it gets 56.7% (n=1000)
I agree that doing conceptual work in conjunction with empirical work is good. I don’t know if I agree that pure conceptual work is completely doomed but I’m at least sympathetic. However, I think my point still stands: I think someone who can do conceptual+empirical work will probably have more impact doing that than not thinking about the conceptual side and just working really hard on empirical work.
They may find some other avenue of empirical work that can help with alignment. I think there probably exist empirical avenues substantially more valuable for alignment than making progress on interpretability, and opening those up requires thinking about the conceptual side.
Even if they think hard about it and can’t think of anything better than conceptual+interpretability, it still seems better for an interpretability researcher to have an idea of how their work will fit into the broader picture. Even if they aren’t backchaining, this still seems more useful than just randomly doing something under the heading of interpretability.
I agree that people who could do either good interpretability or conceptual work should focus on conceptual work. Also, to be clear the rest of this comment is not necessarily a defence of doing interpretability work in particular, but a response to the specific kind of mental model of research you’re describing.
I think it’s important that research effort is not fungible. Interpretability has a pretty big advantage over conceptual work: a) it has tight feedback loops, b) it is much more paradigmatic, and c) it is much easier to get into for people with an ML research background.
Plausibly the most taut constraint in research is not strictly the number of researchers you can fund/train to solve a given problem—it’s hard to get researchers to do good work if they don’t feel intellectually excited about the problem, which in turn is less likely if they feel like they’re never making any progress, or feel like they are constantly unsure about what problem they’re even trying to solve.
To be clear I am not arguing that we should focus on things that are easier to solve—I am very much in favor of not just doing things that are easy to do but actually don’t help (“looking under the streetlamp”). Rather, I think what we should be doing is finding things that actually matter and making it easier for people to get excited about it (and people who are able to do this kind of work have a huge comparative advantage here!).
My personal theory of impact for doing nonzero amounts of interpretability is that I think understanding how models think will be extremely useful for conceptual research. For instance, I think one very important data point for thinking about deceptive alignment is that current models are probably not deceptively aligned. Many people have differing explanations for which property of the current setup causes this (and therefore which things we want to keep around / whether to expect phase transitions / etc), which often imply very different alignment plans. I think just getting a sense of what even these models are implementing internally could help a lot with deconfusion here. I don’t think it’s strictly necessary to do interpretability as opposed to targeted experiments where we observe external behaviour for these kinds of things, but probably experiments that get many bits are much better than targeted experiments for deconfusion, because oftentimes the hypotheses are all wrong in subtle ways. Aside from that, I am not optimistic about fully understanding the model, training against interpretability, microscope AI, or finding the “deception neuron” as a way to audit deception. I don’t think future models will necessarily have internal structures analogous to current models.
I know for Cruise they’re operating ~300 vehicles here in SF (I was previously under the impression this was a hard cap by law until the approval a few days ago but no longer sure of this). The geofence and hours vary by user but my understanding is the highest tier of users (maybe just employees?) have access to Cruise 24/7 with a geofence encompassing almost all of SF, and then there are lower tiers of users with various restrictions like tighter geofences and 9pm-5:30am hours. I don’t know what their growth plans look like now that they’ve been granted permission to expand.
Meta note: I find it somewhat interesting that filler token experiments have been independently conceived at least 5 times just to my knowledge.
Sounds very closely related to gradient-based OOD detection methods; see https://arxiv.org/abs/2008.08030
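For readers who haven’t seen this family of methods, here is a minimal generic sketch of the flavor I mean (not the specific algorithm from the linked paper; `model` and `x` are assumed to be a trained PyTorch classifier and an input batch): backpropagate a loss between the model’s predictive distribution and a uniform target, then use the size of the parameter gradients as an uncertainty/OOD signal.

```python
# Generic sketch of gradient-based OOD scoring; assumed classifier `model` and input `x`.
# Not the exact procedure from the linked paper.
import torch
import torch.nn.functional as F

def gradient_ood_score(model: torch.nn.Module, x: torch.Tensor) -> float:
    model.zero_grad()
    logits = model(x)  # shape: [batch, num_classes]
    # Loss between the predictive distribution and a uniform target.
    uniform = torch.full_like(logits, 1.0 / logits.shape[-1])
    loss = F.kl_div(F.log_softmax(logits, dim=-1), uniform, reduction="batchmean")
    loss.backward()
    # Aggregate parameter-gradient magnitude into a single score; methods in this
    # family differ on which layers they use and which direction indicates OOD.
    total_sq = sum(p.grad.pow(2).sum().item() for p in model.parameters() if p.grad is not None)
    return total_sq ** 0.5
```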
I was quite surprised to see myself cited as “liking the broader category that QACI is in”—I think this claim may technically be true for some definition of “likes” and “broader category”, but it suggests to the casual reader a higher level of endorsement than is accurate.
I don’t have a very good understanding of QACI and therefore have no particularly strong opinions on QACI. It seems quite different from the kinds of alignment approaches I think about.
My summary of the paper: The paper proves that if you have two distributions that you want to ensure cannot be distinguished linearly (i.e. a logistic regression will fail to achieve a better-than-chance score), then one way to do this is to make sure they have the same mean. Previous work has done similar stuff (https://arxiv.org/abs/2212.04273), but without proving optimality.
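As a quick sanity check of the intuition (my own toy demo, not code from the paper): generate two classes with very different covariances, force their means to match, and observe that a logistic regression probe scores roughly at chance.

```python
# Toy demo: classes with matched means but very different covariances
# are not linearly distinguishable better than chance.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n, d = 20_000, 16

x0 = rng.normal(0.0, 1.0, size=(n, d))       # class 0: narrow isotropic Gaussian
x1 = rng.normal(0.0, 3.0, size=(n, d))       # class 1: wide isotropic Gaussian
x1 = x1 - x1.mean(axis=0) + x0.mean(axis=0)  # force the empirical means to match

X = np.vstack([x0, x1])
y = np.concatenate([np.zeros(n), np.ones(n)])
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("probe accuracy:", probe.score(X_te, y_te))  # ~0.5, i.e. chance
```

A nonlinear classifier (e.g. one that thresholds on the norm of each point) would separate these classes easily, so this is specifically a statement about linear distinguishability.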
I think it’s pretty unlikely (<5%) that decentralized volunteer training will be competitive with SOTA, ever. (Caveat: I haven’t been following volunteer training super closely so this take is mostly cached from having looked into it for GPT-Neo plus occasionally seeing new papers about volunteer training).
You are going to get an insane efficiency hit from the compute having very low-bandwidth, high-latency interconnect. I think it’s not inconceivable that someone will eventually figure out an algorithm that is only a few times worse than training on a cluster with good interconnects, but this is one of those things where people have tried for ages.
Heterogeneous compute (not all the GPUs will be the same model), lower reliability (people will turn their computers off more often than they do in datacenters), having to be robust against bad actors (people could submit bad gradients), and other challenges together add another severalfold overhead.
There just isn’t that much volunteer hardware out there. For a rough OOM, the publicly announced Facebook cluster is roughly the same size as the raw size of folding@home at its peak. All in all, I think you would need to do some serious engineering and research to get even 1% efficiency at Facebook-cluster scale.
(folding@home is embarrassingly parallelizable because it requires basically no interconnect, and therefore also doesn’t mind heterogeneous compute or reliability)
Here’s a straw hypothetical example where I’ve exaggerated both 1 and 2; the details aren’t exactly correct but the vibe is more important:
1: “Here’s a super clever extension of debate that mitigates obfuscated arguments [etc], this should just solve alignment”
2: “Debate works if you can actually set the goals of the agents (i.e you’ve solved inner alignment), but otherwise you can get issues with the agents coordinating [etc]”
1: “Well the goals have to be inside the NN somewhere so we can probably just do something with interpretability or whatever”
2: “how are you going to do that? your scheme doesn’t tackle inner alignment, which seems to contain almost all of the difficulty of alignment to me. the claim you just made is a separate claim from your main scheme, and the cleverness in your scheme is in a direction orthogonal to this claim”
1: “idk, also that’s a fully general counterargument to any alignment scheme, you can always just say ‘but what if inner misalignment’. I feel like you’re not really engaging with the meat of my proposal, you’ve just found a thing you can say to be cynical and dismissive of any proposal”
2: “but I think most of the difficulty of alignment is in inner alignment, and schemes which kinda handwave it away are trying to solve some problem which is not the actual problem we need to solve to not die from AGI. I agree your scheme would work if inner alignment weren’t a problem.”
1: “so you agree that in a pretty nontrivial number [let’s say both 1&2 agree this is like 20% or something] of worlds my scheme does actually work. I mean, how can you be that confident that inner alignment is that hard? in the worlds where inner alignment turns out to be easy, my scheme will work.”
2: “I’m not super confident, but if we assume that inner alignment is easy then I think many other simpler schemes will also work, so the cleverness that your proposal adds doesn’t actually make a big difference.”