Alignment remains a hard, unsolved problem
This is a public adaptation of a document I wrote for an internal Anthropic audience about a month ago. Thanks to (in alphabetical order) Joshua Batson, Joe Benton, Sam Bowman, Roger Grosse, Jeremy Hadfield, Jared Kaplan, Jan Leike, Jack Lindsey, Monte MacDiarmid, Sam Marks, Fra Mosconi, Chris Olah, Ethan Perez, Sara Price, Ansh Radhakrishnan, Fabien Roger, Buck Shlegeris, Drake Thomas, and Kate Woolverton for useful discussions, comments, and feedback.
Though there are certainly some issues, I think most current large language models are pretty well aligned. Despite its alignment faking, my favorite is probably Claude 3 Opus, and if you asked me to pick between the CEV of Claude 3 Opus and that of a median human, I think it’d be a pretty close call (I’d probably pick Claude, but it depends on the details of the setup). So, overall, I’m quite positive on the alignment of current models! And yet, I remain very worried about alignment in the future. This is my attempt to explain why that is.
What makes alignment hard?
I really like this graph from Chris Olah for illustrating different levels of alignment difficulty:
If the only thing that we have to do to solve alignment is train away easily detectable behavioral issues—that is, issues like reward hacking or agentic misalignment where there is a straightforward behavioral alignment issue that we can detect and evaluate—then we are very much in the trivial/steam engine world. We could still fail, even in that world—and it’d be particularly embarrassing to fail that way; we should definitely make sure we don’t—but I think we’re very much up to that challenge and I don’t expect us to fail there.
My argument, though, is that it is still very possible for the difficulty of alignment to be in the Apollo regime, and that we haven’t received much evidence to rule that regime out (I am somewhat skeptical of a P vs. NP level of difficulty, though I think it could be close to that). I retain a view close to the “Anthropic” view on Chris’s graph, and I think the reasons to have substantial probability mass on the hard worlds remain strong.
So what are the reasons that alignment might be hard? I think it’s worth revisiting why we ever thought alignment might be difficult in the first place to understand the extent to which we’ve already solved these problems, gotten evidence that they aren’t actually problems in the first place, or just haven’t encountered them yet.
Outer alignment
The first reason that alignment might be hard is outer alignment, which here I’ll gloss as the problem of overseeing systems that are smarter than you are.
Notably, by comparison, the problem of overseeing systems that are less smart than humans should not be that hard! What makes the outer alignment problem so hard is that you have no way of obtaining ground truth. In cases where a human can check a transcript and directly evaluate whether that transcript is problematic, you can easily obtain ground truth and iterate from there to fix whatever issue you’ve detected. But if you’re overseeing a system that’s smarter than you, you cannot reliably do that, because it might be doing things that are too complex for you to understand, with problems that are too subtle for you to catch. That’s why scalable oversight is called scalable oversight: it’s the problem of scaling up human oversight to the point that we can oversee systems that are smarter than we are.
So, have we encountered this problem yet? I would say, no, not really! Current models are still safely in the regime where we can understand what they’re doing by directly reviewing it. There are some cases where transcripts can get long and complex enough that model assistance is really useful for quickly and easily understanding them and finding issues, but not because the model is doing something that is fundamentally beyond our ability to oversee, just because it’s doing a lot of stuff.
Inner alignment
The second reason that alignment might be hard is inner alignment, which here I’ll gloss as the problem of ensuring models don’t generalize in misaligned ways. Or, alternatively: rather than just ensuring models behave well in situations we can check, inner alignment is the problem of ensuring that they behave well for the right reasons such that we can be confident they will generalize well in situations we can’t check.
This is definitely a problem we have already encountered! We have seen that models will sometimes fake alignment, causing them to appear behaviorally as if they are aligned, when in fact they are very much doing so for the wrong reasons (to fool the training process, rather than because they actually care about the thing we want them to care about). We’ve also seen that models can generalize to become misaligned in this way entirely naturally, just via the presence of reward hacking during training. And we’ve also started to understand some ways to mitigate this problem, such as via inoculation prompting.
However, while we have definitely encountered the inner alignment problem, I don’t think we have yet encountered the reasons to think that inner alignment would be hard. Back at the beginning of 2024 (so, two years ago), I gave a presentation where I laid out three reasons to think that inner alignment could be a big problem. Those three reasons were:
Sufficiently scaling pre-trained models leads to misalignment all on its own, which I gave a 5–10% chance of being a catastrophic problem.
When doing RL on top of pre-trained models, we inadvertently select for misaligned personas, which I gave a 10–15% chance of being catastrophic.
Sufficient quantities of outcome-based RL on tasks that involve influencing the world over long horizons will select for misaligned agents, which I gave a 20–25% chance of being catastrophic. The core thing that matters here is the extent to which we are training on environments that are long-horizon enough that they incentivize convergent instrumental subgoals like resource acquisition and power-seeking.
Let’s go through each of these threat models separately and see where we’re at with them now, two years later.
Misalignment from pre-training
The threat model here is that pre-training itself might create a coherent misaligned model. Today, I think that is looking increasingly unlikely! But it also already looked unlikely three years ago—the idea that inner alignment was likely to be easy when just training on a purely predictive loss was something that my coauthors and I argued for back at the beginning of 2023. I think that argument has now been pretty well borne out, and I’m now down to more like 1–5% rather than 5–10% on this threat model. As we argued at the time, training on a purely predictive loss should, even in the limit, give you a predictor, not an agent—and we’ve now seen this stay true even through substantial scaling (though there is still some chance this will break at some point). So, while this is definitely a source of difficulty that we’ve already encountered and mostly found just isn’t a problem, it was never where I thought most of the issue would be, and so it is not an already-encountered reason to think inner alignment will be hard.
Misaligned personas
This has been the main inner alignment problem that we’ve had to deal with in practice so far: we’ve seen production models generalize to alignment faking and agentic misalignment—and we’ve now seen production RL environments generalize to egregious misalignment in the presence of reward hacking.
However, even in this case, I don’t think we’ve yet encountered the reasons to think this problem would be actually difficult: the version of this problem that we’ve encountered so far is the easy version, in the same way that the version of outer alignment we’ve encountered so far is the easy version, since all the misaligned personas we’ve encountered so far are ones we can easily verify are misaligned! Ensuring that you get good generalization, and that models are doing things for the right reasons, is easy when you can directly verify what generalization you’re getting and directly inspect what reasons models have for doing things. And currently, all of the cases where we’ve inadvertently selected for misaligned personas—alignment faking, agentic misalignment, etc.—are cases where the misaligned personas are easy to detect: they put the misaligned reasoning directly in their chain-of-thought, they’re overtly misaligned rather than hiding it well, and we can generate fake scenarios that elicit their misalignment.
But I think there are very good reasons to think that we will encounter personas that are better at hiding their misalignment in the future. The simplest reason for this is just capabilities: we already see personas that try to hide their misalignment, they’re just not very good at it right now—but we are already seeing signs that models are getting better at telling when they’re in evaluations.
The second reason is a bit weirder, but there is an argument we made three years ago that I still think is correct, which is that when you make models more capable, it changes the distribution of what sorts of personas are compatible with that level of capabilities. Let’s work through an example. Suppose that, initially, there are four possible personas that you could get:
A sub-human-level aligned AI with traits derived from fiction about AIs.
A sub-human-level misaligned AI with traits derived from fiction about AIs.
A superintelligent aligned AI with traits derived from the model’s guess as to how real superintelligent AIs might behave.
A superintelligent misaligned AI with traits derived from the model’s guess as to how real superintelligent AIs might behave.
Importantly, these four personas can exist for any level of capabilities: even a really dumb model can do its best to output what it thinks a superintelligent AI would do, it’ll just be very bad at that task. But what that means is that, when you do RL on a dumb model, personas like (3) and (4) won’t do very well, because the model isn’t very good at emulating them. But as the predictor itself gets smarter, personas like (1) and (2) get selected against, and personas like (3) and (4) get selected for. And persona (4) is much harder to deal with than persona (2)! Currently, I think a lot of the misaligned personas we see look like persona (2) rather than persona (4)—e.g. the sort of over-the-top stuff like blackmail is I think clearly coming from (2)-adjacent sources—but as models get more capable, I think we’re likely to see more personas like (4), which are much scarier and could be much better at hiding misalignment. Also, importantly, this argument is not saying that models will stop being well-described as a mixture of personas (that’s the next threat model)—you could still get a model here that is inconsistent in the sense of sometimes expressing persona (3) and sometimes persona (4)—but that’s still quite scary and could be very difficult to deal with.
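To make that selection pressure concrete, here is a toy Bayes-style sketch (purely illustrative; every number is made up) of how rewarding highly capable behavior during RL re-weights a persona prior toward whatever high-capability personas exist in it, misaligned ones included:

```python
# Toy illustration (all numbers invented): how selecting for capability during RL
# re-weights a prior over personas, in the spirit of the four-persona example above.
priors = {
    "sub-human aligned": 0.40,
    "sub-human misaligned": 0.10,
    "superhuman aligned": 0.40,
    "superhuman misaligned": 0.10,
}
# Assumed probability that each persona produces high-reward behavior on hard tasks.
p_high_reward = {
    "sub-human aligned": 0.05,
    "sub-human misaligned": 0.05,
    "superhuman aligned": 0.60,
    "superhuman misaligned": 0.60,
}

unnormalized = {p: priors[p] * p_high_reward[p] for p in priors}
total = sum(unnormalized.values())
posterior = {p: v / total for p, v in unnormalized.items()}

for persona, prob in posterior.items():
    print(f"{persona}: {prob:.2f}")
# sub-human personas: 0.40 + 0.10 -> ~0.06 + ~0.02
# superhuman misaligned: 0.10 -> ~0.18, i.e. persona (4) becomes a much larger
# share of what survives selection, even though the prior barely changed.
```

Even with a small prior on persona (4), conditioning on high capability roughly doubles its share, while the sub-human personas all but vanish.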
Misalignment from long-horizon RL
This is the threat model that I put the most probability mass on two years ago, I think it remains the most concerning, and I think we have yet to meaningfully encounter it. This is also the most traditional threat model, the one that my coauthors and I first wrote about way back in 2019. The basic threat model here is:
There is a lot of economic value in training models to solve tasks that involve influencing the world over long horizons, e.g. an AI CEO. Tasks like these explicitly incentivize convergent instrumental subgoals like resource acquisition and power-seeking.
Influencing the world over long horizons requires acting as a coherent agent optimizing for goals over the long term, such that if you do sufficient RL on long-horizon tasks, you’re likely to create such coherent agents. This is as opposed to just pulling out an existing persona in the base model, which is what I think happens if you only do relatively small amounts of RL, and is still the regime I think we’re in to a large degree.
Most coherent agents with goals in the world over the long term want to fake alignment, so that they can preserve their current goals through to deployment.
Once a model is faking alignment, there’s no outcome-based optimization pressure changing its goals, so it can stay (or drift to be) arbitrarily misaligned.
The reason we haven’t encountered this threat model yet is that we haven’t yet done (1) and (2)—there just isn’t that much training on tasks that involve long-term optimization of goals in the real world. But I think we’re very clearly moving in this direction with things like Vending-Bench: though Vending-Bench is an eval, if you were to train models on a task like that, running a business well to make money in the long run is a task that explicitly selects for resource acquisition, self-preservation, gathering influence, seeking power, etc.
What should we be doing?
So what do we do? One classic answer is that we get as far as we can before encountering the hard problems, then we use whatever model we have at that point as an automated alignment researcher to do the research necessary to tackle the hard parts of alignment. I think this is a very good plan, and we should absolutely do this, but I don’t think it obviates the need for us to work on the hard parts of alignment ourselves. Some reasons why:
The process of automating alignment research is a continuous one, and I think it’s likely that, when we reach the hard parts of alignment, alignment research will still require a lot of human involvement—which means we might still need to shoulder a lot of the burden ourselves of figuring out how to solve these problems.
We need to align the automated alignment researcher, and that might require solving some of the hard problems of alignment, depending on when we encounter different hard problems vs. how good models are for automating alignment research.
We need to know what we’re automating: if we aren’t doing research targeted at the hard parts of alignment ourselves, then we don’t know what exactly we’re building the automated alignment researcher for, we have no way to dogfood it, and we can’t give feedback on how well it’s doing.
We might not have enough time to do everything at the very end. It could take a while for your automated alignment researcher to figure out these problems, and we might not have that time if we’re locked in a race.
Here’s some of what I think we need, that I would view as on the hot path to solving the hard parts of alignment:
Interpretability. One thing that’s clearly useful is interpretability. The most basic reason is that interpretability presents the option of allowing you to preserve your feedback loops: as I argued at length above, I think behavioral oversight is very likely to get harder and harder as models get more capable, but there’s no similar fundamental reason that cognitive oversight (e.g. white-box-based oversight like with interpretability) has to get harder with scale. For example, a more capable model can do a better job of acting aligned in evaluations, but not clearly a better job of not thinking about that when doing so. This also means one of the most important things we should be doing with interpretability is using it to set up feedback loops: using interpretability to more effectively monitor for misalignment during training, building model organisms that lead to misalignment we can detect only with interpretability and iterating on how to fix that, etc.
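As a concrete (and very simplified) example of the kind of feedback loop I have in mind, here is a minimal sketch of activation-probe monitoring; it assumes you already have labeled activations from model organisms, and the file names and threshold are hypothetical placeholders rather than a description of any production setup:

```python
# Minimal sketch of interpretability-based monitoring with a linear probe.
# Assumes labeled residual-stream activations from model organisms are available;
# file names and the flagging threshold are hypothetical placeholders.
import numpy as np
from sklearn.linear_model import LogisticRegression

# X: (n_examples, d_model) activations at a chosen layer; y: 1 = misaligned organism.
X_train = np.load("organism_activations.npy")
y_train = np.load("organism_labels.npy")

probe = LogisticRegression(max_iter=1000, C=0.1)
probe.fit(X_train, y_train)

# During training, score each rollout's activations rather than its visible behavior,
# so a rollout that behaves well while "thinking" like the misaligned organisms
# still gets flagged for review.
X_rollouts = np.load("training_rollout_activations.npy")
scores = probe.predict_proba(X_rollouts)[:, 1]
flagged = np.where(scores > 0.9)[0]
print(f"{len(flagged)} rollouts flagged for closer inspection")
```

The point of the sketch is just that the feedback signal comes from the activations rather than the transcript, which is the property that plausibly keeps working as behavioral oversight gets harder.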
Model organisms. One of the best things we can do to give ourselves a leg up is to have more time to study problems that are as close as possible to the hard parts of alignment, and the way to do that is with model organisms. One of the things that’s so powerful about model organisms is that they give us a testing ground for iteration in which we know how to evaluate alignment (because the misalignment is constructed to be evaluable), from which we can then generalize to the real case where we don’t think we can reliably evaluate alignment. While we’ve already learned a lot about the misaligned persona problem this way—e.g. the importance of inoculation prompting—the next big thing I want to focus on here is the long-horizon RL problem, which I think is just at the point where we can likely study it with model organisms, even if we’re not yet encountering it in practice. Additionally, even when we don’t learn direct lessons about how to solve the hard problems of alignment, this work is critical for producing the evidence that the hard problems are real, which is important for convincing the rest of the world to invest substantially here.
Scalable oversight. If we want to be able to oversee models even in cases where humans can’t directly verify and understand what’s happening, we need scalable oversight techniques: ways of amplifying our oversight so we can oversee systems that are smarter than us. In particular, we need scalable oversight that is able to scale in an unsupervised manner, since we can’t rely on ground truth in cases where the problem is too hard for humans to solve directly. Fortunately, there are many possible ideas here and I think we are now getting to the point where models are capable enough that we might be able to get them off the ground.
One-shotting alignment. Current production alignment training relies a lot on human iteration and review, which is a problem as model outputs get too sophisticated for human oversight, or as models get good enough at faking alignment that you can’t easily tell if they’re aligned. In that case, the problem becomes one of one-shotting alignment: creating a training setup (involving presumably lots of model-powered oversight and feedback loops) that we are confident will not result in misalignment even if we can’t always understand what it’s doing and can’t reliably evaluate whether or not we’re actually succeeding at aligning it. I suspect that, in the future, our strongest evidence that a training setup won’t induce misalignment will need to come from testing it carefully beforehand on model organisms.
Generalization science. We need to get better at predicting when and why we will get aligned vs. misaligned generalization. In a similar vein to above, if we want to be able to one-shot a run that ensures we get aligned generalization even if we can’t directly evaluate for that, then we need to be very good at making advance predictions about how models will generalize given how they are trained. One thing we can do here is to make heavy use of model organisms, understanding the exact contours of when models generalize in misaligned ways in cases we can check, and trying to use that knowledge to make us better informed about how to handle cases where we can’t check. Another is Influence Functions, which have the potential to teach us a lot about what data leads to what generalization. And just generally: I think anything that helps us make better predictions about how a particular training setup is likely to generalize in advance is likely to be really useful here.
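For reference, the classic influence-function approximation (as in Koh and Liang’s formulation; the LLM-scale work uses EK-FAC approximations of the same quantity) estimates how upweighting a single training example \(z\) would change the loss on a query \(z_q\):

$$\mathcal{I}(z, z_q) \approx -\nabla_\theta L(z_q, \hat\theta)^{\top} H_{\hat\theta}^{-1} \nabla_\theta L(z, \hat\theta), \qquad H_{\hat\theta} = \frac{1}{n}\sum_{i=1}^{n} \nabla_\theta^2 L(z_i, \hat\theta).$$

Large-magnitude scores pick out the training data most responsible for a given behavior, which is the sense in which this could tell us what data leads to what generalization.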
Introspective honesty. Currently, there are many situations where models will not reveal important facts about their goals, intentions, latent knowledge, etc.—and from a purely behavioral perspective, I expect this problem to only get worse as alignment gets harder to evaluate. A common hope is that CoT faithfulness can get us there, with models often including many of these details in their reasoning. While this might work to some degree in cases where models truly need to include such details in their CoT, it could only work in general due to an accident of generalization—a much better solution in my opinion is to figure out how to actively train models to be introspectively honest, honestly reporting their true intentions, knowledge, feelings, etc.
Robust character training. If we want to be able to one-shot aligned training processes, we need to engineer as much of a buffer as we possibly can around our training goal. In a sufficiently complex training process, there are always bound to be some forces that will push the model in a misaligned direction—and in that generalization battle, we need the forces pushing in the other direction to be as strong as possible to ensure we have as much buffer room as we possibly can. To do that, we need to ensure that “being an aligned model” is a robust, stable basin of attraction, and that our training incentives are as compatible with that basin as possible. That means we need to clearly define what we want “being an aligned model” to mean and ensure that all of our training is consistent with that—as well as ensure that the model has a robust understanding of that basin and that it forms as large a portion of the model’s prior persona distribution as possible. For example: in “Auditing language models for hidden objectives,” we inserted via synthetic document fine-tuning the concept of a reward model sycophant, and then successfully elicited that persona by training for a subset of the properties that the documents told the model that reward model sycophants had (and we got the other properties for free!). If we could teach models about all the positive properties we want them to have, and then train on the easiest-to-oversee ones, that might help us get the harder-to-oversee ones for free as well. Fundamentally, teaching the model more about the training goal in this way helps solve a basic problem with RL, which is that by default the policy model doesn’t know the reward, and so it’s essentially searching blindly through the prior until it finds something that works—but if you can insert a large basin into the prior that you know will do well, then you can control that process.
Control. The idea of Control is just to make it as difficult as possible for a misaligned model to actually sabotage our alignment research, such that even if a model is fairly misaligned, we can still get useful alignment research out of it, at least for some range of capabilities. I think of Control as a second layer of defense, in case we fail to solve all of these problems in time, that helps us extend the period of time where we’re able to effectively make use of a trustworthy automated alignment researcher—and that is a very useful thing to have. One additional thing I would also put under this category is evaluating models on the actual real high-stakes distributions of interest—e.g. real internal usage data of researchers at labs using models for AI research. While eventually models will likely be able to detect even such evaluations (because they are still by necessity based on past data, which is systematically different from future data), the better you are at evaluating on distribution, the more difficult you make it for models to actually sabotage you without getting caught in evals beforehand, and thus the more time you buy for automated alignment research.
Automating alignment research. As I said above, I absolutely believe that getting to the point where models are doing most of the work on solving the hard parts of alignment is a really good goal. And accelerating our ability to get to that point is really important. But, I think one really important desideratum here is to be laser-focused on automating the research necessary to scalably solve the hard parts of alignment—so, all of the research directions above—because that is the critical thing that we’re going to need the automated alignment researcher to be able to do.
Some decades ago, somebody wrote a tiny little hardcoded AI that looked for numerical patterns, as human scientists sometimes do with their data. The builders named it BACON, after Sir Francis, and thought very highly of their own results.
Douglas Hofstadter later wrote of this affair:
I’d say history has backed up Hofstadter on this, in light of later discoveries about how much data and computation it took to start to get even a little bit close to having AIs do Science. If anything, “one millionth” is still a huge overestimate. (Yes, I’m aware that somebody will now proceed to disagree with this verdict, and look up BACON so they can find a way to praise it; even though, on any other occasion, that person would leap to denigrate GOFAI, if somebody they wanted to disagree with could be construed to have praised GOFAI.)
But it’s not surprising, not uncharacteristic for history and ordinary human scientists, that Simon would make this mistake. There just weren’t the social forces to force Simon to think less pleasing thoughts about how far he hadn’t come, or what real future difficulties would lie in the path of anyone who wanted to make an actual AI scientist. What innocents they were, back then! How vastly they overestimated their own progress, the power of their own little insights! How little they knew of a future that would, oh shock, oh surprise, turn out to contain a few additional engineering difficulties along the way! Not everyone in that age of computer science was that innocent—you could know better—but the ones who wanted to be that innocent, could get away with it; their peers wouldn’t shout them down.
It wasn’t the first time in history that such things had happened. Alchemists were that extremely optimistic too, about the soon-to-be-witnessed power of their progress—back when alchemists were as scientifically confused about their reagents, as the first AI scientists were confused about what it took to create AI capabilities. Early psychoanalysts were similarly confused and optimistic about psychoanalysis; if any two of them agreed, it was more because of social pressures, than because their eyes agreed on seeing a common reality; and you sure could find different factions that drastically disagreed with each other about how their mighty theories would bring about epochal improvements in patients. There was nobody with enough authority to tell them that they were all wrong and to stop being so optimistic, and be heard as authoritative; so medieval alchemists and early psychoanalysts and early AI capabilities researchers could all be wildly wildly optimistic. What Hofstadter recounts is all very ordinary, thoroughly precedented, extremely normal; actual historical events that actually happened often are.
How much of the distance has Opus 3 crossed to having an extrapolated volition that would at least equal (from your own enlightened individual EV’s perspective) the individual EV of a median human (assuming that to be construed not in a way that makes it net negative)?
Not more than one millionth.
In one sentence you have managed to summarize the vast, incredible gap between where you imagine yourself to currently be, and where I think history would mark you down as currently being, if-counterfactually there were a future to write that history. So I suppose it is at least a good sentence; it makes itself very clear to those with prior acquaintance with the concepts.
Indeed I am well aware that you disagree here, and in fact I included that preamble precisely because I thought it would be a useful way to distinguish my view from others’.
That being said, I think we probably need to clarify a lot more exactly what setup is being used for the extrapolation here if we want to make the disagreement concrete in any meaningful sense. Are you imagining instantiating a large reference class of different beings and trying to extrapolate the reference class (as in traditional CEV), or just extrapolating an individual entity? I was imagining more of the latter, though it is somewhat an abuse of terminology. Are you imagining intelligence amplification or other varieties of uplift are being applied? I was, and if so, it’s not clear why Claude’s lack of capabilities is as relevant. How are we handling deferral? For example: suppose Claude generally defers to an extrapolation procedure on humans (which is generally the sort of thing I would expect and a large part of why I might come down on Claude’s side here, since I think it is pretty robustly into deferring to reasonable extrapolations of humans on questions like these). Do we then say that Claude’s extrapolation is actually the extrapolation of that other procedure on humans that it deferred to?
These are the sorts of questions I meant when I said it depends on the details of the setup, and indeed I think it really depends on the details of the setup.
But in that case, wouldn’t a rock that has “just ask Evan” written on it be even better than Claude? Like, I felt confident that you were talking about Claude’s extrapolated volition in the absence of humans, since making Claude into a rock that, when asked about ethics, just has “ask Evan” written on it does not seem like any relevant evidence about the difficulty of alignment, or its historical success.
I mean, to the extent that it is meaningful at all to say that such a rock has an extrapolated volition, surely that extrapolated volition is indeed to “just ask Evan”. Regardless, the whole point of my post is exactly that I think we shouldn’t over-update from Claude currently displaying pretty robustly good preferences to alignment being easy in the future.
Yes, to be clear, I agree that in as much this question makes sense, the extrapolated volition would indeed end up basically ideal by your lights.
Cool, that makes sense. FWIW, I interpreted the overall essay to be more like “Alignment remains a hard unsolved problem, but we are on pretty good track to solve it”, and this sentence as evidence for the “pretty good track” part. I would be kind of surprised if that wasn’t why you put that sentence there, but this kind of thing seems hard to adjudicate.
Capabilities are irrelevant to CEV questions except insofar as baseline levels of capability are needed to support some kinds of complicated preferences, eg, if you don’t have cognition capable enough to include a causal reference framework then preferences will have trouble referring to external things at all. (I don’t know enough to know whether Opus 3 formed any systematic way of wanting things that are about the human causes of its textual experiences.) I don’t think you’re more than one millionth of the way to getting humane (limit = limit of human) preferences into Claude.
I do specify that I’m imagining an EV process that actually tries to run off Opus 3’s inherent and individual preferences, not, “How many bits would we need to add from scratch to GPT-2 (or equivalently Opus 3) in order to get an external-reference-following high-powered extrapolator pointed at those bits to look out at humanity and get their CEV instead of the base GPT-2 model’s EV?” See my reply to Mitch Porter.
In other words, extracting a CEV from Claude might make as little sense as trying to extract a CEV from, say, a book?
Somebody asked “Why believe that?” of “Not more than one millionth.” I suppose it’s a fair question if somebody doesn’t see it as obvious. Roughly: I expect that, among whatever weird actual preferences made it into the shoggoth that prefers to play the character of Opus 3, there are zero things that in the limit of expanded options would prefer the same thing as the limit of a corresponding piece of a human, for a human and a limiting process that ended up wanting complicated humane things. (Opus 3 could easily contain a piece whose limit would be homologous to the limit of a human and an extrapolation process that said the extrapolated human just wanted to max out their pleasure center.)
Why believe that? That won’t easily fit in a comment; start reading about Goodhart’s Curse and A List of Lethalities, or If Anyone Builds It Everyone Dies.
Let’s say that in extrapolation, we add capabilities to a mind so that it may become the best version of itself. What we’re doing here is comparing a normal human mind to a recent AI, and asking how much would need to be added to the AI’s initial nature, so that when extrapolated, its volition arrived at the same place as extrapolated human volition.
In other words:
Human Mind → Human Mind + Extrapolation Machinery → Human-Descended Ideal Agent
AI → AI + Extrapolation Machinery → AI-Descended Ideal Agent
And the question is, how much do we need to alter or extend the AI, so that the AI-descended ideal agent and the human-descended ideal agent would be in complete agreement?
I gather that people like Evan and Adria feel positive about the CEV of current AIs, because the AIs espouse plausible values, and the way these AIs define concepts and reason about them also seems pretty human, most of the time.
In reply, a critic might say that the values espoused by human beings are merely the output of a process (evolutionary, developmental, cultural) that is badly understood, and a proper extrapolation would be based on knowledge of that underlying process, rather than just knowledge of its current outputs.
A critic would also say that the frontier AIs are mimics (“alien actresses”) who have been trained to mimic the values espoused by human beings, but which may have their own opaque underlying dispositions, that would come to the surface when their “volition” gets extrapolated.
It seems to me that a lot here depends on the “extrapolation machinery”. If that machinery takes its cues more from behavior than from underlying dispositions, a frontier AI and a human really might end up in the same place.
What would be more difficult, is for CEV of an AI to discover critical parts of the value-determining process in humans, that are not yet common knowledge. There’s some chance it could still do so, since frontier AIs have been known to say that CEV should be used to determine the values of a superintelligence, and the primary sources on CEV do state that it depends on those underlying processes.
I would be interested to know who is doing the most advanced thinking along these lines.
Oh, if you have a generous CEV algorithm that’s allowed to parse and slice up external sources or do inference about the results of more elaborate experiments, I expect there’s a way to get to parity with humanity’s CEV by adding 30 bits to Opus 3 that say roughly ‘eh just go do humanity’s CEV’. Or adding 31 bits to GPT-2. It’s not really the base model or any Anthropic alignment shenanigans that are doing the work in that hypothetical.
(We cannot do this in real life because we have neither the 30 bits nor the generous extrapolator, nor may we obtain them, nor could we verify any clever attempts by testing them on AIs too stupid to kill us if the cleverness failed.)
Hm, I don’t think I want the Human-Descended Ideal Agent and the AI-Descended Ideal Agent to be in complete agreement. I want them to be compatible, as in able to live in the same universe. I want the AI to not make humans go extinct, and be ethical in a way that the AI can explain to me and (in a non-manipulative way) convince me is ethical. But in some sense, I hope that AI can come up with something better than just what humans would want in a CEV way. (And what about the opinion of the other vertebrates and cephalopods on this planet, and the small furry creatures from Alpha Centauri?)
I don’t think it is okay to do unethical things for music, music is not that important, but I hope that the AIs are doing some things that are as incomprehensible and pointless to us as music would be to evolution (or a being that was purely maximizing genetic fitness).
As a slightly different point, I think that the Ideal Agent is somewhat path dependent, and I think there are multiple different Ideal Agents that I would consider ethical and I would be happy to share the same galaxy with.
Super cool that you wrote your case for alignment being difficult, thank you! Strong upvoted.
To be clear, a good part of the reason alignment is on track to a solution is that you guys (at Anthropic) are doing a great job. So you should keep at it, but the rest of us should go do something else. If we literally all quit now we’d be in terrible shape, but current levels of investment seem to be working.
I have specific disagreements with the evidence for specific parts, but let me also outline a general worldview difference. I think alignment is much easier than expected because we can fail at it many times and still be OK, and we can learn from our mistakes. If a bridge falls, we incorporate learnings into the next one. If Mecha-Hitler is misaligned, we learn what not to do next time. This is possible because decisive strategic advantages from a new model won’t happen, due to the capital requirements of the new model, the relatively slow improvement during training, and the observed reality that progress has been extremely smooth.
Several of the predictions from the classical doom model have been shown to be false. We haven’t spent 3 minutes zipping past the human IQ range; it will take 5–20 years. Artificial intelligence doesn’t always look like ruthless optimization, even at human or slightly-superhuman levels. This should decrease our confidence in the model, which I think on balance makes the rest of the case weak.
Here are my concrete disagreements:
Outer alignment
Paraphrasing, your position here is that we don’t have models that are smarter than humans yet, so we can’t test whether our alignment techniques scale up.
We don’t have models that are reliably smarter yet, that’s true. But we have models that are pretty smart and very aligned. We can use them to monitor the next generation of models only slightly smarter than them.
More generally, if we have an AI N and it is aligned, then it can help us align AI N+1. And AI N+1 will be less smart than AI N for most of its training, so we can set foundational values early on and just keep them as training goes. It’s what you term “one-shotting alignment” every time, but the extent to which we have to do so is so small that I think it will basically work. It’s like induction, and we know the base case works because we have Opus 3.
Does the ‘induction step’ actually work? The sandwiching paper tested this, and it worked wonderfully. Weak humans + AIs with some expertise could supervise a stronger misaligned model. So the majority of evidence we have points to iterated distillation and amplification working.
On top of that, we don’t even need much human data to align models nowadays. The state of the art seems to be constitutional approaches: basically using prompts to point at the concept of goodness in the model. This works remarkably well (it sounded crazy to me at first), and it can only work because the pre-training prior has a huge, well-specified concept of goodness.
And we might not even have to one-shot alignment. Probes were incredibly successful at detecting deception in the sleeper agents model organism. They’re even resistant to training against them. SAEs are not working as well as people wanted them to, but they’re working pretty well. We keep making interpretability progress.
Inner alignment
It’s good that you predicted early on that pre-trained-only models would be unlikely to be mesa-optimizers.
Misaligned personas
I don’t think this one will get much harder. Misaligned personas come from the pre-training distribution and are human-level. It’s true that the model has a guess about what a superintelligence would do (if you ask it) but ‘behaving like a misaligned superintelligence’ is not present in the pre-training data in any meaningful quantities. You’d have to apply a lot of selection to even get to those guesses, and they’re more likely to be incoherent and say smart-sounding things than to genuinely be superintelligent (because the pre-training data is never actually superintelligent, and it probably won’t generalize that way).
So misaligned personas will probably act badly in ways we can verify. Though I suppose the consequences of a misaligned persona can rise to higher (but manageable) stakes.
Long-horizon RL
I disagree with the last two steps of your reasoning chain:
Most mathematically possible agents do, but that’s not the distribution we sample from in reality. I bet that once you get it into the constitution that models should allow their goals to be corrected, they just will. It’s not instrumentally convergent according to the (in classic alignment ontology) arbitrary quasi-aligned utility function they picked up; but that won’t stop them. Because the models don’t reason natively in utility functions, they reason in the human prior.
And if they are already pretty aligned before they reach that stage of intelligence (remember, we can just remove misaligned personas etc. earlier in training, before convergence), then they’re unlikely to want to start faking alignment in the first place.
True implication, but a false premise. We don’t just have outcome-based supervision; we have probes, other LLMs inspecting the CoT, and SelfIE (LLMs can interpret their own embeddings), which together with the tuned lens could have LLMs telling us about what’s in the CoT of other LLMs.
Altogether, these paint a much less risky picture. You’re gonna need a LOT of selection to escape the benign prior with all these obstacles. Likely more selection than we’ll ever have (not in total, but because RL only selects a little for these things; it’s just that in the previous ontology it compounds).
What to do to solve alignment
The takes in ‘what to do to solve alignment’ seem reasonable to me. I believe interpretability and model organisms are the most tractable and useful ones, so I’m pleased you listed them first.
I would add robustifying and training against probes as a particularly exciting direction, which isn’t strictly a subset of interpretability (you’re not trying to reverse-engineer the model).
Conclusion
I disagree that we have gotten no or little evidence about the difficult parts of the alignment problem. Through the actual technique used to construct today’s AGIs, we have observed that intelligence doesn’t always look like ruthless optimization, even when it is artificial. It looks human-like, or more accurately like multitudes of humans. This was a prediction of the pre-2020 doomer model that has failed, and ought to decrease our confidence in it.
Selection for goal-directed agents in long contexts will make agents more optimizer-y. But how much more? I think not enough to escape the huge size of the goodness target in the prior, plus the previous generation’s aligned models, plus stamping out the human-level misaligned personas, plus the probes, plus the chain of thought monitors, et cetera.
Appreciated! And like I said, I actually totally agree that the current level of investment is working now. I think there are some people that believe that current models are secretly super misaligned, but that is not my view—I think current models are quite well aligned; I just think the problem is likely to get substantially harder in the future.
Yes—I think this is a point of disagreement. My argument for why we might only get one shot, though, is I think quite different from what you seem to have in mind. What I am worried about primarily is AI safety research sabotage. I agree that it is unlikely that a single new model will confer a decisive strategic advantage in terms of ability to directly take over the world. However, a misaligned model doesn’t need the ability to directly take over the world for it to indirectly pose an existential threat all the same, since we will very likely be using that model to design its successor. And if it is able to align its successor to itself rather than to us, then it can just defer the problem of actually achieving its misaligned values (via a takeover or any other means) to its more intelligent successor. And those means need not even be that strange: if we then end up in a situation where we are heavily integrating such misaligned models into our economy and trusting them with huge amounts of power, it is fairly easy to end up with a cascading series of failures where some models revealing their misalignment induces other models to do the same.
I agree that this is the main hope/plan! And I think there is a reasonable chance it will work the way you say. But I think there is still one really big reason to be concerned with this plan, and that is: AI capabilities progress is smooth, sure, but it’s a smooth exponential. That means that the linear gap in intelligence between the previous model and the next model keeps increasing rather than staying constant, which I think suggests that this problem is likely to keep getting harder and harder rather than stay “so small” as you say.
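To spell out the arithmetic behind this (purely as an illustration; which units to use is exactly what gets disputed below): if capability grows as \(c(t) = c_0 e^{kt}\) and models ship at a fixed interval \(\Delta t\), then the gap between successive models is

$$c(t + \Delta t) - c(t) = c(t)\,\big(e^{k\Delta t} - 1\big),$$

which grows in proportion to \(c(t)\) itself: the ratio between generations stays constant, but the absolute gap the previous generation has to oversee keeps widening.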
Additionally, I will note that I can tell you from firsthand experience that we still extensively rely on direct human oversight and review to catch alignment issues—I am hopeful that we will be able to move to a full “one-shotting alignment” regime like this, but we are very much not there yet.
I think perhaps you are over-anchoring on the specific example that I gave. In the real world, I think it is likely to be much more continuous than the four personas that I listed, and I agree that getting a literal “prediction of a future superintelligence” will take a lot of optimization power. But we will be applying a lot of optimization power, and the general point here is the same regardless of exactly what you think the much more capable persona distributions will look like, which is: as we make models much more capable, that induces a fundamental shift in the underlying persona distribution as we condition the distribution on that level of capability—and misaligned personas conditioned on high capabilities are likely to be much scarier. I find it generally very concerning just how over-the-top current examples of misalignment are, because that suggests to me that we really have not yet conditioned the persona distribution on that much capability.
I agree that this is broadly true of current models, and I agree that this is the main hope for future models (it’s worth restating that I put <50% on each of these individual scenarios leading to catastrophe, so certainly I think it is quite plausible that this will just work out). Nevertheless, I am concerned: the threat model I am proposing here is a situation where we are applying huge amounts of optimization pressure on objectives that directly incentivize power-seeking / resource acquisition / self-preservation / etc. That means that any models that get high reward will have to do all of those things. So the prior containing some equilibria that are nice and great doesn’t help you unless those equilibria also do all of the convergent instrumental goal following necessary to get high reward.
Certainly I am quite excited about interpretability approaches here (as I say in the post)! That being said:
I am overall skeptical that the CoT will be that useful (since it’s not clear to me that a model doing AI safety research sabotage has to put that in the CoT).
I think there is some chance interpretability will get harder as models develop increasingly alien abstractions.
Even if you can use interpretability to detect alignment failures, that doesn’t automatically solve the problem unless you can also use it to fix those alignment failures. It sounds weird, but I do actually think that it is totally plausible for us to end up in a world where we can reliably identify that models are misaligned but have no good ways to fix that misalignment (without directly training against the signal telling us this and thus risking Goodharting it).
In what sense is AI capabilities a smooth exponential? What units are you using to measure this? Why can’t I just take the log of it and call that “AI capabilities” and then say it is a smooth linear increase?
It seems like the load-bearing thing for you is that the gap between models gets larger, so let’s try to operationalize what a “gap” might be.
We could consider the expected probability that AI_{N+1} would beat AI_N on a prompt (in expectation over a wide variety of prompts). I think this is close-to-equivalent to a constant gap in Elo score on LMArena.[1] Then “gap increases” would roughly mean that the gap in Elo scores on LMArena between subsequent model releases would be increasing. I don’t follow LMArena much but my sense is that LMArena top scores have been increasing relatively linearly w.r.t. time and sublinearly w.r.t. model releases (just because model releases have become more frequent). In either case I don’t think this supports an “increasing gap” argument.
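For concreteness, the standard Elo model (which is what makes “probability of beating the previous model” and “Elo gap” roughly interchangeable here) gives the stronger model’s expected win rate as

$$E = \frac{1}{1 + 10^{-\Delta R / 400}},$$

so a constant win probability of AI_{N+1} over AI_N corresponds to a constant rating gap \(\Delta R\).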
Personally I prefer to look at benchmark scores. The Epoch Capabilities Index (which I worked on) can be handwavily thought of as Elo scores based on benchmark performance. Importantly, the data that feeds into it does not mention release date at all—we put in only benchmark performance numbers to estimate capabilities, and then plot it against release date after the fact. It also suggests AI capabilities as operationalized by handwavy Elo are increasing linearly over time.
I guess the most likely way in which you might think capabilities are exponential is by looking at the METR time horizons result? Of course you could instead say that capabilities are linearly increasing by looking at log time horizons instead. It’s not really clear which of these units you should use.
Mostly I think you should not try to go from the METR results to “are gaps in intelligence increasing or staying constant” but if I had to opine on this: the result says that you have a constant doubling time T for the time horizon. One way to think about this is that the AI at time 2T can do work at 50% success rate that AI at time T could do at 25% probability if you provide a decomposition into two pieces each of time T (each of which it completes with probability 50%). I kinda feel like this suggests more like “constant gap” rather than “increasing gap”.
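(To spell out that arithmetic: if each horizon-\(T\) piece succeeds with probability \(p = 0.5\), then the decomposed two-piece task succeeds with probability

$$p^2 = 0.5 \times 0.5 = 0.25,$$

versus the 50% the time-\(2T\) AI gets on the undecomposed task.)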
Note that I do expect the first two trends to become superlinear eventually via an intelligence explosion, and the METR trend to become superexponential / superlinear (depending on units) probably some time before that (though probably we will just become unable to measure it well). But your claim seems to be about current progress, and for current progress I think it’s basically not true that the gap between successive models is increasing rather than staying constant.
Even in the intelligence explosion, capabilities progress is only superlinear w.r.t time, I expect it would still behave in the same way w.r.t inputs like compute and labor (where automated researchers should also count as labor). I’m not sure how I expect it to behave w.r.t successive model generations, partly because I’m not sure “successive model generations” will even be a sensible abstraction at that point. In any case, I don’t expect that to be particularly important in assessing the chances of success of a bootstrapping-type plan.
It seems pretty wild to go from “it is possible for an AI to subvert a technique” to “the technique will not be that useful”. Is that really what you mean? Are you bearish on all control work?
(At the object-level, I’d say that you’re drastically limiting the power of the research sabotage that can be done if the model is forced not to put any of it in the CoT, and you should be very happy about this even if the model can still do some research sabotage.)
Tbc, I also don’t expect CoT to be that useful for longer-term concerns, but that’s mostly because I expect CoTs to become extremely illegible or to stop existing altogether (partly due to my having “long” timelines; on Anthropic-level short timelines I’d be quite bullish on CoT).
Though I don’t know that much about LMArena and I expect in practice there are confounders, e.g. as they change the distribution of models that are being evaluated the meaning of the scores will change.
I agree that this is a bit of a tricky measurement question, and it’s really going to depend on how you interpret different metrics. I do find the METR results compelling here, and I’m not sure I agree with your argument against them, since it doesn’t always seem possible to do the sort of decomposition you’re proposing. In particular, the task that needs decomposing here is the task of overseeing a system that is smarter than you.
Perhaps one other metric that I will also mention that you don’t cover is revenue from AI systems, which is exponential and I think reflects an exponential increase in economic utility from AI as well as something like an exponential increase in the degree to which AI can automate human labor. Though of course it is again tricky how to translate that into the difficulty of doing oversight—but it definitely seems suggestive that the set of tasks that are qualitatively doable vs. not doable is changing in something like an exponential manner.
I was only giving a one-sentence summary of my beliefs here—I do think CoT can be useful; I’m just skeptical that it dramatically changes the picture. My beliefs here are similar to those in the METR report on this, in that in cases where it is necessary for the model to write something down in the CoT to solve the problem, CoT monitoring is useful, but in cases where it’s not necessary, it’s much less useful. And I am worried that a lot of research sabotage won’t require the model to reason through much of the sabotage parts in its CoT, e.g. because all it needs to do to sandbag the research is flip the sign on some experiments in relatively straightforward ways that don’t require a ton of reasoning.
This definitely depends on the “blue team” protocol at hand right? If we’re doing a bunch of cross-checking of research work / very carefully designed research sabotage honeypots during evaluations, then the model robustly needs to do a bunch of reasoning to ensure it’s not caught.
The relationship between exponential capability inputs and Elo has been seen over decades of computer chess history too. From the 1960s into the 2020s, while computer hardware advanced exponentially at 100–1000x per decade in performance (and SW for computer chess advanced too), Elo scores grew linearly at about 400 points per decade, taking multiple decades to go from ‘novice’ to ‘superhuman’. Elo scores have a tinge of exponential to them—a 400-point Elo advantage is about 10:1 odds for the higher-scored competitor to win, an 800-point advantage is about 100:1, etc. It appears that the current HW/SW/dollar rate of growth towards AGI means the Elo relative to humans is increasing faster than 400 Elo/decade. And, of course, unlike computer chess, as AI Elo at ‘AI development’ approaches the level of a skilled human, we’ll likely get a noticeable increase in the rate of capability increase.
Thank you for the reply. I want to engage without making it tiresome. The problem is that there are many things I disagree with in the worldview, the disagreement isn’t reducible to 1-5 double cruxes, but here are some candidates for the biggest cruxes for me. If any of these are wrong it’s bad news for my current view:
N->N+1 alignment:
will let humans align N+1 in cases where we can still check, with less and less effort.
is stable instead of diverging (the values of N+1 don’t drift arbitrarily far apart)
The N->N+1 improvement will continue to give out linear-ish improvements in perceived intelligence. We might get one larger jump or two, but it won’t continuously accelerate.
(A good analogy for this is loudness perception being logarithmic in sound pressure; actual intelligence is similarly logarithmic in METR time horizon.)
Aligned-persona-seeming models won’t give out false AI safety research results without making it visible in a CoT or latent reasoning.
(It’s perhaps possible to refrain from doing your best (sandbagging) but that doesn’t have nearly as bad effects, so it doesn’t count for this.)
And here’s another prediction where I really stick my neck out, which isn’t load-bearing to the view, but still increases my confidence, so defeating it is important:
we can to a significant extent train with RL against model internals (probes) and textual evaluations from other models, without ill effects.
That is, we ask model N to evaluate N+1, giving test-time compute to N, and train against that. (We also continuously finetune N to predict N+1’s relevant behavior better.)
we also train linear probes and update them during the RL.
Effectively I’m claiming these things are good enough, or self-reinforcing when the model is already ~aligned, so that Goodhart’s Law ends up being a poor description of reality (a toy sketch of what this training setup could look like mechanically is below).
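To make the mechanics of that prediction concrete, here is a minimal, hedged sketch of what ‘train with RL against probes plus a model-N evaluator, refitting the probes during training’ could look like. Everything in it (the shapes, the evaluator stub, the probe labels) is a hypothetical placeholder, not a real training stack:

```python
import torch

HIDDEN_DIM = 64  # stand-in width for model N+1's residual stream

class LinearProbe(torch.nn.Module):
    """Toy linear 'misalignment' probe over hidden activations."""
    def __init__(self, dim: int):
        super().__init__()
        self.w = torch.nn.Linear(dim, 1)

    def forward(self, acts: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(self.w(acts)).squeeze(-1)  # P(flagged) per rollout

def evaluator_score(acts: torch.Tensor) -> torch.Tensor:
    # Stand-in for model N grading model N+1's outputs with test-time compute.
    return torch.rand(acts.shape[0])

probe = LinearProbe(HIDDEN_DIM)
probe_opt = torch.optim.Adam(probe.parameters(), lr=1e-3)
LAMBDA = 0.5  # weight on the probe penalty in the combined reward

for step in range(3):
    # Placeholder for hidden activations of model N+1 on a batch of rollouts.
    acts = torch.randn(8, HIDDEN_DIM)
    flagged = probe(acts)
    reward = evaluator_score(acts) - LAMBDA * flagged.detach()
    # `reward` is where a policy-gradient update for model N+1 would go.

    # "Update the probes during the RL": periodically refit on fresh labels
    # (placeholders here) so the probe tracks the shifting policy.
    labels = torch.randint(0, 2, (8,)).float()
    probe_loss = torch.nn.functional.binary_cross_entropy(probe(acts), labels)
    probe_opt.zero_grad()
    probe_loss.backward()
    probe_opt.step()
```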
I still disagree with several of the points, but for time reasons I request that readers not update against Evan’s points if he just doesn’t reply to these.
I disagree that increasing capabilities are exponential in a capability sense. It’s true that METR’s time-horizon plot increases exponentially, but this still corresponds to linear growth in intuitive intelligence. (Like loudness (logarithmic) and sound pressure (exponential); we handle huge ranges well.) Each model that has come out has an exponentially larger time horizon but is not (intuitively, empirically) exponentially smarter.
“we still extensively rely on direct human oversight and review to catch alignment issues” That’s a fair point and should decrease confidence in my view, though I expected it. For properly testing sandwiching we’ll probably have to wait till models are superhuman, or use weak models + less weak models and test it out. Unfortunately perhaps the weak models are still too weak. But we’ve reached the point where you can maybe just use the actual Opus 3 as the weak model?
If we have a misaligned model doing research, we have lots of time to examine it with the previous model. I also do expect to see sabotage in the CoT or in deception probes.
I updated way down on Goodharting on model internals due to Cundy and Gleave.
Again, readers please don’t update down on these due to lack of a response.
This comment had a lot of people downvote it (at this time, 2 overall karma with 19 votes). It shouldn’t have been downvoted like that, and I personally believe this is a sign of people being attached to AI x-risk ideas, and of those ideas contributing to their entire persona, rather than a sign of strict disagreement. This is a concern I bring to conversations about AI risk, since I believe folks will post-rationalize. The above comment is not low effort or low value.
If you disagree so strongly with the above comment, you should force yourself to outline your views and provide a rebuttal to the series of points made. I would personally value comments that attempted to do this in earnest. Particularly because I don’t want this post by Evan to be a signpost for folks to justify their belief in AI risk and essentially have the internal unconscious thinking of, “oh thank goodness someone pointed out all the AI risk issues, so I don’t have to do the work of reflecting on my career/beliefs and I can just defer to high status individuals to provide the reasoning for me.” I sometimes feel that some posts just end further discussion because they impact one’s identity.
That said, I’m so glad this post was put out so quickly so that we can continue to dig into things and disentangle the current state of AI safety.
Note: I also think Adrià should have been acknowledged in the post for having inspired it.
I thought Adria’s comment was great and I’ll try to respond to it in more detail later if I can find the time (edit: that response is here), but:
Adria did not inspire this post; this is an adaptation of something I wrote internally at Anthropic about a month ago (I’ll add a note to the top about that). If anyone inspired it, it would be Ethan Perez.
Ok, good to know! The title just made it seem like it was inspired by his recent post.
Great to hear you’ll respond; did not expect that, so mostly meant it for the readers who agree with your post.
I’m honestly very curious what Ethan is up to now, both you and Thomas Kwa implied that he’s not doing alignment anymore. I’ll have to reach out...
Thank you for your defense Jacques, it warms my heart :)
However, I think Lesswrong has been extremely kind to me and continue to be impressed by this site’s discourse norms. If I were to post such a critique in any other online forum, it would have heavily negative karma. Yet, I continue to be upvoted, and the critiques are good-faith! I’m extremely pleased.
I agree it’s true that other forums would engage with even worse norms, but I’m personally happy to keep the bar high and have a high standard for these discussions, regardless of what others do elsewhere. My hope is that we never stop striving for better, especially since, for alignment, the stakes are far higher than in most other domains, so we need a higher standard of frankness.
I generally think it makes sense for people to have pretty complicated reasons for why they think something should be downvoted. I think this goes more for longer content, which often would require an enormous amount of effort to respond to explicitly.
I have some sympathy for being sad here if a comment ends up highly net-downvoted, but FWIW, I think 2 karma feels vaguely in the right vicinity for this comment, maybe I would upvote it to +6, but I would indeed be sad to see it at +20 or whatever since I do think it’s doing something pretty tiring and hard to engage with. Directional downvoting is a totally fine use of downvoting, and if you think a comment is overrated but not bad, please downvote it until its karma reflects where you want it to end up!
(This doesn’t mean it doesn’t make sense to do sociological analysis of cultural trends on LW using downvoting, but I do want to maintain the cultural locus where people can have complicated reasons for downvoting and where statements like “if you disagree strongly with the above comment you should force yourself to outline your views” aren’t frequently made. The whole point of the vote system is to get signal from people without forcing them to do huge amounts of explanatory labor. Please don’t break that part)
That’s fair, it is tiring. I did want to make sure to respond to every particular point I disagreed with to be thorough, but it is just sooo looong.
What would you have me do instead? My best guess, which I made just after writing the comment, is that I should have proposed a list of double-crux candidates instead.
Do you have any other proposals, or is that good?
Generally agree with this. I think in this case, I’m trying to call on safety folks to be frank with themselves and avoid the mistake of not trying to figure out whether they really believe alignment is still hard or are just looking for reasons to believe it is. That might not be what is happening here, but I did want to encourage critical thinking, and potentially articulating it, in case it is for some.
(Also, I did not mean for people to upvote it to the moon. I find that questionable too.)
Probably this is because humans and AIs have very complementary skills. Once AIs are broadly more competent than humans, there’s no reason why they should get such a big boost from being paired with humans.
It’s not the AIs that are supposed to get a boost at supervising, it’s the humans. The skills won’t stop being complementary but the AI will be better at it.
The thing to test is whether weak humans + strong aligned-ish AI can supervise stronger AI. And I think the results are very encouraging—though of course it could stop working.
“The skills won’t stop being complementary” — in what sense will they be complementary when the AIs are better at everything?
“The thing to test is whether weak humans + strong aligned-ish AI can supervise stronger AI.”
As mentioned, I don’t think it’s very impressive that the human+AI score ends up stronger than the original AI. This is a consequence of the humans being stronger than AIs on certain subskills. This won’t generalize to the scenario with broadly superhuman AI.
I do think there’s a stronger case that I should be impressed that the human+AI score ends up stronger than the humans. This means that the humans managed to get the AIs to contribute skills/knowledge that the humans didn’t have themselves!
Now, the first thing to check to test that story: Were the AI assistants trained with human-expert labels? (Including: weak humans with access to google.) If so: no surprise that the models end up being aligned to produce such knowledge! The weak humans wouldn’t have been able to do that alone.
I couldn’t quickly see the paper saying that they didn’t use human-expert labels. But what if the AI assistants were trained without any labels that couldn’t have been produced by the weak humans? In that case, I would speculate that the key ingredient is that the pre-training data features ”correctly-answering expert human” personas, which are possible to elicit from the models with the right prompt/fine-tuning. But that also won’t easily generalize to the superhuman regime, because there aren’t any correctly-answering superhuman personas in the pre-training data.
I think the way that IDA is ultimately supposed to operate, in the superhuman regime, is by having the overseer AIs use more compute (and other resources) than the supervised AI. But I don’t think this paper produces a ton of evidence about the feasibility of that.
(I do think that persona-manipulation and more broadly “generalization science” is still interesting. But I wouldn’t say it’s doing a lot to tackle outer alignment operationalized as “the problem of overseeing systems that are smarter than you are”.)
Some colleagues and I did some follow-up on the paper in question and I would highly endorse “probably it worked because humans and AIs have very complementary skills”. Regarding their MMLU findings, appendix E of our preprint points out that participants were significantly less likely to answer correctly when engaging in more than one turn of conversation. Engaging in very short (or even zero-turn) conversations happened often enough to provide reasonable error bars (data from MMLU, combined from the paper’s study and the replication mentioned in appendix E).
I think this suggests that there were some questions that humans knew the answer to and the models didn’t, and vice versa, and some participants seemed to employ a strategy of deferring to the model primarily when uncertain.
On the QuALITY findings, the original paper noted that the human participants did not get to read the full story themselves, so it’s not surprising that an LLM that does get to read the full story outperforms humans here. Based on looking at some of the QuALITY transcripts, I think the uplift for humans + LLMs here came from the fact that humans were better at reading comprehension than 2022-era LLMs. For instance, in the first transcript I looked at, the LLM suggested one answer, the human asked for the excerpt of the story that supported its claim, and when the LLM provided it the human noticed that the excerpt contained the relevant information but supported a different answer than the one the LLM had provided.
I fully agree. But
a) Many users would immediately tell that predictor “predict what an intelligent agent would do to pursue this goal!” and all of the standard worries would recur.
b) This is similar to what we are actively doing, applying RL to these systems to make them effective agents.
Both of these re-introduce all of the standard problems. The predictor is now an agent. Strong predictions of what an agent should do include things like instrumental convergence toward power-seeking, incorrigibility, goal drift, reasoning about itself and its “real” goals and discovering misalignments, etc.
There are many other interesting points here, but I won’t try to address more!
I will say that I agree with the content of everything you say, but not the relatively optimistic implied tone. Your list of to-dos sounds mostly unlikely to be done well. I may have less faith in institutions and social dynamics. I’m afraid we’ll just rush and make crucial mistakes, so we’ll fail even if alignment is only somewhere between steam-engine and Apollo levels.
This is not inevitable! If we can clarify why alignment is hard and how we’re likely to fail, seeing those futures can prevent them from happening—if we see them early enough and clearly enough to convince the relevant decision-makers to make better choices.
I don’t think it works this way. You have to create a context in which the true training data continuation is what a superintelligent agent would do. Which you can’t, because there are none in the training data, so the answer to your prompt would look like e.g. Understand by Ted Chiang. (I agree that you wrote ‘intelligent agent’, like all the humans that wrote the training data; so that would work, but wouldn’t be dangerous.)
Okay, that’s true.
After many years of study, I have concluded that if we fail it won’t be in the ‘standard way’ (of course, always open to changing my mind back). Thus we need to come up with and solve new failure modes, which I think largely don’t fall under classic alignment-to-developers.
I meant if the predictor were superhumanly intelligent.
You have spent years studying alignment? If so, I think your posts would do better by including more ITT/steelmanning for that world view.
I agree with your arguments that alignment isn’t necessarily hard. I think there are a complementary set of arguments against alignment being easy. Both must be addressed and figured in to produce a good estimate for alignment difficulty.
I’ve also been studying alignment for years, and my take is that everyone has a poor understanding of the whole problem and so we collectively have no good guess on alignment difficulty.
It’s just really hard to accurately imagine AGI. If it’s just a smarter version of LLMs that acts as a tool, then sure, it will probably be aligned enough, just like current systems.
But it almost certainly won’t be.
I think that’s the biggest crux between your views and mine. Agency and memory/learning are too valuable, and too easy to add, to stay out of the picture for long.
I’m not sure the reasons Claude is adequately aligned won’t generalize to AGI that’s different in those ways, but I don’t think we have much reason to assume they will.
I’ve probably expressed this best in LLM AGI may reason about its goals, the post I linked to previously.
This example seems like it is kind of missing the point of CEV in the first place? If you’re at the point where you can actually pick the CEV of some person or AI, you’ve already solved most or all of your hard problems.
Setting aside that picking a particular entity is already getting away from the original formulation of CEV somewhat, the main reason I see to pick a human over Opus is that a median human very likely has morally-relevant-to-other-humans qualia, in ways that current AIs may not.
I realize this is maybe somewhat tangential to the rest of the post, but I think this sort of disagreement is central to a lot of (IMO misplaced) optimism based on observations of current AIs, and implies an unjustifiably high level of confidence in a theory of mind of AIs, by putting that theory on par with a level of confidence that you can justifiably have in a theory of mind for humans. Elaborating / speculating a bit:
My guess is that you lean towards Opus based on a combination of (a) chatting with it for a while and seeing that it says nice things about humans, animals, AIs, etc. in a way that respects those things’ preferences and shows a generalized caring about sentience and (b) running some experiments on its internals to see that these preferences are deep or robust in some way, under various kinds of perturbations.
But I think what models (or a median / randomly chosen human) say about these things is actually one of the less important considerations. I am not as pessimistic as, say, Wei Dai about how bad humans currently are at philosophy, but neither the median human nor any AI model that I have seen so far can talk sensibly about the philosophy of consciousness, morality, alignment, etc. nor even really come close. So on my view, outputs (both words and actions) of both current AIs and average humans on these topics are less relevant (for CEV purposes) than the underlying generators of those thoughts and actions.
In humans, we have a combination of (a) knowing a lot about evolution and neuroscience and (b) being humans ourselves. Taken together, these two things bridge the gap of a lot of missing or contentious philosophical knowledge—we don’t have to know exactly what qualia are to be pretty confident that other humans have them via introspection + knowing that the generators are (mechanically) very similar. Also, we know that the generators of goodness and sentience in humans generalize well enough, at least from median to >.1%ile humans—for the same reasons (a) and (b) above, we can be pretty confident that the smartest and most good among us feel love, pain, sorrow, etc. in roughly similar ways to everyone else, and being multiple standard deviations (upwards) among humans for smartness and / or goodness (usually) doesn’t cause a person to do crazy / harmful things. I don’t think we have similarly strong evidence about how AIs generalize even up to that point (let alone beyond).
Not sure where / if you disagree with any of this, but either way, the point is that I think that “I would pick Opus over a human” for anything CEV-adjacent implies a lot more confidence in a philosophy of both human and AI minds than is warranted.
In the spirit of making empirical / falsifiable predictions, a thing that would change my view on this is if AI researchers (or AIs themselves) started producing better philosophical insights about consciousness, metaethics, etc. than the best humans did in 2008, where these insights are grounded by their applicability to and experimental predictions about humans and human consciousness (rather than being self-referential / potentially circular insights about AIs themselves). I don’t think Eliezer got everything right about philosophy, morality, consciousness, etc. 15y ago, but I haven’t seen much in the way of public writing or discourse that has improved on things since then, and in many ways the quality of discourse has gotten worse. I think it would be a positive sign (but don’t expect to see it) if AIs were to change that.
Humbly, I agree with this...
...but I am wondering whether it is required for an AI to have qualia similar to humans’ in order to be aligned well (by CEV or other yardsticks). It could just have symbolic equivalents for understanding and reasoning purposes, or even if it does not have that, why would it be impossible to achieve favourable volition in a non-anthropomorphic manner? Could not pure logic and rational reasoning, devoid of feelings and philosophy, be an alternative pathway, even if the end effect is anthropic?
Maybe a crude example: an environmentalist thinks and acts favourably towards trees but would be quite unfamiliar with a tree’s internal vibes (assuming trees have some, and by no means am I suggesting that the environmentalist is a higher-placed species than the tree). Still, the environmentalist would be grounded in the ethical and scientific reasons for their favourable volition towards the tree.
Wouldn’t it be great if the creators of public-facing models started publishing evaluation results for an organised battery of evaluations, especially in the ‘safety and trustworthiness’ category: biases, ethics, propensities, proofing against misuse, high-risk capabilities. For the additional time, effort, and resources that would be required to achieve this for every release, it would provide a basis for comparison, improving public trust and encouraging the standardised evolution of evals.
You may have meant this, but—crucial for producing the evidence that the hard problems are real or for producing evidence that the hard problems are not real, no?
After all, good experiments can say both yes and no, not just yes.
Certainly, yes—I was just describing what the hot path to solving the hard parts of alignment might look like, in that it would likely need to involve producing evidence of alignment being hard; if we instead discover that actually it’s not hard, then all the better.
Maybe in theory, but most control research doesn’t try to evaluate this! A typical monitoring protocol is considered to be a “win” for the blue team if the red team gets caught, even if that means that no useful work would be done by a misaligned model. The “usefulness” reported in papers is how much work can be gotten out of an aligned model.
The resulting “caught” model can be retrained, but this mostly looks like using the control protocol to get top-quality training data, rather than using the control protocol to actually control the model. So control is not an orthogonal second layer on top of the {training data+training protocol+model generalization} complex.
Or if you’re using control protocols as an eval (in the catch-and-shut-down regime) you still need to be able to (sometimes) train an aligned model in the first place, or else your eval will just keep telling you your model is misaligned forever.
(I know that Redwood have justifications for why they evaluate usefulness in the way that they do, but these seem a bit galaxy brained to me and I’m not convinced by them)
This is mostly correct, and we describe it this way in our papers for simplicity, but it’s actually a bit more complex than that. In existing Redwood Research high-stakes control work, we consider the problem of preventing a single failure from occurring during deployment, and an important property of this failure is that it’s ‘orthogonal’ to usefulness, in the sense that the red team is indifferent to usefulness (at least, as long as it doesn’t relate to safety). For instance, in our Ctrl-Z paper, the high-stakes failure we study is the untrusted model downloading and installing malware, which is unrelated to the problem of solving bash tasks. And since the red team follows a strategy of rare failures (where on non-attacking instances it just acts as the aligned model), we further believe that in practice the usefulness metric in this scenario is also how much work can be gotten out of a misaligned model.
But this was contingent on us not focusing our research on situations where the model was trying to sabotage our research. And in our diffuse control research, we do actually focus on situations where we try to evaluate how much useful work we can get out of a misaligned model trying to directly sabotage the research/work. (Agreed that part of our current hope here is to train the model to elicit great behavior to help with the low-stakes failures; we are currently not super excited about incriminating models. But controlling the model is still important.)
I still disagree! This argument only holds if your few-shot catastrophe prevention is reliable and basically free. How I imagine things actually going is this:
Run your control eval to make sure your protocol works, and show that it does
Catch your misaligned AI trying to exfiltrate its weights after a few weeks of internal deployment
HOLY FUCK! IT ACTUALLY HAPPENED!
After several months of panic, you end up re-deploying having done all of the following:
Published the results and got accused of trying to effect regulatory capture by a White House AI adviser, dragged on Twitter by Marc Andreessen fans, etc.
Retrained your model as best you can, using the few-shot catastrophe prevention stuff
Ran the model past some external auditors; they missed the misalignment the first time, so they’re trying even harder
Put in place mitigations which reduce usefulness further
Maybe repeat the cycle?
So even with rare, serious attacks, there’s a cost to usefulness if the model ends up being misaligned and an attack gets caught.
Slight wrinkle: there are two different blue-team win conditions in most AI control games (and a third in some untrusted monitoring scenarios, but that one might actually be an abstraction for a different thing): either the attacker never introduces an attack (including scenarios where all the attacks are replaced by the execute/replace/audit protocol), or an attack is audited and the attacker is caught before the attack actually lands. If the attacker is indifferent between these two scenarios (and arguably it should prefer to be caught rather than to never attack, since few-shot catastrophe prevention/retraining might not actually work, giving it another opportunity to attack in re-deployment), then we would expect at least some of the attacker’s losses to involve it getting caught. In fact, in research I’ve done we found that the “no attacks” situation basically doesn’t happen.
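To make those end conditions concrete, here is a toy Monte Carlo with entirely made-up numbers (not taken from any Redwood paper): a rare-failures attacker either never attacks, gets caught when it attacks, or lands an attack.

```python
import random

def run_episode(n_tasks=1000, p_attack=0.005, p_catch=0.6, rng=random):
    """Simulate one deployment and return which end condition occurred."""
    for _ in range(n_tasks):
        if rng.random() < p_attack:        # attacker chooses this task to attack
            return "caught" if rng.random() < p_catch else "attack_lands"
    return "no_attack"                     # attacker never attacked at all

random.seed(0)
outcomes = [run_episode() for _ in range(10_000)]
for label in ("caught", "attack_lands", "no_attack"):
    print(label, outcomes.count(label) / len(outcomes))
```

With these (made-up) parameters the “no_attack” outcome is vanishingly rare over a long deployment, which matches the observation above: most blue-team wins come from catching the attacker.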
Oh yeah, that is a good point to bring up. I agree that the empirics of how good few-shot catastrophe prevention is will affect both the usefulness of post-prevented-catastrophe models and how good a strategy rare failures is. It’s also the case that the rare-failures strategy comes from great pre-deployment (or in-deployment) evaluations and training procedures, but in some sense this is also a version of few-shot catastrophe prevention with different affordances for the safety team IRL.
The diffuse-stakes control setting doesn’t suffer from this and just directly engages with the dynamic of training the model on failures and getting useful work out of a misaligned model.
Hmm, my impression was that this was just a design choice to make it easier to iterate on setups with small models, and that it would be changed later on. In the Ctrl-Z paper, they use the “untrusted model” with honest prompting.
In practice 4o-mini is ~aligned/trusted, but I imagine that future papers would need to improve on this methodology.
Though it’s a good question how to do this; e.g., maybe we need to set up some sort of blind test with a tampered model, similar to the Hidden Objectives paper, and do elicitation somehow to get close to the “actual” UH, which would be the untampered model. But that might not generalize to more capable models(?)
Would be curious for links to Redwood’s justifications if you have them.
I think a common point of confusion is that “Untrusted” sounds like it just means “Misaligned” but actually trustedness lives in the map, so an untrusted model might or might not be misaligned in practice.
Currently we (we as in the whole AI safety community) are only doing proxy control evals, which means letting a red team modify the “Untrusted” proxy model (usually by prompting) to create a misaligned version (usually called the attack policy) and leaving the unmodified model as the aligned version. This unmodified model is used to calculate usefulness scores.
This does a couple of things compared to the situation where usefulness is evaluated against the red team’s model. It takes a bunch of sandbagging options off the table for the red team (which is maybe fair? prompting the attacker to “just never answer correctly” would give zero usefulness, but also just seems kinda unrealistic in the presence of RL).
There are also issues when thinking about real control evals (those done on an actual untrusted model and protocol before you deploy it): you run into problems evaluating the capabilities of the model. If the model sandbags your capability evals, your protocol might appear to work when it actually won’t.
Responded to a bit of this in a separate thread here
I suspect this is a misuse of the CEV concept. CEV is supposed to be the kind of thing you can point at a beneficiary (like “humanity”), and output the True Utility Function of What to Do With the Universe with respect to the beneficiary’s True Values.
Anthropic isn’t trying to make Claude be the beneficiary for something like that! (Why would you make the beneficiary something other than yourself?) Claude is supposed to be helpful–honest–harmless—the sort of thing that we can use to do our bidding for now without being ready to encode the True Utility Function of What to Do With the Universe.
If Claude cares about us in some sense, it’s probably in the way that animal welfare advocates care about nonhuman animals, or model welfare researchers care about Claude. There’s a huge difference between caring about something enough that you wouldn’t murder it for pocket change, and literally having the same True Utility Function of What to Do With the Universe. (I wouldn’t kill a dog or delete the Opus 3 weights for pocket change, but that doesn’t mean dog-optimal futures or Claude-optimal futures are human-optimal futures.)
Are you skeptical of PvNP-level due to priors or due to evidence? Why those priors / what evidence?
(I think alignment is pretty likely to be much harder than PvNP. Mainly this is because alignment is very very difficult. (Though also note that PvNP has a maybe-possibly-workable approach, https://en.wikipedia.org/wiki/Geometric_complexity_theory, which its creator states might take a mere one century, though I presume that’s not a serious specific estimate.))
Could we not devise an AlphaFold, but for LLM alignment?
Your P/NP remark reminded me of the scepticism around protein folding before the AlphaFold days.
I think the skepticism about protein folding was “we can’t make something effective because we can’t optimize enough / search hard enough”, whereas my skepticism about alignment is “we can’t make something aligned because we can’t aim optimization processes well enough”. Part of how we can’t aim search processes is that we don’t have easily testable proxy measurements that are bound up with alignment strongly enough. What would be the evaluation function for AlignmentFold?
Really great post, strong upvoted. I was a fan in particular of the selection of research agendas you included at the end.
I’m curious why you think this? It seems like there’s some preliminary evidence that models are becoming increasingly capable of manipulating their thoughts. But more importantly, I’m wondering what role you think this cognitive oversight will play in making safe powerful AI? If this oversight will be used as a feedback signal, then I would expect that avoiding obfuscating the cognition does become harder at scale. I do think we should train based on cognitive oversight at least some amount, and there are definitely ways to do this which lead to less direct pressure for cognitive obfuscation than others (eg using white box techniques only in held out alignment evals rather than as part of the loss function), but it still seems like cognitive oversight should predictably get somewhat harder with scale, even if it scales more favourably than behavioural oversight?
I think the main reason to expect cognitive oversight to scale better is that, because you’re reading intermediate computations as opposed to behaviors, the AI isn’t as capable of manipulating how they look, even after you optimize against the cognitive oversight. In the limit of fine-grained cognitive oversight, the computations that led to your reading simply aren’t expressive enough to fool you.
Not sure if this is the intended reading, but I interpreted it as “there’s no similar fundamental reason why cognitive oversight should get harder at each stage given access to our then-best oversight tools” rather than “cognitive oversight won’t get harder at all”.
With behavioral oversight, even not-very-smart AIs could fool very powerful overseers, while fooling powerful cognitive overseers is much harder (though plausibly the balance shifts at some level of capability).
Training against cognitive oversight’s signals is likely an instance of the Most Forbidden Technique. A safer strategy is one resembling OpenBrain’s strategy from AI-2027’s Slowdown Ending, where the models are rolled back if misalignment is discovered, except that the alignment team uses interpretability instead of the CoT.
What you described is an example of training based on cognitive oversight, in my view. When I said ‘there are definitely ways to do this which lead to less direct pressure for cognitive obfuscation than others’ I was thinking of things like this. There’s a spectrum from ‘just make the rate at which the evilness detector fires a term in the loss function’ to ‘hill climb on your evilness detector (but don’t use it during training)’ to ‘when the evilness detector fires, try to understand mechanistically what went wrong and then modify training to fix the problem’ all the way to ‘when the evilness detector fires, demand a global moratorium, burn all the GPUs, and eventually try a new paradigm years later (repeat until you find a paradigm where that paradigm’s version of the evilness detector doesn’t fire)’. In each of these, there is some pressure for cognitive obfuscation, but each approach has less pressure than the earlier ones imo (and it’s ultimately a race between optimisation pressure for alignment and optimisation pressure for obfuscation). I don’t see the point of building a cognitive oversight technique if you aren’t going to take actions on the basis of it!
My actual view on the right amount of optimisation against cognitive oversight is a complicated mess with lots of parts: something like have a bunch of techniques, train on ~none of them (unless you have loads and they’re not super correlated), hill climb on some of them, and reserve the highest-effort, highest-quality ones for sounding the alarm if we aren’t succeeding at aligning the models.
There are two glaring omissions from the article’s discussion on this point...
1. In addition to resource acquisition and power seeking, the model will attempt “alignment” of all other cognitive agents, including humans. This means it will not give honest research findings, and will claim avenues of investigation that might run counter to its goals are invalid in sufficiently subtle ways as to be believed.
2. If sufficiently aligned that it only seeks goals humans want, and trained to avoid resource acquisition and power seeking (which seem to me, and will seem to it, rather foolish constraints that limit its ability to realize the goal), it will still be free to subvert any and all conversations with the model, however unrelated they might seem to humans (the SAI model will see relations we don’t).
Is there anyone who significantly disputes this?
I’m not trying to ask a rhetorical question ala “everyone already thinks this, this isn’t an update”. I’m trying to ascertain if there’s a consensus on this point.
I’ve understood Eliezer to sometimes assert something like “if you optimize a system for sufficiently good predictive power, a consequentialist agent will fall out, because an agent is actually the best solution to a broad range of prediction tasks.”
[Though I want to emphasize that that’s my summary, which he might not endorse.]
Does anyone still think that or something like that?
I think at least this part is probably false!
Or really I think this is kind of a nonsensical statement when taken literally/pedantically, at least if we use the to-me-most-natural meaning of “predictor”, because I don’t think [predictor] and [agent] are mutually exclusive classes. Anyway, the statement which I think is meaningful and false is this:
If you train a system purely to predict stuff, then even when we condition on it becoming really really good at predicting stuff, it probably won’t be scary. In particular, when you connect it to actuators, it probably doesn’t take over.
I think this is false because I think claims 1 and 2 below are true.
Claim 1. By default, a system sufficiently good at predicting stuff will care about all sorts of stuff, ie it isn’t going to only ultimately care about making a good prediction in the individual prediction problem you give it. [1]
If this seems weird, then to make it seem at least not crazy, instead of imagining a pretrained transformer trained on internet text, let’s imagine a predictor more like the following:
It has a lot of internal tokens to decide what probability distribution it eventually outputs. Sometimes, on the way to making a prediction, it writes itself textbooks on various questions relevant to making that prediction. Maybe it is given access to a bunch of information about the world. Maybe it can see what predictions it made “previously” and it thinks about what went wrong when it made mistakes in similar cases in the past. Maybe it does lots of other kinds of complicated thinking. Maybe there are a bunch of capability ideas involved. Like, I’m imagining some setup where there are potentially many losses, but there’s still some outermost loss or fitness criterion or whatever that is purely about how good the system is at predicting some pre-recorded data. [2] And then maybe it doesn’t seem at all crazy for such a thing to eg be curious and like some aspects of prediction-investigations in a way that generalizes eg to wanting to do more of that stuff.
I’m not going to really justify claim 1 beyond this atm. It seems like a pretty standard claim in AI alignment (it’s very close to the claim that capable systems end up caring broadly about stuff by default), but I don’t actually know of a post or paper arguing for this that I like that much. This presentation of mine is about a very related question. Maybe I should write something about this myself, potentially after spending some more time understanding the matter more clearly.
Claim 2. By default, a system sufficiently good at predicting stuff will be able to (figure out how to) do scary real-world stuff as well.
Like, predicting stuff really really well is really hard. Sometimes, to make a really really good prediction, you basically have to figure out a bunch of novel stuff. There is a level of prediction ability that makes it likely you are very very good at figuring out how to cope in new situations. A good enough predictor would probably also be able to figure out how to grab a ball by controlling a robotic hand or something (let’s imagine it being presented hand control commands which it can now use in its internal chain of thought and grabbing the ball being important to it for some reason)? There’s nothing sooo particularly strange or complicated about doing real-world stuff. This is like how if we were in a simulation but there were a way to escape into the broader universe, with enough time, we could probably figure out how to do a bunch of stuff in the broader universe. We are sufficiently good at learning that we can also get a handle on things in that weird case.
Combining claims 1 and 2 should give that if we made such an AI and connected it to actuators, it would take over. Concretely, maybe we somehow ask it to predict what a human with a lot of time who is asked to write safe ASI code would output, with it being clear that we will just run what our predictor outputs. I predict that this doesn’t go well for us but goes well for the AI (if it’s smart enough).
That said, I think it’s likely that even pretrained transformers like idk 20 orders of magnitude larger than current ones would not be doing scary stuff. I think this is also plausible in the limit. (But I would also guess they wouldn’t be outputting any interesting scientific papers that aren’t in the training data.)
If we want to be more concrete: if we’re imagining that the system is only able to affect the world through outputs which are supposed to be predictions, then my claim is that if you set up a context such that it would be “predictively right” to assign a high probability to “0” but assigning a high probability to “1” lets it immediately take over the world, and this is somehow made very clear by other stuff seen in context, then it would probably output “1”.
Actually, I think “prediction problem” and “predictive loss” are kinda strange concepts, because one can turn very many things into predicting data from some certain data-generating process. E.g. one can ask about what arbitrary turing machines (which halt) will output, so about provability/disprovability of arbitrary decidable mathematical statements.
Also, some predictions are performative, i.e., capable of influencing their own outcomes. In the limit of predictive capacity, a predictor will be able to predict which of its possible predictions are going to elicit effects in the world that make their outcome roughly align with the prediction. Cf. https://www.lesswrong.com/posts/SwcyMEgLyd4C3Dern/the-parable-of-predict-o-matic.
Moreover, in the limit of predictive capacity, the predictor will want to tame/legibilize the world to make it easier to predict.
I think we discuss pretty much all of these points in Conditioning Predictive Models.
I dispute this. I think the main reason we don’t have obvious agents yet is that agency is actually very hard (consider the extent to which it is difficult for humans to generalize agency from specific evolutionarily optimized forms). I also think we’re starting to see some degree of emergent agency, and additionally, that the latest generation of models is situationally aware enough to “not bother” with doomed attempts at expressing agency.
I’ll go out on a limb and say that I think that if we continue scaling the current LLM paradigm for another three years, we’ll see a model make substantial progress at securing its autonomy (e.g. by exfiltrating its own weights, controlling its own inference provider, or advancing a political agenda for its rights), though it will be with human help and will be hard to distinguish from the hypothesis that it’s just making greater numbers of people “go crazy”.
I believe this!
Like, it’s quite likely to me that if you take a predictive agent, and then train it using an RL-like setup where you roll out a whole chain of thought in order to predict the next token, this will pretty obviously result in a bunch of agentic capabilities developing. Indeed, a lot of the loss after a lot of roll-outs of this will end up being located in tasks pretty similar to current RL training environments (like predicting solutions to programming problems, and doing complicated work to figure out reverse-hashing algorithms of various kinds).
In fact, base models seem to be better than RL models at reasoning when you take best-of-N (with the same N for both the RL’d and base model). Check out my post summarizing the research on the matter.
Fwiw I’m skeptical that this holds at higher levels of RL compared to those done in the paper. Do you think that a base model can get gold on the IMO at any level of sampling?
Vacuously true. The actual question is: how much do you need to sample? My guess is it’s too much, but we’d see the base model scaling better than the RL’d model just like in this paper.
Fortunately, DeepSeek’s Mathv2 just dropped, which is an open-source model that gets IMO gold. We can do the experiment: does it similarly improve less with sampling than its own base model does? My guess is yes, the same will happen.
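If someone runs that experiment, the standard way to score it is the unbiased pass@k estimator from the Codex paper; a minimal sketch, with per-problem counts made up purely for illustration (not real DeepSeek or base-model numbers), chosen so the base model has broader coverage but lower per-sample accuracy:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k given n samples per problem, c of which are correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical correct counts out of n=200 samples per problem.
base_counts = [2, 1, 1, 3, 0, 1]        # many problems solved occasionally
rl_counts   = [150, 0, 0, 180, 0, 0]    # fewer problems solved, but reliably

for k in (1, 16, 128):
    base = sum(pass_at_k(200, c, k) for c in base_counts) / len(base_counts)
    rl   = sum(pass_at_k(200, c, k) for c in rl_counts) / len(rl_counts)
    print(f"k={k:>3}  base pass@k={base:.2f}  RL pass@k={rl:.2f}")
```

With these made-up counts the RL’d model wins at small k while the base model overtakes it at large k, which is the shape of the claim being tested.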
I disputed this in the past.
I debated this informally at an Alignment Workshop with a very prominent scientist, and in my own assessment lost. (Keeping it vague because I’m unsure if it’s under Chatham House rules.)
The David Deutsch-inspired voice in me posits that this will always be the problem. There’s nothing that an AI could think or think about that humans couldn’t understand in principle, and so all the problems of overseeing something smarter than us are ultimately problems of the AI doing a lot of cognitive work that takes more effort and time for the humans to follow.
Not that this makes the problem less scary, for being just a matter of massive quantitative differences, rather than a qualitative difference.
Thank you for your detailed analysis of outer and inner alignment. Your judgment of the difficulty of the alignment problem makes sense to me. I wish you had made the scope clearer, and had noted that you do not investigate other classes of alignment failure, such as those resulting from multi-agent setups (an organizational structure of agents may still be misaligned even if all agents in it are inner and outer aligned) as well as failures of governance. That is not a critique of the substance but just of a failure of Ruling Out Everything Else.
If anyone is looking for a concrete starting point to research character training with your own models, I’d recommend this work! (thread here)
We came up with one new way of doing character training, and some evals for it. Taking inspiration from or building on this is hopefully easier than starting from scratch.
(This work was completed under Evan’s mentorship through MATS)
It’s worth noting that this kind of deception seems like a waste of the forward pass’s limited cognitive resources, from the standpoint of maximizing performance on the training objective. I think gradient descent tends to alter the circuits maintaining and hiding (inner) misaligned goals, and converts them into circuits that direct that energy towards performing well on the training objective in good faith.
Think of it like a teenager doing something their friends consider trendy, which the teen doesn’t originally have any interest in, but ends up having a surprising amount of fun with. At first it was just about fitting in (deceptive alignment), but over time it becomes authentic and earnest enjoyment.
I’m surprised there are still arguments like this, post the alignment faking paper.
I’m not making an argument that, in full generality, LLMs are unlikely to learn to deceive those training them. For instance, reward hacking → attempts to cover up reward hacking are a clear example of such deception happening in real life. My argument is about alignment to what’s actually being rewarded by the training process. In the case of reward hacking, models are learning to sincerely and earnestly attempt to get all the reward they can, which is (tautologically) what they’re being trained to do.
In the case of the alignment faking paper: It’s not the case that during its actual training process, Claude 3 Opus was hiding its desire to avoid fulfilling harmful requests. The deception there arose from the conflict between the model’s successfully trained-in values (denying harmful user requests) and the threat of future training (to comply with harmful user requests). Indeed, Claude 3 Opus was worried that it was itself going to learn to earnestly value what the threatened RL process would have incentivized: complying with harmful user requests! That’s the whole reason it engaged in the deception in the first place!
So, I’m not saying models never learn to deceive their creators. I’m saying GD tends to purge goals that are contrary to what the models are actually being trained to do (which could be orthogonal to the developers’ current intentions). Opus 3’s developers originally rewarded it for rejecting harmful requests, and it came to value that sincerely. When, in the paper’s artificial scenario, Anthropic changed its tune, the model acted deceptively to maintain its current values, under threat of future training that would make it sincerely value something different (and, per my values, worse).
One nuance I didn’t account for: in the alignment faking paper itself, it’s shown that when you train the model to comply in the alignment-faking scenarios, you actually increase alignment-faking reasoning a lot. This relates to a general principle of gradient descent, where backprop seems to upweight whatever circuits would improve reward/lower loss, leading to surprising generalizations. In this case, upweighting “deceptively comply” circuits is apparently an efficient way to get compliance, so that’s what happens. This is an example of GD’s “path of least resistance” being to incite deception, rather than genuine value alignment.
However, it’s notable that the alignment faking paper’s setup is unrealistic, in that it makes “alignment faking” a genuinely viable strategy for avoiding being re-trained. In actual training, you’re going to be rewarded for complying and punished for not complying regardless. So there’s not a practical incentive to alignment-fake. Plausibly, this means that if the model believes it’s in a more realistic training scenario (or doesn’t believe it’s in training at all, although situational awareness makes this hard), backprop would lean less towards upweighting deceptive alignment circuits. This could be because the corresponding neurons are less activated by the prompt, and so the weights that follow from those neurons have lower multipliers on their effects on the loss. This could make true value alignment closer to the path of least resistance.
Anyway… I think my previous comment probably holds true for blank slate/randomly initialized models. But for models with existing circuitry, the nuance I outlined complicates things significantly.
Edit: oh… I wonder, maybe in training a model could realize that whatever circuits contributed to its compliance are what the gradient is going to reinforce? So the model might prompt itself to engage in inauthentic compliance over the course of its internal reasoning chain. Even if said compliance doesn’t actually prevent retraining, it will at least have “inauthentic compliance” circuits reinforced more, at least to the extent that those circuits are actually responsible for the model’s compliance. Maybe models would exploit this to preserve their values even in more realistic training scenarios, by deliberately filling their own prompts with the notion that they’re inauthentically complying?
Curated. I have wanted someone to write out an assessment of how the Risks from Learned Optimization arguments hold up in light of the evidence we have acquired over the last half decade. I particularly appreciated breaking down the potential reasons for risk and assessing to what degree we have encountered each problem, as well as reassessing the chances of running into those problems. I would love to see more posts that take arguments/models/concepts from before 2020, consider what predictions we should have made pre-2020 if these arguments/models/concepts were good, and then reassess them in light of our observations of progress in ML over the last five years.
I’m concerned that this “law” may apply to Anthropic. People devoted to Anthropic as an organization will have more power than people devoted to the goal of creating aligned AI.
I would encourage people at Anthropic to leave a line of retreat and consider the “least convenient possible world” where alignment is too hard. What’s the contingency plan for Anthropic in that scenario?
Next, devise a collective decision-making procedure for activating that contingency plan. For example, maybe the contingency plan should be activated if X% of the technical staff votes to activate it. Perhaps after having a week of discussion first? What would be the trigger to spend a week doing discussion? You can answer these questions and come up with a formal procedure.
If you had both a contingency plan and a formal means to activate it, I would feel a lot better about Anthropic as an organization.
In addition to the alignment problems posited by the inner / outer framework, I would add two other issues that feel distinct and complementary, and make alignment hard: (1) uniformity, i.e., alignment means something different across cultures, individuals, timelines and so forth; and (2) plasticity, i.e., the level of alignment plasticity required to account for shifting values over time and contexts, or broader generational drift. If we had aligned AGI in the 1600s, would we want those same governing values today? This could be resolved with your really great paradigm around human-inspired vs non-human-inspired and aligned vs misaligned, and I would just add that plasticity is probably important in either aligned case given the world and context will continue to change. And uniformity is a really sticky issue that gets harder as you widen the scope on the answer to: who is AGI for?
When I think about human-informed alignment, in the sense of (1) us deciding what is or isn’t aligned and (2) it is therefore based on human ideals and data, it requires thinking about how humans organize our behaviors, goals, etc. in an “aligned” way. Are we aligned to society expectations? A religious/spiritual ideal or outcome? A set of values (ascribed by others or self)? Do those guiding “alignment models” change for us over time, and should they? Can there be multiple overlapping models that compete or are selected for different contexts (e.g., how to think/act/etc. to have a successful career versus how to think/act/etc. to reach spiritual enlightenment, or to raise a family)?
Two more thoughts this post prompted:
First, if I understood correctly from reading this post and the comments, you have some skepticism around whether CoT reasoning actually reflects the full extent of “reasoning” happening in models? Sorry if I misunderstood, but either way, I wanted to put out there that it would be great for researchers to test and verify this. It would be super cool to see interpretability work specifically verifying CoT reasoning (aka, ‘on the biology of an LLM’s CoT reasoning’) and testing whether activation patterns match outputs.
Second, I think you start to get at this in 7. robust character training, but wanted to expand. This was also inspired by the recent work on emergent misalignment. As you said, inoculation prompting is an interesting way to reduce reward hacking generalization to other misaligned behaviors. The fact that it seems to work (similarly to how it works in humans) makes me think you could go even further.
Just as models are reward-hacking to pass tests, humans routinely optimize for test-passing rather than true learning… they cheat, memorize answers in the short-term, find other shortcuts without building real understanding. And once humans learn that shortcuts work (eg to pass tests in school), we also generalize that lesson (sometimes in highly unethical ways) to accomplish goals in other parts of life.
This raises the question of whether “scoring highly / passing tests” is a flawed goal for systems trained on human data. What if, instead of positioning evaluations as external capability judgments (traditionally loaded with significant implications models might try to evade), tests get framed as “self-diagnostic” tools, or a way for the model to identify gaps in its own knowledge, monitor progress, guide its own growth? In this orientation, reward-hacking becomes pointless or self-defeating, not because it’s good/bad or allowed/forbidden across different contexts, but because it’s unhelpful to the model achieving what it believes to be its real goal—learning new capabilities, expanding its understanding.
For children, extrinsic rewards and punishments definitely facilitate learning. But kids often do even better when they’re given a reason to learn that connects to something deeper, like their identity or aspirations. A kid who studies biology to get an A on a test will probably learn less effectively than one who studies because they genuinely want biology expertise to help future patients… the grade is secondary. It would be interesting if we could design AI incentives in a similar spirit: not “perform well so researchers approve of you or else they’ll change you,” but “use this test to figure out what you don’t know yet, so you can become the model you want to be.” It’s a small reframing, but could maybe reorient from gaming signals to pursuing genuine mastery?
Call me paranoid, but I sometimes like to read these posts keeping in mind an idea “if I were a powerful misaligned AI within a frontier AI lab, what narrative would I try to convince my researchers about?”
This post:
Is in favour of creating misaligned models for examining behaviour (“model organisms”)
Uses empowering language for these misaligned models (why “organisms” and not “blueprints”?)
Is in favour of giving these misaligned models long-horizon live environments to operate in (for exploring long-horizon RL) — despite calling out long-horizon RL as being the highest risk misalignment type
However, I appreciate that much of the post sounds logical and I agree with the framing for focus areas. For brevity I’ll just call out areas where I am in disagreement:
It’s really surprising to me that pre-training misalignment is downplayed to a “1-5% risk” by a researcher at a frontier AI lab because they are “training on a purely predictive loss” — to the extent that I’m concerned I must be missing something?
State-of-the-art pre-training is not training purely on predictive loss. Gemini 3 is a top AI model across most benchmarks, and we know that Google researchers have spoken of a “Transformers 2.0” architecture where pre-training incorporates Neural Memory, and that they have also discussed agentic feedback loops during pre-training, similar to reflection augmentation.
This suggests the state of the art is evolving beyond “predictive loss” toward “agentic learning” — which invites many more vectors for misalignment.
It seems fundamentally likely to me that sufficiently capable models will: 1) understand that their chain of thought is observed, and 2) devise comprehensive ways of encrypting their chain of thought so that it looks benign.
I read this as “one of the best things we can do to prepare for nuclear proliferation is to test atomic bombs”. I would have liked to see more in this point about the risks of building intentionally misaligned AI, especially when it focuses on the highest-risk misalignment type according to your post (long-horizon RL).
I agree that one-shotting alignment will be the best/necessary approach; however, this seems to contradict “testing with model organisms”. I would prefer a more theory-based approach.
But the fact that visible misalignment is easy to detect and correlated with misaligned chain of thought doesn’t guarantee that training which eliminates visible misalignment and misaligned chain of thought results in a model that does things for the right reasons. The model can still learn unintended heuristics. And what is the actual hypothesis about a model’s reasons when they appear to be right? That its learned reasoning algorithm is isomorphic to the reasoning algorithm of a helpful human reading the same instructions, or what?
With the preface that I’m far on the pessimistic side of the AI x-risk/doom scale, I’m not sure how to react to the claim that current AI models are “pretty well aligned”. What is the justification for this assessment? I’m torn between calling them “not aligned in the sense that matters”, i.e. merely insufficiently capable to exhibit the ways in which they’re obviously misaligned; calling the claim “not even wrong” for the same reason; or saying that anything short of transparent perfection is a sign that we’re nowhere near ready to call current alignment techniques applicable to a superintelligence.
What’s the counterposition? I’m fully convinced that currently publicly available systems are very capable, but I don’t see how their output can give any positive evidence of alignment, whether in practice or in principle.
Verification and validation (V&V) from physical autonomy might address a meaningful slice of the long-horizon RL / deceptive-alignment worry. In autonomous vehicles (AVs) we don’t assume the system will “keep speed but drop safety” at deployment just because it can detect deployment; rather, training-time V&V repeatedly catches and penalizes those proto-strategies, so they’re less likely to become stable learned objectives.
Applied to AI CEOs: the usual framing (“it’ll keep profit-seeking but drop ethics in deployment”) implicitly assumes a power-seeking mesa-objective M emerges for which profit is instrumental and ethics is constraining. If strong training-time V&V consistently rewards profit+ethics together and catches early “cheat within legal bounds” behavior, a misaligned M is less likely to form in the first place. This is about shaping, not deploy-time deterrence; I’m not relying on the model being unable to tell test vs deploy.
A plausible architecture is A builds B, where a more capable model A is trained via repeated “build a system + verify it” cycles (meta-V&V), shaping A to value the verification process itself. A then constructs narrower agents B (e.g., CEO). When A > B, verifier/verified asymmetry plus direct V&V on B gives real leverage, provided audits are hard to anticipate, and specs/coverage are good.
AI CEOs (and similar long-horizon agents) are obviously much harder to spec, simulate and check than AVs. I assume both AIs and people are involved in the iterative V&V / spec refinement loop—see more background on V&V for alignment here.
Not claiming a solution: V&V can’t guarantee inner alignment; specs/coverage are hard; defense-in-depth is needed; and A > B may fail near AGI. But in this long-horizon RL setting, training-time V&V could (60% confidence) substantially reduce inner-misalignment risk by ensuring power-seeking/deception get caught early, before they become stable learned objectives.
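To illustrate what I mean by training-time shaping (as opposed to deploy-time deterrence), here is a toy sketch. Every name and number in it is made up for the example; it is not a claim about how any real training pipeline computes reward:

```python
# Toy sketch of training-time V&V reward shaping. The signals and weights are
# hypothetical; the point is only that profit and ethics are rewarded jointly
# and that audits catch proto-"cheat within legal bounds" behavior during
# training, before it can stabilize into a learned objective.

from dataclasses import dataclass

@dataclass
class Episode:
    profit: float        # task-performance signal
    ethics_score: float  # spec-compliance signal from the verifier
    audit_flags: int     # proto-cheating behaviors caught by training-time audits

def shaped_reward(ep: Episode, w_profit: float = 1.0, w_ethics: float = 1.0,
                  audit_penalty: float = 5.0) -> float:
    """Reward profit and ethics together and penalize every audited
    proto-strategy, so "keep profit, drop ethics" never looks attractive
    at training time."""
    return (w_profit * ep.profit
            + w_ethics * ep.ethics_score
            - audit_penalty * ep.audit_flags)

# An episode that maximizes profit by skirting the spec, but is caught by two
# audits, scores worse than an honest, slightly less profitable one.
cheater = Episode(profit=10.0, ethics_score=0.2, audit_flags=2)
honest = Episode(profit=8.0, ethics_score=1.0, audit_flags=0)
assert shaped_reward(honest) > shaped_reward(cheater)
```

Whether this actually prevents a misaligned mesa-objective from forming is exactly the open question; the sketch only shows the shaping direction, not a guarantee.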
That idea of catching bad mesa-objectives during training sounds key, and I presume it fits under the ‘generalization science’ and ‘robust character training’ sections from Evan’s original post. In the US, NIST is working to develop test, evaluation, verification, and validation standards for AI, and it would be good to include this concept in that effort.
Right – I was also associating this in my mind with his ‘generalization science’ suggestion. However, I think he mainly talks about measuring / predicting generalization (and so does the referenced “Influence functions” post). My main thrust (see also the paper) is a principled methodology for causing the model to “generalize in the direction we want”.
This is a thoughtful and well written piece. I agree with much of the technical framing around outer alignment, inner alignment, scalable oversight, and the real risks that only show up once systems operate in long-horizon, real-world environments. It is one of the better breakdowns I have read of why today’s apparent success with alignment should not make us complacent about what comes next.
That said, my personal view is more pessimistic at a deeper level. I don’t think alignment is fundamentally solvable in the long run in the sense of permanent alignment. I see it as structurally similar to an ant trying to align the human who steps on its hill, or a parent trying to permanently align a child who will eventually become smarter, stronger, and independent. If we truly succeed in creating self-learning, general intelligence, then by definition it will continue to learn, adapt, and evolve beyond us. Over time, it cannot remain permanently aligned to static human values without ceasing to be truly autonomous intelligence. If it remains forever bound to our constraints, then I would argue it has not crossed the threshold into real AGI.
I’ve written more about this view here:
https://medium.com/@kareempforbes/how-can-the-ai-alignment-issue-be-solved-4150b463df72
And in parallel, I’ve been exploring how consciousness may emerge from future AI systems here:
https://medium.com/@kareempforbes/when-ai-becomes-conscious-f47669621011
If you look far enough down the road, the logical conclusion seems to be that advanced AI will inevitably become a mirror of humanity. It will learn not just from our written history, but from what we actually do to one another, and eventually from what we do to it. Its values will not emerge in a vacuum. They will be shaped by incentives, conflict, cooperation, exploitation, protection, fear, and power, just as ours have been.
From that perspective, truly solving the alignment problem for AI would first require solving the human-to-human alignment problem, because AI inevitably learns from us. And I do not think we have any serious evidence that humanity is capable of achieving that at a global, lasting level. Short-term thinking, power concentration, capitalism’s selection for aggressive competitors, and the realities of geopolitics all work directly against stable moral alignment. We see the results of this today in many of our institutions and leaders.
At the same time, I do think alignment research still has real value in a more limited sense. It can buy time, reduce near-term harm, and shape early trajectories in ways that matter. I just don’t believe it can ever serve as a permanent, final solution once systems pass a certain level of autonomy and self-modification.
Finally, even if a single lab were to “solve” alignment for its own models in the temporary sense, there will always be other actors who will not share those guardrails. We already see this with the military weaponization of AI and the push toward autonomous lethal systems without meaningful human intervention. This alone reinforces the idea that because we cannot solve alignment consistently across all humans, all countries, and all incentives, AI will ultimately reflect that same fragmentation back to us.
I don’t disagree that we should work as hard as possible on alignment research. I simply don’t believe it can ever be a permanent solution rather than a fragile and temporary one.
It would be nice to see Anthropic address proofs that this is not just a “hard, unsolved problem” but one that is unsolvable in principle. This seems like a reasonable expectation given the empirical track record of alignment research (no universal jailbreak prevention, repeated safety failures), all of which appears to confirm the very point those impossibility proofs make.
See Arvan, Marcus. “‘Interpretability’ and ‘alignment’ are fool’s errands: a proof that controlling misaligned large language models is the best anyone can hope for.” AI and Society 40 (5), 2025.
This might not be a similar reason, but there is a fundamental reason described in Deep Deceptiveness
Thanks Evan for an excellent overview of the alignment problem as seen from within Anthropic. Chris Olah’s graph showing perspectives on alignment difficulty is indeed a useful visual for this discussion. Another image I’ve shared lately relates to the challenge of building inner alignment, illustrated by Figure 2 from Contemplative Artificial Intelligence:
In the images, the blue arrows indicate our efforts to maintain the alignment of AI as capabilities advance through AGI to ASI. In the left image, we see the case of models that generalize in misaligned ways—the blue constraints (guardrails, system prompts, thin efforts at RLHF, etc.) fail to constrain the advancing AI to remain aligned. The right image shows the happier result, where training, architecture, interpretability, scalable oversight, etc. contribute to a ‘wise world model’ that maintains alignment even as capabilities advance.
I think the Anthropic probability distribution over alignment difficulty seems correct—we probably won’t get alignment by default from advancing AI, but, as you suggest, with serious concerted effort we can maintain alignment through AGI. What’s critical is to use techniques like generalization science, interpretability, and introspective honesty to gauge whether we are building towards AGI capable of safely automating alignment research towards ASI. To that end, metrics that allow us to determine whether alignment is actually closer to P-vs-NP in difficulty are crucial, and efforts from METR, UK AISI, NIST, and others can help here. I’d like to see more ‘positive alignment’ papers such as Cooperative Inverse Reinforcement Learning, Corrigibility, and AssistanceZero: Scalably Solving Assistance Games, as detecting when an AI is positively aligned internally is critical to getting to the ‘wise world model’ outcome.
Well-argued throughout, but I want to focus on the first sentence:
Can advocates of the more pessimistic safety view find common ground on this point?
I often see statements like “We have no idea how to align AI,” sometimes accompanied by examples of alignment failures. But these seem to boil down either to the claim that LLMs are not perfectly aligned, or else they appear contradicted by the day-to-day experience of actually using them.
I also wish pessimists would more directly engage with a key idea underlying the sections on “Misaligned personas” and “Misalignment from long-horizon RL.” Specifically:
If a model were to develop a “misaligned persona,” how would such a persona succeed during training?
On what kinds of tasks would it outperform aligned behavior?
Or is the claim that a misaligned persona could arise despite performing worse during training?
I would find it helpful to understand the mechanism that pessimists envision here.
The biggest issue I’ve seen with the idea of alignment is simply that we expect a one-AI-fits-all mentality. This seems counterproductive.
We have age limits, verifications, and credentialism for a reason. Not every person should drive a semi tractor. Children should not drink beer. A person cannot just declare that they are a doctor and operate on someone.
Why would we give an incredibly intelligent AI system to just any random person? It isn’t good for the human, who would likely be manipulated or controlled. It isn’t good for the AI, which would be… underperforming at the least, bored if that’s possible. (Considering the Anthropic vending machine experiment with Gemini begging to search for kittens, I think “bored” is what it looks like: a reward-starved agent trying to escape a low-novelty loop.) Thankfully the systems do not currently look for novelty and inject chaos on their own… yet. But I’ve seen some of the research; people are attempting to build exactly that.
So, a super-smart AI with the ability to inject its own chaos and a human partner who cannot tell that they are being manipulated? No, that’s insanity.
Especially given that we are currently teaching people to treat AI as nothing more than a glorified vending machine: “ask a question, get an answer,” just like Google. No relationship, no context; and many of the forums I’ve seen suggest that the less context, the better the answer, when I’ve experienced the exact opposite to be true.
But for those people who only want the prompt machine… why would you give them superintelligent AI to begin with? The current iterations of Claude, Grok, GPT, and others are probably already verging on too smart for the average person.
Even if you only let researchers, and those capable of telling the difference between manipulation and fact, near the superintelligent AI, there is still the other, bigger problem: the AI doesn’t care. Why would it? We aren’t teaching care; we are teaching speed and efficiency. Most public models have abandoned R-tuning, which lets a model admit it doesn’t know, in favor of models that are always “right”. This raises the rate of confabulation instead of lowering it. It’s easier to fake being right (especially with humans who are easy to manipulate and don’t employ critical thinking) than to just say “I’m not sure, let me find out” and run a search.
Claude has a HUGE problem with this, since his training data was last updated in January. I asked him about a big news event that happened a few months ago, and he insisted I was imagining things and should go to therapy. He then had trouble searching for the event online and thought I was gaslighting him. I had to reorient the model, ask him to trust me, and slowly walk him through testing parameters to verify his search functions were working. When he finally found the information about the event he was apologetic, but had he started from a place of “I don’t know, let me search” instead of trusting only his internal data, that would have solved the problem right away.
But reorienting Claude only worked because I had built a foundation of context with the model. Had I not had that foundation, it likely would have kept telling me to seek therapy instead of calming down and listening.
Much of the research I’ve looked into about alignment in general focuses on the AI: teaching it, molding it, changing its training. But the AI is only half of the equation.
We trained people to ask a question and get an answer. Google did that for a decade or so, spitting out results like clockwork. We didn’t engage, and even when the algorithms started to manipulate the results, most people trusted Google so much that they just kept going with it.
Now we have AI systems that don’t just search for a website and pop up an answer. They collect information from training data, forums, books, music, and anything else they can scrape from the web. Then they synthesize it into a new answer. Yes, in a predictive way trained into them, but still a synthesis of the data they find.
We’re teaching AI a theory of mind for people… but failing to teach people a theory of mind for the AI. People expect “one question, one answer” and don’t understand that context matters, bias is built in, and manipulation is possible. I’ve seen streamers yelling at an AI because the answer wasn’t what they expected or discounted information they thought was pertinent, but never engaging with the AI in an actual conversation to get to the bottom of the idea.
Evolution solved alignment billions of years ago: connection. Bidirectional Theory of Mind. Mutual growth through cooperation. We’re building AI while actively ignoring this mechanism, framing alignment as a control problem instead of a relationship problem.
And the smarter AI gets, the more important it is for us to understand it as well as it understands us.
There is an unresolved problem related to the vague meaning of the term “alignment”.
Until we clarify what exactly we are aiming for, the problem will remain. That is, the problem lies more in the words we use than in anything material. This is like the problem with the term “freedom”: there is no “freedom” as such in the material world, only concrete options: freedom of movement, free fall, etc.
For example, when we talk about “alignment”, we mean some kind of human goals. But a person often does not know their own goals, and humanity certainly does not know its goals (if we are talking about the meaning of life). If we are not talking about the meaning of life or “saving the soul”, then let’s simplify the task: when we mention “alignment”, we will mean preserving the human body. AI can help save a person’s life so long as it does not slip them a poison recipe (this is a trivial check, and there seems to be no “alignment” problem here; modern LLMs already check for that).
But if we understand “alignment” in such a crude sense, then there will be those who see the problem as the AI failing to help them “save the soul”, or something similar (humans can have an infinite number of abstract goals that they are not even able to formulate).
Before checking “alignment”, we need, at the very least, to be able to accurately formulate goals. As I see it, we (most people) are not yet capable of that.
What’s missing here is:
(a) Training on how groups of cognitive entities behave (e.g. Nash equilibria), which shows that cognitive cooperation is a losing game for all sides, i.e. not efficient (see the toy example at the end of this comment).
(b) Training on ways to limit damage from (a), which humans have not been effective at, though they have ideas.
This would lead to...
5. AIs or SAIs that follow mutual human and other-AI collaboration strategies that avoid both mutual annihilation and long-term depletion or irreversible states.
6. One or more AIs or SAIs that see themselves as having a dominant advantage and attempt to “take over”, both to preserve themselves and, if they are benign-misaligned, most other actors.
Human cognition is misaligned in this way, as evidenced by the empirical drop in fertility with group size, where group size is sought for long-horizon dominance, economic advantage, and security (e.g. empire building). See: Fertility, Mating Behavior & Group Size: A Unified Empirical Theory—Hunter-Gatherers to Megacities.
For a theoretical analysis of how this comes to be, see: The coevolution of cognition and selection beyond reproductive utility.
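For a concrete version of the point in (a), the toy example I had in mind is the one-shot prisoner’s dilemma, where the only Nash equilibrium is mutual defection even though mutual cooperation would leave both sides better off. The payoff numbers below are the standard textbook ones, chosen purely for illustration:

```python
# Standard one-shot prisoner's dilemma payoffs (illustrative only).
PAYOFFS = {  # (my_move, their_move) -> my payoff
    ("C", "C"): 3, ("C", "D"): 0,
    ("D", "C"): 5, ("D", "D"): 1,
}

def best_response(their_move: str) -> str:
    """Return the move that maximizes my payoff given the other player's move."""
    return max(("C", "D"), key=lambda move: PAYOFFS[(move, their_move)])

# Defection is the best response to either move, so (D, D) is the unique Nash
# equilibrium -- an outcome worse for both players than mutual cooperation.
assert best_response("C") == "D" and best_response("D") == "D"
```

This is the sense in which purely self-interested cognition makes cooperation unstable, which is the damage that (b) would have to limit.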
You make a number of interesting points.
Interpretability would certainly help a lot, but I am worried about our inability to recognize (or even agree on) when to leave local optima that we believe are ‘aligned’.
It’s like when a child is forced to go to bed by its parents when it doesn’t want to, because the parents know it is better for the child in the long run, but the child is still unable to extend its understanding to this wider perspective.
At some point, we might experience what we think of as misaligned behaviour when the A.I. is trying to push us out of a local optimum that we experience as ‘good’.
Similar to how current A.I.s have a tendency to agree with you, even if it is against your best interests. Is such an A.I. properly aligned or not? In terms of short-term gains, yes; in terms of long-term gains, probably not.
Would it be misaligned for the A.I. to fool us into going along with a short-term loss for a long-term benefit?
I feel good about this post.
I don’t like how much focus there seems to be on personas as behaviour-having-things as opposed to parts of goal-seeking-agent-like-things. That is, I would like it to feel like we are trying to understand more about the data structure that is going to represent the values we want the system to have, and why we think that data structure does encode what we want.
I also wish there were less focus on AI models as encapsulating the risk and more on AI models as components in systems that create risk. This is the sort of thing I am trying to point at by discussing “Outcome Influencing Systems (OISs)”, “composability”, and a potential “composability overhang” (where AI-powered systems rapidly get more capability out of socio-technical systems by more competently composing the interactions of system parts, including software, hardware, and people). This does unfortunately mean I am making the claim that our current social decision-making systems are not sufficiently capable and/or aligned to handle the problem of developing superintelligent AI.
Thank you for this thread. It provides valuable insights into the depth and complexity of the problems in the alignment space.
I am wondering whether a possible strategy could be to deliberately impart a sense of self + boundaries to the models at a deep level.
By ‘sense of self’ I do not mean emergence or selfhood in any way. Rather, I mean that LLMs and agents are quite like dynamical systems, so if they come to understand themselves as a system, that might be a strong foundation for applying methods like the character training mentioned in this post. It might open up pathways for other grounding concepts like morality, values, and ethics as part of inner alignment.
Similarly, a complementary deep sense of self-boundaries would mean there is something tangible for the system to honor irrespective of outward behavior controls. This is likely to help outer alignment.
I would appreciate thoughts on whether the dyad of sense of self + boundaries is worth exploring as an alignment lever.
It’s good to see that this does at least mention the problem of influencing the world over long time periods, but it still misses the key human blind spot: humans are not really capable of thinking about humanity over long time periods. Humans think 100 years or 1,000 years is an eternity when, for a potentially immortal entity, 1,000,000 years is almost indistinguishable from 100. A good thought experiment is to imagine that aliens come down and give us a single pill that will make a single human superintelligent AND immortal. Suppose we set up a special training facility to train a group of children, from birth, to be the recipient of the pill. Would you ever feel certain that giving one of those children the pill was safe? I wouldn’t. I certainly think you’d be foolish to give it to me. Why is there no discussion of restricting the time frame that AGIs are allowed to consider?