Thanks for the detailed comment (strong upvoted), I agree with much of it.
Some responses:
Current AIs do not training-game, i.e. actively model the training process and take actions that result in high reward (including in training environments very different from ones they’ve previously encountered).
I’m somewhat mixed on this. I think Anthropic AIs don’t moderately-competently training-game, but I’m not that sure and public evidence doesn’t make me super confident either way about (e.g.) Mythos Preview. I think some (historical?) OpenAI AIs do moderately-competently training-game, though I’m not sure it quite meets the bar you describe here. Like, certainly there were some AIs trying reasonably hard to think about the grader and how to perform well (and doing so moderately consistently), but not necessarily super competently.
Current AIs do not strategically take steps to cover up their mistakes, and will generally admit to them when asked directly.
I agree with “will generally admit to them when asked directly”, but feel more confused about “strategically take steps to cover up their mistakes” (though probably I mostly agree with you). I don’t think I’ve seen Opus 4.5/4.6 do this. (I’ve seen them do stuff that has the effect of making their mistakes less visible and I suspect it’s best understood as partially intentional, but I haven’t seen them do serious cover up efforts, as in things that are clearly very bad behavior such that you’d never want to do anything like that as part of a task.) But I don’t feel super confident we’d know this was happening at some very low rate in models, especially in Mythos Preview given that this behavior was seen in an earlier checkpoint. Also, it seems pretty plausible to me that some OpenAI models do this when they actively decide to cheat, though I’m not aware of any examples off the top of my head (at least of mostly-natural examples).
I agree with (2), (3), (5).
When I speak to the people who IMO are extremely thoughtful about AI safety, they acknowledge something like “Yeah, TBH a few years ago, I thought that AIs as capable as current ones would be misaligned in some of those ways. There’s a real positive update here.”
I partially agree with this, but think the update isn’t that big.
Here is an attempt at reconstructing some of my historical views:
My view as of at least mid 2024 was that scheming was quite unlikely as of this level of capability (at least without a significant architectural shift). I tentatively think I would have put ~2% on Mythos-level AIs being schemers (assuming no architectural shifts / surprises in the training mix).
Minimally, I had drafted the core content of How will we update about scheming? in October 2024 and also have an unpublished draft technical AI safety scenario forecast / plan (also written in October 2024) where I put the odds of scheming in an AI that’s significantly more capable than Mythos Preview (trained in basically the same way) at ~4%[1]. (To be precise, this 4% was for AIs at the level of ~3x AI R&D labor acceleration; I think Mythos Preview is maybe around 1.6x, though I’m very uncertain.)
I expected companies not to deploy obviously reward hacky / misaligned models because I thought this would be pretty easy to train away or mitigate. This was wrong.
I didn’t have a strong view about whether models would do some serious training gaming. If I had thought about it and made predictions, I think what actually happened would have been somewhat above my median for the extent to which we see clear training-gaming reasoning (from OpenAI models) but somewhat below my median for overall badness of training-gaming. But this is hard to call.
I expected we’d have better / stronger model organisms of scheming by now via somewhat artificial training runs, but idk how much to update about risk versus the amount of overall effort on this.
I would have expected that models as reward hacky as current models would sometimes strategically take steps to cover up their mistakes/cheats and would be less inclined to admit to them when asked directly. More precisely, I would have expected it would be doable to adjust training to mostly mitigate AIs doing this sort of cover up, but this issue would show up by default. I’m somewhat surprised by the current situation, but not very strongly.
A key part of my perspective is that my P(scheming) for a given AI is closely related to how good it is at very general-purpose opaque cognition (including stuff like carefully thinking through complex plans). Related to this, I also think that P(scheming) increases a bunch if we end up with AIs that have neuralese: opaque serial recurrence or neural memory stores with high serial depth. I think neuralese is likely prior to human obsolescence.
I’ve updated down a decent amount on P(scheming before full automation of AI R&D) mostly due to updating down on P(neuralese before full automation of AI R&D). This is mostly from (1) updating towards faster AI progress (shorter timelines) using current methods (normal LLMs) and thus expecting to see fewer changes, and (2) seeing a non-trivial fraction of the distance to full automation of AI R&D crossed without neuralese.[2]
I’ve also updated towards current AI training having more predictable/controllable results. Part of this is getting to a higher level of capability while having generalization that’s less “interesting”/far (like learned behaviors are more context-specific and narrower than I was expecting at this level of capability). People from AI companies tell me that mundane misalignment (and other behavioral issues) are usually (maybe consistently?) due to pretty mundane/obvious reasons as opposed to being from relatively far generalizations. (Though this seems somewhat contradictory with other things, like the behavior of some OpenAI models, and I also don’t think we have a great (public) understanding of exactly where eval awareness (both capabilities and propensities) is coming from.[3])
However, I also expected that doing a bit of training against mundane misalignment would generalize quite well, so I don’t know if solving mundane misalignment issues looks easier or harder than I expected.
Simultaneously, I’ve also updated towards AI companies being less competent (and the training process being a complex mess that’s harder than I expected to intervene on), shorter timelines, somewhat faster takeoff, and lower levels of political will.
So, I’ve overall updated towards a lower technical difficulty of mitigating misalignment risk at a given level of capability, but the world (e.g. AI companies) doing a worse job handling this risk.
I also think that many people I talk to have not grappled with this observation sufficiently and updated their world models and prioritization accordingly.
I’m unsure whether I’ve sufficiently updated my world models and prioritization.
For instance, I think that the AI safety community is overinvested in scheming research relative to scalable oversight research
I think we are (or at least were) underinvested in work on differentially making AIs better on extremely hard-to-check, conceptually loaded work. (See my post on deference for more discussion of this sort of work.)
But I also think we’re underinvested in AI control and model organisms. (I don’t have a strong view on whether we’re overinvested in handling scheming overall.)
My view is something like: Perhaps 1⁄3 of the risk comes from the pre-human-obsolescence schemers while 2⁄3 comes later (at capability levels beyond the point where ideally we’ll have already handed off to AIs). However, it’s also significantly harder to (differentially) work on making handoff go better, so the safety community’s current effort should be significantly more focused (~1.5x more effort?) on the pre-human-obsolescence schemers relative to handoff/elicitation. (A bunch of more general-purpose stuff helps with both.)
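A rough sketch of the arithmetic behind these numbers, reading “~1.5x more effort” as a 1.5 : 1 effort ratio and ignoring the general-purpose work that helps with both. The proportional-allocation rule (effort proportional to risk share times tractability) and all variable names are assumptions added for illustration; the comment itself does not state any such rule.

```python
from fractions import Fraction

# Figures taken from the comment above.
risk_pre = Fraction(1, 3)      # risk share from pre-human-obsolescence schemers
risk_post = Fraction(2, 3)     # risk share arriving later (past the handoff point)
effort_ratio = Fraction(3, 2)  # "~1.5x more effort" on pre-obsolescence vs handoff

# Effort shares implied by a 1.5 : 1 ratio between the two buckets.
effort_pre = effort_ratio / (effort_ratio + 1)
effort_post = 1 - effort_pre

# ASSUMPTION (not in the comment): effort is split in proportion to
# (risk share) x (marginal tractability). Solving
#   effort_pre / effort_post = (risk_pre * t_pre) / (risk_post * t_post)
# for t_pre / t_post gives the tractability ratio this toy rule would imply.
implied_tractability_ratio = effort_ratio * risk_post / risk_pre

print(f"effort on pre-obsolescence schemers: {float(effort_pre):.0%}")
print(f"effort on handoff/elicitation:       {float(effort_post):.0%}")
print(f"implied tractability ratio:          {implied_tractability_ratio}")
```

Under that toy rule, the 1.5 : 1 effort split only makes sense if differential work on pre-obsolescence schemers is roughly three times as tractable as differential work on handoff, which is one way to put a number on “significantly harder”.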
I don’t think that current AIs are aligned enough to automate alignment research. (An earlier draft of your post operationalized “pretty misaligned” in this way, which I preferred.)
Yeah, maybe I should have kept this emphasis/operationalization. I cut this because I wanted to get this post out earlier without having to think and write about a bunch of the related questions like “how much will this be solved by default due to incentives (e.g. to automate capabilities R&D)”.
Thanks for this comment, especially your detailed breakdown of your updates from current models.
My biggest uncertainty with what you write is:
it’s also significantly harder to (differentially) work on making handoff go better [relative to handling pre-human-obsolescence schemers]
I think the problem of making handoff go better has a similar profile to generic capabilities work: lots of surface area/things to try, even if clear wins are hard to come by. For instance, people could try making a bunch of training environments like this one for training automated alignment researchers. (In contrast, I think the space of interventions for scheming risk is somewhat narrow.) As you implicitly note, work on making handoff go better also blends into generic capabilities work, so it is less differential (and more socially awkward to do as a safety researcher). Overall, I find it plausible that more safety researchers should work on making handoff go better, relative to scheming risk.
For instance, people could try making a bunch of training environments like this one for training automated alignment researchers
Could you elaborate on how this helps? Is the argument that we’ll largely bootstrap ourselves even from current models, such that scalable oversight itself will be solved by some setup like this (assuming maybe a generation or two after Mythos-level competency)? In general I was pretty surprised that that post wasn’t more than “PGR + lots of manual effort on reward hacks”, as that’s where I thought the default approach for W2S / scalable oversight had been for a long time (with the obvious problem that this exact setup doesn’t scale to the superhuman regime). I wouldn’t consider building environments like this to be scalable oversight research, but it’s plausible to me there’s some argument that conditions on how they’re intended to be used.
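For readers unfamiliar with the metric, here is a minimal sketch of PGR, assuming it means “performance gap recovered” as used in weak-to-strong (W2S) generalization work: the fraction of the gap between the weak supervisor and the strong model’s ceiling that weakly supervised training recovers. The function name and the example scores are hypothetical.

```python
def performance_gap_recovered(
    weak: float, weak_to_strong: float, strong_ceiling: float
) -> float:
    """Fraction of the weak-to-ceiling gap recovered by weak supervision.

    weak:            score of the weak supervisor on the task
    weak_to_strong:  score of the strong model trained on weak labels
    strong_ceiling:  score of the strong model trained on ground truth
    """
    gap = strong_ceiling - weak
    if gap <= 0:
        raise ValueError("strong_ceiling must exceed weak for PGR to be defined")
    return (weak_to_strong - weak) / gap


# Hypothetical scores, purely for illustration.
example_pgr = performance_gap_recovered(
    weak=0.60, weak_to_strong=0.72, strong_ceiling=0.80
)
print(f"PGR = {example_pgr:.2f}")
```

Note that `strong_ceiling` requires ground-truth labels for the task. One reading of the scaling concern in this comment is that this term stops being measurable once the task is beyond what humans can label or check.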
with the obvious problem that this exact setup doesn’t scale to the superhuman regime
Could you say more about what you think fails in the superhuman regime? Assuming the reward hacking problem was solved, do you think it would still fail to scale?
Is the argument that we’ll largely bootstrap ourselves even from current models, such that scalable oversight itself will be solved by some setup like this?
Let’s assume, per the above, that we have more RL environments for “automated alignment research” but that they are not solving scalable oversight. You continue producing such environments for as long as you can produce better results than the models, doing elicitation against some human-provided upper bound. Eventually you reach the point where models consistently produce better[1] results than humans; however, we are now the weak supervisor, as we can no longer necessarily distinguish (1) output quality or (2) whether the model’s capabilities are well elicited on safety research.
The typical regime I’d imagine here is (1) they’re at least as good as us and (2) based on trends / the model’s capabilities elsewhere / how the research looks to us, we expect they should be better than us.
i think you are conflating two distinct problems: elicitation and scalable oversight. unsupervised / weakly supervised elicitation methods definitely have the potential to scale to the superhuman regime.
Sorry, I wasn’t claiming that they can’t; indeed, that’s the primary regime they’re intended to be useful for. My question was specifically: unless you’re designing the environments themselves to be used towards these problems, it’s unclear to me how more RL environments for other contexts solve this problem.
are you saying that the unsupervised/weakly supervised elicitation methods discovered on a series of tasks A, B, C, … would not generalize to held-out tasks X, Y, Z?
And it’s also plausible to build RL environments to solve scalable oversight in a way that 1) might generalize to the superhuman regime, or 2) directly works in the superhuman regime—since there are certain superhuman tasks that you can find workarounds to get objective ground truths.
Do you mean potentially building environments where humans directly construct the environments to train models such that they generalize to the superhuman regime for alignment research, without some intermediate stage where you’re building RL environments for the agents themselves to do the research?
no i meant building RL environments for the agents to solve superhuman tasks where ground-truth labels are available, e.g. predicting research results, predicting subtext, etc.
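As a minimal sketch of what a ground-truth reward could look like for a task such as predicting research results: the model forecasts a binary outcome (say, whether a held-out experiment will show an effect), and once the outcome is known the forecast is scored with a proper scoring rule. The function name, the binary framing, and the choice of the Brier score are all assumptions for illustration, not something proposed in the thread.

```python
def forecast_reward(predicted_prob: float, outcome: bool) -> float:
    """Reward in [0, 1], computed as 1 minus the Brier score of a binary forecast.

    predicted_prob: the model's stated probability that the outcome occurs
    outcome:        the ground-truth result, known only after the fact
    """
    if not 0.0 <= predicted_prob <= 1.0:
        raise ValueError("predicted_prob must lie in [0, 1]")
    target = 1.0 if outcome else 0.0
    return 1.0 - (predicted_prob - target) ** 2


# Hypothetical forecasts of 80% on two held-out results.
reward_if_true = forecast_reward(0.8, True)    # confident and correct
reward_if_false = forecast_reward(0.8, False)  # confident and wrong
print(reward_if_true, reward_if_false)
```

Because the Brier score is a strictly proper scoring rule, expected reward is maximized by reporting one’s true probability, so the grader needs only the eventual outcome and no human judgment of the reasoning.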
[1] Conditional on scheming at this level of capability, I predicted 2⁄3 of this probability would be pretty derpy scheming (e.g. AI does very dumb strategies that don’t really make sense and are much dumber than what it does in domains where it’s relatively most capable like SWE) and 1⁄3 would be kinda competent scheming (relative to the level of capability).
[2] There is also another factor I don’t want to say here, though it’s not like under NDA or something.
[3] Maybe our level of understanding is generally better for propensities than for capabilities? Maybe people at companies know and it’s just from like better pretraining or something?
Minor but:
I do not think training environments like this one would help directly.