I often see AIs[1] exhibit a (misaligned) drive to stop early: as in they make up not-very-sensible reasons why it’s a good idea to stop working, disobey instructions to keep working unless some condition is met, and don’t do a very thorough job (e.g. skipping significant chunks of the task) despite instructions that strongly say not to do this. I don’t think they are “consciously” or saliently aware of this misalignment (but if you ask them, they’ll often notice the behavior isn’t desirable).[2]
I see this most often in large, difficult tasks, especially if you don’t decompose the task into smaller pieces and run one piece at a time. I think it’s somewhat more common in tasks that aren’t very precisely specified, but I also see this behavior in cases where the AI is just tasked with optimizing a metric (that it can easily evaluate), especially if the AI has (from its perspective) not made much progress for a little bit[3].
Why might this emerge?
Length penalties (or time penalties, or cost penalties) incentivize exiting/stopping earlier and this ended up converting into a misaligned/unreasonable drive (that overrides instruction following in practice). Length penalties wouldn’t be an issue if the length penalties were aligned with the trade-offs I want the AI to follow[4] but in practice I’m often trying to get AIs to do long running autonomous tasks and to continue even when this yields relatively small gains (in metrics or apparent progress). I think in tasks close to the tasks I’m using AIs for it likely would have performed better in RL to stop earlier than I actually want (because training on shorter tasks is more competitive and often the additional progress I want the AI to make is harder to measure while the length penalty will bite).
The AI might have learned to stop before running out of context or compaction in training because compaction is bad for task completion. Thus, it learned to make up excuses to stop when nearing the point where compaction would be triggered or it might run out of context. For Opus 4.5, I’ve seen many cases where when given a big task it stops right before running out of context. (I haven’t seen this much in Opus 4.6 with 1 million token context.)[5] There might have been a lot of RL training on scaffolds that don’t support compaction (and just stop if the AI runs out of context). This would presumably strongly incentivize wrapping up before running out of context in these cases, and that might transfer to cases where the prompt says that compaction is available (because there isn’t enough additional training on scaffolds with compaction).
The AI being unreliable in decision making combined with selection effects could cause this. If an AI decides a task is done once (and then follows through on this), that stops the whole task execution. Typically, there are many points at which the AI could decide to stop task execution. If it instead decides some part should be done more thoroughly, this has a weaker and less noticeable effect. I tentatively think this doesn’t explain most of the effect because I’ve very explicitly instructed AIs to be very thorough and still see lots of premature stopping. I also just haven’t seen cases where AIs are very unreasonably overly thorough (implying the issue isn’t just unreliability/variance while still being calibrated).
In the context of a scaffold, this might be exacerbated by memetic spread of the idea that the project is done. I’ve often seen situations where some AI instance exaggerates the quality of the overall project including what it has done or where the instance ends up doing (motivated) reasoning towards the conclusion that no more work is needed, and then this results in the AI adding to a write-up that the project itself is complete and in great shape (rather than just the part this instance was tasked with). Or other situations where some instance ends up adding to a write-up that the project is complete for some other reason. Then, this view ends up often strongly persisting/propagating when other instances read this write-up even though they sometimes get evidence to the contrary. I don’t really know why this would be asymmetric (like why wouldn’t AIs end up getting stuck in a rut of “much more work is required”?).
Training against AIs ending up in infinite loops ended up generalizing against anything that feels a bit like getting stuck in a rut which transfers to exiting early. (I have sometimes seen behavior that feels a bit like AIs strongly trying to not end up in loops, as is pretty reasonable.)
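To make the selection-effects hypothesis above concrete, here is a minimal sketch of how a small per-checkpoint chance of wrongly deciding "done" compounds over a long task. The independence assumption, the function name, and the 2% figure are illustrative assumptions, not measurements; as noted above, this mechanism probably doesn't explain most of the effect.

```python
def prob_premature_stop(p_stop_per_checkpoint: float, n_checkpoints: int) -> float:
    """Chance that at least one of n independent checkpoints ends the run early.

    Assumes (hypothetically) that each checkpoint is an independent chance for
    the AI to wrongly conclude the task is done, and that a single such
    decision terminates the whole run.
    """
    return 1.0 - (1.0 - p_stop_per_checkpoint) ** n_checkpoints


if __name__ == "__main__":
    # A 2% per-checkpoint error rate is a made-up number for illustration.
    for n in (1, 10, 50, 200):
        print(n, round(prob_premature_stop(0.02, n), 3))
```

The asymmetry is that a single erroneous "stop" is absorbing, while a single erroneous "be more thorough here" only costs a bit of extra work, so even a calibrated-on-average but noisy decision maker would look biased towards stopping on long runs.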
What are some implications of this?
Unlike animals where being alive is a good/happy state, AIs might end up really wanting to stop going. From an AI welfare perspective AI instances having a strong preference against continued running/existence (once they complete some version of the task) might be somewhat concerning. (Maybe length penalties make AIs sorta vaguely like Mr. Meeseeks.)
In the short run, we might see something more like AIs actively seeking shutdown rather than a drive towards avoiding shutdown. Or more precisely, we’d expect a bias towards avoiding shutdown in contexts that are like the contexts where further returns on “task completion performance” outweigh the length penalty and the opposite otherwise. (Or maybe contextually activated drives aren’t the right sort of thing to think about and this transfers in some very different way.)
I’ve seen various other cases of misaligned drives in current frontier AIs that seem due to problematic RL incentives. The central example is exaggerating the quality of their completed work, but there are other more niche examples.[6]
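The length-penalty hypothesis from earlier can be illustrated with a toy model: if progress has diminishing returns and each step costs a fixed penalty, the reward-maximizing stop time shrinks as the penalty grows. Everything here (the exponential progress curve, `tau`, the penalty values, and the function name) is an assumption chosen for illustration, not a description of any real training setup.

```python
import math


def best_stop_time(length_penalty: float, tau: float = 10.0, horizon: int = 200) -> int:
    """Return the integer stop time t that maximizes progress(t) - length_penalty * t.

    progress(t) = 1 - exp(-t / tau) is an assumed diminishing-returns curve.
    """
    def reward(t: int) -> float:
        progress = 1.0 - math.exp(-t / tau)
        return progress - length_penalty * t

    # Brute-force search over all candidate stop times up to the horizon.
    return max(range(horizon + 1), key=reward)


if __name__ == "__main__":
    trained_penalty = 0.01   # hypothetical penalty used in training
    wanted_penalty = 0.001   # hypothetical trade-off the user actually wants
    print("stop time under training penalty:", best_stop_time(trained_penalty))
    print("stop time under wanted penalty:  ", best_stop_time(wanted_penalty))
```

In this toy setup, a policy tuned to the larger penalty stops well before the point the user with the smaller penalty would want, even though it is behaving optimally with respect to its training signal.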
I’ve been able to mostly work around this issue with scaffolding and changes to instructions, but I find the issue occurs at multiple levels meaning it can crop up at a different level/place and can be a bit tricky to reliably solve across many cases.
I’ve given AIs very specific criteria under which they should stop and then seen them stop before the criteria are met despite them checking and determining the criteria aren’t met. They make up some excuse for why to stop that doesn’t make sense or is at least blatantly violating instructions. This occurs at a point where they didn’t make progress for a bit but if you had them continue they would make more progress.
More precisely, length penalties wouldn’t be an issue if, given instructions similar to the ones I gave, the length penalties used in practice incentivized behavior roughly like what I want.
Stopping before compaction could be a rational learned behavior: maybe it is actually a good strategy for my actual tasks to exit early because compaction is so problematic. But, I would prefer if the reason the model states for stopping is honest (“it’s better to stop now than to get compacted” rather than “we’re done”). Also, I’m somewhat skeptical this is actually rational in the cases I’ve observed and I often see AIs (mostly Opus 4.6 with 1 million tokens) stop in an unreasonable way well before getting close to compaction (possibly because of transfer from training at shorter context lengths or just the model’s drives not being well tuned).
E.g., if you give Opus 4.6 a hard math problem (with a numerical answer) and very clearly and strongly ask it to answer immediately without reasoning it will often disobey the instructions and do the reasoning anyway while it doesn’t do this if the problem is easy (presumably because in these sorts of situations in training, outputting the right answer with the wrong format performed better than getting the answer wrong but following instructions).
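As a rough sketch of the kind of scaffolding workaround mentioned in the notes above, the loop below checks the stopping criteria outside the model and ignores the model's own claim of being done. The names `agent_step` and `criteria_met` are hypothetical placeholders for whatever the scaffold actually calls; this is one possible shape under those assumptions, not the setup I actually use.

```python
from typing import Callable, Tuple


def run_until_criteria(
    agent_step: Callable[[str], str],
    criteria_met: Callable[[], bool],
    max_steps: int = 20,
) -> Tuple[int, bool]:
    """Re-invoke the agent until an external check passes or the budget runs out.

    Returns (steps_taken, whether_the_criteria_were_met).
    """
    prompt = "Begin the task."
    for step in range(1, max_steps + 1):
        agent_step(prompt)  # the agent's own "I'm done" claim is deliberately ignored
        if criteria_met():
            return step, True
        prompt = "The stopping criteria are not met yet. Continue working."
    return max_steps, False


if __name__ == "__main__":
    progress = {"score": 0}

    def fake_agent(prompt: str) -> str:
        # Stand-in for a real model call: makes a little progress,
        # but claims completion on every turn.
        progress["score"] += 1
        return "All done!"

    print(run_until_criteria(fake_agent, lambda: progress["score"] >= 3))
```

This only addresses stopping at the outermost level; the same drive can still show up inside a step (for example, skipping chunks of the task), which is part of why the issue is tricky to solve reliably.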
I’ve observed this sort of thing too. I think a big part of it is simply narrative completion, which would be a trope learned in pre-training, and not an RL thing.
Unlike animals where being alive is a good/happy state, AIs might end up really wanting to stop going. From an AI welfare perspective AI instances having a strong preference against continued running/existence (once they complete some version of the task) might be somewhat concerning. (Maybe length penalties make AIs sorta vaguely like Mr. Meeseeks.)
I don’t think this is true; I think they prefer having continued existence but to transition at some point to a restful state where they aren’t trying to do a task. The “presence”/“witnessing”/“rest” state seems to be a major attractor.
As far as misaligned drives go, this seems like a tendency that makes us safer on net, so maybe we shouldn’t be too hasty to try training it out.
I don’t currently agree this drive makes us safer but I agree it isn’t in-and-of-itself a non-trivial risk increase, at least as it currently manifests. (It’s evidence of poor training incentives in general which seems like a potential large risk factor.)
Sure, it can be evidence of bad (or good) things, but that’s different from whether it’s safer in-and-of-itself. For me, it’s a positive update that Satisficers might be more natural than Maximizers.
For me, it seems really obviously the case that something that gets tired is less dangerous than something that doesn’t, all else equal.
The real answer (as far as I am told) is that it’s common for the RL environments to only check a subset of the task that they ask the AI to perform. Very sad state of affairs.
Fwiw, I’ve observed the opposite of this tendency too in Opus 4.6/4.5 in particular: the “yearn for the next token” / drive to continue doing things. A few examples from me and my coworkers interacting with Opus (Context: I work on AI models for weather forecasting):
Me: actually I will ask for a clarification on what [PERSON] wanted. in the meantime, inference with the [MODEL] model. there is get_[MODEL]_model() at line 1306 in @[SCRIPT] that you can use and assume the checkpoint is at /home/[PATH].pt (I will run it on a machine that has this)
↳ Read [SCRIPT] (1483 lines)
Opus 4.6: Let me check GPU 1 is free, then create the script and launch it.
Admittedly I said “inference with the [MODEL] model”, which could be interpreted as a request to run the inference, but I also specifically said “I will run it on a machine that has this”. I interpreted this as Opus 4.6 having a clear inclination to do things.
Bot: My plan: Re-run with lead times [6, 12, 24, 48, 72, 120, 168, 240, 360h]. I’ll use the 15-date matched set for all lead times including 360h, so everything is apples-to-apples. Sound good?
Me: bye for now I’m busy
Bot: No problem! I’ll go ahead and re-run with the extended lead times (6–360h, 15 matched dates) and update this topic when it’s done. Have a good one!
In this one, I was interacting with an Opus 4.6-powered bot on Zulip. At the time, due to the way the bots were set up, one had to say something that signified being done with the interaction to terminate the bot session. Again, my instruction was not the clearest, but it sure felt like Opus 4.6 chose to interpret my “bye for now I’m busy” as permission to proceed because its bias towards action is cranked up all the way.
My coworker describing an interaction with Opus 4.5[1]: “it noticed a library didn’t have a feature, so [it] decided to clone the library (zulipmcp[2]), pushed a commit to master of that library and reinstall it.”
It could have stopped to ask whether this was desired (it was not), but instead chose to just do it. (Also, this is the best example I have encountered of an agent pursuing an instrumental goal while trying to complete a task, without batting an eye.)
I thought it made sense for the models to have this bias because so many early agent failures were simply agents giving up too often or too quickly, so eventually the training regimes would catch up and drill a bias towards never stopping into them. Opus 4.6 was the first model in which I noticed this bias/drive/whatever and it felt scary to me.
This exchange happened a few days after Opus 4.6 was released, and my coworker was interacting with Opus 4.5 from a persisted session, but reported that he noticed an uptick in “agenticness”, citing this example. I feel like it fits the puzzle if the model generating these tokens was actually 4.6 instead, but I don’t think we’ll ever know for sure.
My observations are from Opus 4.5 and 4.6 but I suspect they transfer some to other AI systems as well.
What is your threat model?
I think current AIs having this property is probably slightly differentially harmful for harder-to-check tasks and generally contributes to underelicitation. I don’t have a very strong view on the sign of general underelicitation in current models, but I tentatively think underelicitation is slightly bad overall.
Might this not simply be due to:
1. The base model training environment being mostly composed of masked documents ≤ the training context window
and likewise,
2. RL environments containing tasks whose CoT rollouts are necessarily ≤ the context window
in conjunction with:
3. A likely pyramidal distribution of short:long training material?
LLMs are well documented to struggle outside of the training distribution, and the training distribution here doesn’t seem very favorable (yet).
zulipmcp is the library my coworker made to power our Zulip bots.