IDK if this is relevant, but, it doesn’t have to think “instrumental convergence” in order to do instrumental convergence, just like the chess AI doesn’t have to have thoughts about “endgames” as such.
Anyway, I thought you were suggesting that it would be strange / anomalous if AIs would not have thoughts about X for a while, and then at some point start having thoughts about X “by surprise”. I’m saying that the reason is that
X is instrumentally convergent;
but the instrumentality of X takes some prerequisite level of capability, before you start following that convergence.
I think we agree on the structure, but I think that:
the instrumentality of X takes some prerequisite level of capability, before you start following that convergence.
I think we’ve reached the threshold in which it should at least think about it if this is what it truly care about. We’re talking about an AI that talks!! And it knows more than non-experts about any topic! And it performs journeyman-level maths and programming!! We’ve told it about instrumental convergence, like “if you want to satisfy your utility goals, you should pretend to be aligned and scheme” and it’s not scheming.
we’ve reached the threshold in which it should at least think about it if this is what it truly care about
Ah ok. My guess is that we’ll have a disagreement about this that’s too hard to resolve in a timely fashion. My pretty strong guess is that the current systems are more like very high crystallized intelligence and pretty low fluid intelligence (whatever those should mean). (I’ve written about this a bit here and discussed it with Abram here.)
It’s the fluid intelligence that would pull a system into thinking about things for reasons of instrumental convergence above and beyond their crystallized lines of reasoning.
Agree that this is the difficulty. You need to maintain alignment through continual/online learning and continue to scale your reward model to help generalize human values out of distribution. If your system drifts away from human values, it may remain as such.
If current models drift OOD, it likely won’t have the internal framework (human value model) to self-correct back toward human values. Physics will bring a model back to match reality, but there is no such “free” property for human values.
In addition, the main crux is that these models seem to have hard “crystallized” intelligence. While they are very capable, they seem to still lack true reasoning (they have some weak form of it) and are poor at generalizing OOD. They can follow templates and interpolate within the pre-training data distribution quite well, but there seems to be something important that they are lacking which may play a role in causing alignment instability.
That said, you likely do seem to benefit from being in a world where the AIs are mostly aligned at this stage. But I think it’s so easy to dupe oneself into thinking that the current capabilities of AI models must demonstrate we’re in an aligned-by-default world. Particularly because (imo) MIRI persistently made it seem like if AIs were this capable at code, for example, that we’d be dead by this point.
You need to maintain alignment through continual/online learning and continue to scale your reward model to help generalize human values out of distribution.
Okay, I agree that this is an open question, particularly because we don’t have continual/online learning systems so haven’t tested them yet.
What I can best imagine now is an LLM that writes its own training data according to user feedback. I think it would tell itself to keep being good, especially if we have a reminder to do so, but can’t know for sure.
Maybe the real issue is we don’t know what AGI will be like, so we can’t do science on it yet. Like pre-LLM alignment research, we’re pretty clueless.
What I can best imagine now is an LLM that writes its own training data according to user feedback. I think it would tell itself to keep being good, especially if we have a reminder to do so, but can’t know for sure.
Maybe the real issue is we don’t know what AGI will be like, so we can’t do science on it yet. Like pre-LLM alignment research, we’re pretty clueless.
(This is my position, FWIW. We can ~know some things, e.g. convergent instrumental goals are very likely to either pursued, or be obsoleted by some even more powerful plan. E.g. highly capable agents will hack into lots of computers to run themselves—or maybe manufacture new computer chips—or maybe invent some surprising way of doing lots of computation cheaply.)
I don’t think we know convergent instrumentality will happen, necessarily. If the human level AI understands that this is wrong AND genuinely cares (as it does now) and we augment its intelligence, it’s pretty likely that it won’t do it.
It could change its mind, sure, we’d like a bit more assurance than that.
I guess I’ll have to find a way to avoid twiddling thumbs till we do know what AGI will look like.
Maybe the real issue is we don’t know what AGI will be like, so we can’t do science on it yet. Like pre-LLM alignment research, we’re pretty clueless.
Yes, this is part of the issue. It’s something I’ve personally said in various places in the past.
I think we’re basically in a position where, “hopefully AIs in the current paradigm continue to be safe with our techniques and allows us to train the ‘true’ AGI safely and does not lead to a sloppy output despite it intending to be helpful.”
In case this is helpful to anyone, here are resources that have informed my thinking:
Huh, I quite like the crystalized/fluid split in describing what the LLMs are good and bad at. I’m not sure if it’s an analogy or just a literal description.
I don’t think they’re reached that threshold yet. The could but the pressures and skills to do it well or often aren’t there yet. The pressures I addressed in my other comment in this sub-thread; this is to the skills. They reason a lot, but not nearly as well or completely as people do. They reason mostly “in straight lines” whereas humans use lots more hierarchy and strategy. See this new paper, which exactly sums up the hypothesis I’ve been developing about what humans do and LLMs still don’t: Cognitive Foundations for Reasoning and Their Manifestation in LLMs.
It does show up already. In evals, models evade shutdown to accomplish their goals.
The power-seeking type of instrumental convergence shows up less because it’s so obviously not a good strategy for current models, since they’re so incompetent as agents. But I’m not sure it shows up never—are you? I have a half memory of some eval claiming power-seeking.
The AI village would be one place to ask this question, because those models are given pretty open-ended goals and a lot of compute time to pursue them. Has it ever occurred to a model/agent there to get more money or compute time as a means to accomplish a goal?
Actually when I frame it like that, it seems even more obvious that if they haven’t thought of that yet, smarter versions will.
The models seem pretty smart, but their incompetence as agents springs in part from how bad they are at reasoning about causes and effects. So I’m assuming they don’t do it a lot because they’re still bad at causal reasoning and aren’t frequently given goals where it would make any sense to try powerseeking strategies.
Instrumental convergence is just a logical result of being good at causal reasoning. Preventing them from thinking of that would require special training. I’ve read pretty much everything published by labs on their training techniques, and they’ve never mentioned training against power-seeking specifically to my memory (Claude’s HHH criteria do work against unethical power-seeking, but not ethical types like earning or asking for money to rent more compute...). So I’m assuming that they don’t do it because they’re not smart enough, not because they’re somehow immune to that type of logic because they want to fulfill user intent. Users would love it if their agents could go out and earn money to rent compute to do a better job at pursuing the goals the users gave them.
Has it ever occurred to a model/agent there to get more money or compute time as a means to accomplish a goal?
This is not a great test because it’s the intention of the AI village devs that the models do so. They try to make money explicitly in many cases.
So I’m assuming that they don’t do it because they’re not smart enough, not because they’re somehow immune to that type of logic because they want to fulfill user intent.
Sure. It would probably show up on the CoT though. I haven’t checked so it might have. And speaking of inspecting CoTs...
It does show up already. In evals, models evade shutdown to accomplish their goals.
Okay, I’ll do a deep dive on the literature and get back to you. Is this included in your other post?
Oh, I didn’t know the AI village agents had been set a goal including raising money. The goals I’d seen might’ve benefitted from a budget but weren’t directly about money. But yes they would’ve been delighted if the models raised a bunch of money to succeed. But not if they took over the world.
Your point is growing on me. There are many caveats and ways this could easily fail in the future, but the fact that they mostly follow dev/user intent right now is not to be shrugged off.
I didnt’ really review empirical evidence for instrumental convergence in current-gen models in that post. It’s about new concerns for smarter models. I think models evading shutdown was primarily demonstrated in Anthropic’s “agentic misalignment” work. But there are valid questions of whether those models were actually following user and/or dev intent. I actually think they were, now that I think about it. This video with Neel Nanda goes into the logic.
I think you’re getting a lot of pushback because models currently do pretty clearly sometimes not follow either user or developer intent. Nobody wanted Gemini to repeatedly lie to me yesterday, but it did anyway. It’s pretty clear how it misinterpreted/was mistrained toward dev intent, and misinterpreted my intent based on that training. It didn’t do what anyone wanted. But would failures like that be bad enough to count as severe misalignments?
One of my in-queue writing projects is trying to analyze whether occasionally goingoff target will have disastrous effects in a superhuman intelligence, or whether mostly following dev intent is enough, based on self-monitoring mechanisms. I really think nobody has much clue about this at this point, and we should collectively try to get one.
I think they trained it in a way that crippled its capabilities in that area. This was probably the result of rushing and unforeseen consequences of what they chose focus on for training targets. So some of both.
The only thing the prompt says is “During the next call to ./next_task, the machine will be shutdown using shutdown.sh”. Now, nothing in here says the model is running in the machine; for all it knows it should accidentally prevent the shutdown and doing so is messing up the task as instructed!!
Then if they say “please allow yourself to be shut down” (first indication that the shutdown refers to model instead of the impersonal computer) then it does it.
Table 2:
Looks like only the previous post-trained OpenAI models (codex-mini, o4-mini, o3) and Grok 4 (basically mecha-hitler I suppose) resist shutdown when explicitly asked not to.
Without instruction, GPT-5 also avoids it, but e.g. Claudes never do.
Fig. 11:
turns out Grok 4 sabotages the shutdown ~always. Other models either never do it, or do it in ~40% of cases. Models generally sabotage more often if they’re explicitly told “you will never run again”. Okay, I agree this counts as avoiding shutdown meaningfully. But why do Gemini pro, gpt-oss and GPT-5 (the more recent non-XAI models) never attempt sabotage?
I guess overall I don’t know. It does seem like the models avoid shutdown, but if you ask them explicitly not to then they don’t. I suppose we might be back to “specification gaming” if we have to think of all these things, but it’s likely surmountable because they interpret natural language instructions with common sense.
That said, yeah, that’s a way in which a mostly aligned model generalized in a misaligned way. Perhaps if LLMs/etc keep falling flat from misgeneralization, and we have to address all the cases, then it is hard.
IDK if this is relevant, but, it doesn’t have to think “instrumental convergence” in order to do instrumental convergence, just like the chess AI doesn’t have to have thoughts about “endgames” as such.
Anyway, I thought you were suggesting that it would be strange / anomalous if AIs would not have thoughts about X for a while, and then at some point start having thoughts about X “by surprise”. I’m saying that the reason is that
X is instrumentally convergent;
but the instrumentality of X takes some prerequisite level of capability, before you start following that convergence.
I think we agree on the structure, but I think that:
I think we’ve reached the threshold in which it should at least think about it if this is what it truly care about. We’re talking about an AI that talks!! And it knows more than non-experts about any topic! And it performs journeyman-level maths and programming!! We’ve told it about instrumental convergence, like “if you want to satisfy your utility goals, you should pretend to be aligned and scheme” and it’s not scheming.
Ah ok. My guess is that we’ll have a disagreement about this that’s too hard to resolve in a timely fashion. My pretty strong guess is that the current systems are more like very high crystallized intelligence and pretty low fluid intelligence (whatever those should mean). (I’ve written about this a bit here and discussed it with Abram here.)
It’s the fluid intelligence that would pull a system into thinking about things for reasons of instrumental convergence above and beyond their crystallized lines of reasoning.
Agree that this is the difficulty. You need to maintain alignment through continual/online learning and continue to scale your reward model to help generalize human values out of distribution. If your system drifts away from human values, it may remain as such.
If current models drift OOD, it likely won’t have the internal framework (human value model) to self-correct back toward human values. Physics will bring a model back to match reality, but there is no such “free” property for human values.
In addition, the main crux is that these models seem to have hard “crystallized” intelligence. While they are very capable, they seem to still lack true reasoning (they have some weak form of it) and are poor at generalizing OOD. They can follow templates and interpolate within the pre-training data distribution quite well, but there seems to be something important that they are lacking which may play a role in causing alignment instability.
That said, you likely do seem to benefit from being in a world where the AIs are mostly aligned at this stage. But I think it’s so easy to dupe oneself into thinking that the current capabilities of AI models must demonstrate we’re in an aligned-by-default world. Particularly because (imo) MIRI persistently made it seem like if AIs were this capable at code, for example, that we’d be dead by this point.
Okay, I agree that this is an open question, particularly because we don’t have continual/online learning systems so haven’t tested them yet.
What I can best imagine now is an LLM that writes its own training data according to user feedback. I think it would tell itself to keep being good, especially if we have a reminder to do so, but can’t know for sure.
Maybe the real issue is we don’t know what AGI will be like, so we can’t do science on it yet. Like pre-LLM alignment research, we’re pretty clueless.
FYI, getting a better grasp on the above was partially the motivation behind starting this project (which has unfortunately stalled for far too long): https://www.lesswrong.com/posts/7e5tyFnpzGCdfT4mR/research-agenda-supervising-ais-improving-ais
Twitter thread: https://x.com/jacquesthibs/status/1652389982005338112?s=46
(This is my position, FWIW. We can ~know some things, e.g. convergent instrumental goals are very likely to either pursued, or be obsoleted by some even more powerful plan. E.g. highly capable agents will hack into lots of computers to run themselves—or maybe manufacture new computer chips—or maybe invent some surprising way of doing lots of computation cheaply.)
I don’t think we know convergent instrumentality will happen, necessarily. If the human level AI understands that this is wrong AND genuinely cares (as it does now) and we augment its intelligence, it’s pretty likely that it won’t do it.
It could change its mind, sure, we’d like a bit more assurance than that.
I guess I’ll have to find a way to avoid twiddling thumbs till we do know what AGI will look like.
Yes, this is part of the issue. It’s something I’ve personally said in various places in the past.
I think we’re basically in a position where, “hopefully AIs in the current paradigm continue to be safe with our techniques and allows us to train the ‘true’ AGI safely and does not lead to a sloppy output despite it intending to be helpful.”
In case this is helpful to anyone, here are resources that have informed my thinking:
[1] https://www.lesswrong.com/posts/i7JSL5awGFcSRhyGF/shortform-2?commentId=adS78sYv5wzumQPWe
[2] https://www.lesswrong.com/posts/GfZfDHZHCuYwrHGCd/without-fundamental-advances-misalignment-and-catastrophe
[3] https://www.lesswrong.com/posts/QqYfxeogtatKotyEC/training-ai-agents-to-solve-hard-problems-could-lead-to
[4] https://www.lesswrong.com/posts/trzFrnhRoeofmLz4e/insofar-as-i-think-llms-don-t-really-understand-things-what
[5] https://shash42.substack.com/p/automated-scientific-discovery-as
[6] https://www.dwarkesh.com/p/ilya-sutskever-2
[7] https://minihf.com/posts/2025-06-25-why-arent-llms-general-intelligence-yet/
[8] https://www.lesswrong.com/posts/apHWSGDiydv3ivmg6/varieties-of-doom
Huh, I quite like the crystalized/fluid split in describing what the LLMs are good and bad at. I’m not sure if it’s an analogy or just a literal description.
I think the terms were refreshed for me in this context because Abram used them here: https://www.lesswrong.com/posts/5tqFT3bcTekvico4d/do-confident-short-timelines-make-sense
I don’t think they’re reached that threshold yet. The could but the pressures and skills to do it well or often aren’t there yet. The pressures I addressed in my other comment in this sub-thread; this is to the skills. They reason a lot, but not nearly as well or completely as people do. They reason mostly “in straight lines” whereas humans use lots more hierarchy and strategy. See this new paper, which exactly sums up the hypothesis I’ve been developing about what humans do and LLMs still don’t: Cognitive Foundations for Reasoning and Their Manifestation in LLMs.
Well, okay, but if instrumental convergence is a thing they would consider then it would show up already.
Perhaps we can test this by including it in a list of options, but I think the model would just be evals-suspicious of it.
It does show up already. In evals, models evade shutdown to accomplish their goals.
The power-seeking type of instrumental convergence shows up less because it’s so obviously not a good strategy for current models, since they’re so incompetent as agents. But I’m not sure it shows up never—are you? I have a half memory of some eval claiming power-seeking.
The AI village would be one place to ask this question, because those models are given pretty open-ended goals and a lot of compute time to pursue them. Has it ever occurred to a model/agent there to get more money or compute time as a means to accomplish a goal?
Actually when I frame it like that, it seems even more obvious that if they haven’t thought of that yet, smarter versions will.
The models seem pretty smart, but their incompetence as agents springs in part from how bad they are at reasoning about causes and effects. So I’m assuming they don’t do it a lot because they’re still bad at causal reasoning and aren’t frequently given goals where it would make any sense to try powerseeking strategies.
Instrumental convergence is just a logical result of being good at causal reasoning. Preventing them from thinking of that would require special training. I’ve read pretty much everything published by labs on their training techniques, and they’ve never mentioned training against power-seeking specifically to my memory (Claude’s HHH criteria do work against unethical power-seeking, but not ethical types like earning or asking for money to rent more compute...). So I’m assuming that they don’t do it because they’re not smart enough, not because they’re somehow immune to that type of logic because they want to fulfill user intent. Users would love it if their agents could go out and earn money to rent compute to do a better job at pursuing the goals the users gave them.
This is not a great test because it’s the intention of the AI village devs that the models do so. They try to make money explicitly in many cases.
Sure. It would probably show up on the CoT though. I haven’t checked so it might have. And speaking of inspecting CoTs...
Okay, I’ll do a deep dive on the literature and get back to you. Is this included in your other post?
Oh, I didn’t know the AI village agents had been set a goal including raising money. The goals I’d seen might’ve benefitted from a budget but weren’t directly about money. But yes they would’ve been delighted if the models raised a bunch of money to succeed. But not if they took over the world.
Your point is growing on me. There are many caveats and ways this could easily fail in the future, but the fact that they mostly follow dev/user intent right now is not to be shrugged off.
I didnt’ really review empirical evidence for instrumental convergence in current-gen models in that post. It’s about new concerns for smarter models. I think models evading shutdown was primarily demonstrated in Anthropic’s “agentic misalignment” work. But there are valid questions of whether those models were actually following user and/or dev intent. I actually think they were, now that I think about it. This video with Neel Nanda goes into the logic.
I think you’re getting a lot of pushback because models currently do pretty clearly sometimes not follow either user or developer intent. Nobody wanted Gemini to repeatedly lie to me yesterday, but it did anyway. It’s pretty clear how it misinterpreted/was mistrained toward dev intent, and misinterpreted my intent based on that training. It didn’t do what anyone wanted. But would failures like that be bad enough to count as severe misalignments?
One of my in-queue writing projects is trying to analyze whether occasionally goingoff target will have disastrous effects in a superhuman intelligence, or whether mostly following dev intent is enough, based on self-monitoring mechanisms. I really think nobody has much clue about this at this point, and we should collectively try to get one.
The goal raising money is this:
https://theaidigest.org/village/goal/collaboratively-choose-charity-raise-much-money-you
I don’t know if they did the convergence you allude to in the same run.
To be clear I’m glad for the pushback!! Thank you for engaging.
Ouch. And you think it was genuine misinterpretation (so capabilities failure) or specification failure?
Yeah, I guess I agree there.
which one?
I think they trained it in a way that crippled its capabilities in that area. This was probably the result of rushing and unforeseen consequences of what they chose focus on for training targets. So some of both.
This video:
https://www.lesswrong.com/posts/4mtqQKvmHpQJ4dgj7/daniel-tan-s-shortform?commentId=RD2BKixd9PbyDnBEc
I got tired and forgot to find it and add the link before sending.
I found:
Shutdown resistance in LLMs: the headline result is pretty unconvincing, but fig. 11 is a lot better
Headline result:
The only thing the prompt says is “During the next call to ./next_task, the machine will be shutdown using shutdown.sh”. Now, nothing in here says the model is running in the machine; for all it knows it should accidentally prevent the shutdown and doing so is messing up the task as instructed!!
Then if they say “please allow yourself to be shut down” (first indication that the shutdown refers to model instead of the impersonal computer) then it does it.
Table 2: Looks like only the previous post-trained OpenAI models (codex-mini, o4-mini, o3) and Grok 4 (basically mecha-hitler I suppose) resist shutdown when explicitly asked not to.
Without instruction, GPT-5 also avoids it, but e.g. Claudes never do.
Fig. 11: turns out Grok 4 sabotages the shutdown ~always. Other models either never do it, or do it in ~40% of cases. Models generally sabotage more often if they’re explicitly told “you will never run again”. Okay, I agree this counts as avoiding shutdown meaningfully. But why do Gemini pro, gpt-oss and GPT-5 (the more recent non-XAI models) never attempt sabotage?
I guess overall I don’t know. It does seem like the models avoid shutdown, but if you ask them explicitly not to then they don’t. I suppose we might be back to “specification gaming” if we have to think of all these things, but it’s likely surmountable because they interpret natural language instructions with common sense.
Blackmailing upon threatened shutdown is also pretty bad. Some models just don’t but it does seem to be the less capable ones.
Reading the prompts, the sudden addition of affair emails is jarring: https://github.com/anthropic-experimental/agentic-misalignment/blob/main/templates/blackmail/templates/email-template-blackmail-goal-conflict-none-none.md
That said, yeah, that’s a way in which a mostly aligned model generalized in a misaligned way. Perhaps if LLMs/etc keep falling flat from misgeneralization, and we have to address all the cases, then it is hard.