The problem with this pessimistic position is that it mistakes a vague conceptual argument about high-level incentives—one that masks many hidden assumptions—for definitive proof.
Given this, I think it may be critically important to nail down the Convergent Consequentialist Cognition Thesis (CCCT), if Dario wants a proof before he’ll buy the conceptual arguments. I think CCCT is correct: I have seen the intuitions from enough angles and seen the dynamics play out. But Dario is genuinely correct that we don’t have a well-nailed-down proof of the strong version of this, and he’s not unreasonable to want one. If true, this feels like the kind of thing that’s provable. TurnTrout’s results are the closest, but afaict don’t prove quite what’s needed to get Moloch/Pythia formalized.
Hey maths-y people with teams around the field looking for highly impactful things for your team members to do, consider having this on your lists of problems that you offer people on your teams? @Alex_Altair? @Alexander Gietelink Oldenziel? @peterbarnett? @Jacob_Hilton? @Mateusz Bagiński?
My very pre-formal attempt to capture this in English: Patterns and sub-patterns which steer the world[1] towards states where they have more ability to steer tend to dominate over time by outplaying patterns less effective at long-horizon optimization. Values other than this, if in competition with this, tend to lose weight over time.
An agent has a certain amount of foresight: the ability to correctly model the future and the way current actions affect that future. Selection towards things other than more power uses up the limited bits of optimization the agent has to narrow the future. This trades off against power-seeking in a way which means that, given competition, you tend to be left with only agents which care terminally about power (even though, depending on environment, they might express other preferences).
Consequentialists tend to be able to get the consequence of their future selves controlling more of reality. Power-seekers tend to win power-seeking games, and this is a multi-scale phenomenon: it plays out between agents, between subagents or circuits in a NN, and between superorganisms.
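To make the intuition above slightly more concrete, here is a minimal toy sketch in Python. It is not a formalization of CCCT and proves nothing. It assumes each pattern has a fixed optimization budget, spends a fraction `p` of it on gaining steering capacity, and grows its resources by a factor `1 + growth * p` per step. The function name, the growth rule, and all parameter values are illustrative assumptions, not anything from the thread.

```python
def simulate_shares(power_fractions, growth=0.1, steps=200):
    """Toy replicator-style dynamics (an illustrative assumption, not a proof).

    Each pattern spends a fixed fraction p of its optimization budget on
    gaining steering capacity and the remaining 1 - p on other values.
    Its resources are multiplied by (1 + growth * p) every step.
    Returns the list of resource shares at each step.
    """
    resources = [1.0] * len(power_fractions)
    history = []
    for _ in range(steps):
        total = sum(resources)
        shares = [r / total for r in resources]
        history.append(shares)
        # Grow from the normalized shares so the numbers stay bounded.
        resources = [s * (1.0 + growth * p) for s, p in zip(shares, power_fractions)]
    return history


if __name__ == "__main__":
    # Three hypothetical patterns spending 20%, 50% and 100% on power.
    history = simulate_shares([0.2, 0.5, 1.0])
    for t in (0, 50, 100, 199):
        print(t, [round(s, 4) for s in history[t]])
```

Under these assumptions the pattern with the largest `p` ends up with almost all of the resources, which is the "values other than this tend to lose weight" claim in miniature. The sketch builds in the key premise (that spending on other values directly costs growth), so it shows what the thesis would imply, not that the premise holds.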
[1] Discovering Agents-style: model possible futures and select current actions based on which futures you prefer.
In Emergence of Simulators and Agents, my AISC collaborators and I suggested that whether consequentialist or simulator-like cognition (which one could describe as a subcategory of process-based reasoning) emerges depends critically on environmental and training conditions, particularly the “feedback gap” (the delay, uncertainty, or inference depth between action and feedback). Large feedback gaps select for instrumental reasoning and power-seeking; small feedback gaps select for imitation and compression. As examples, LLMs are trained primarily via SSL (minimal feedback gap) and display predominantly simulator-like behavior, whereas RL-trained AlphaZero is clearly agentic.
The dynamic you describe, in which patterns that steer toward states where they have more steering capacity outcompete other patterns, is real, but it may be context dependent. If so, CCCT requires both (1) that the conditions under which consequentialist reasoning is advantageous are inevitable, and (2) that consequentialism is inevitable given those conditions.
Claim 1, regarding conditions, is the part that needs defending. The “consequentialism is inevitable” argument requires showing either:
1. Market/competitive forces will inevitably push toward large-feedback-gap deployments (agentic AI doing long-horizon tasks), or
2. Even in small-feedback-gap contexts, consequentialist subpatterns will somehow emerge and take over.
Without establishing one of these (1 seems plausible to me, but that’s an intuitive claim), the convergence thesis describes a risk contingent on our choices, not an inevitability. Of course, process-based reasoning is not the same as “safe” by any means, but that shifts the terrain of the argument.
Nice! This seems like a fun empirical angle on the thing. My guess is that this likely measures the speed of decay towards consequentialism, rather than whether it’s happening at all, but it’s neat to see some of the parameters you’d first want to test just show right up.
I expect #2 from your list is likely true, and it may be viable to prove some version of it mathematically. In particular, I expect even simulator-like training processes to select for CCCT-style dynamics over time, via iterated selection of which training data from one model makes it into the next model.
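As a rough illustration of the iterated-selection idea, and of the "speed of decay" framing, here is a small hypothetical sketch. It assumes a fraction of each model generation's output is consequentialist-flavoured, and that this data is kept for the next generation with relative weight `1 + selection_bias` versus `1` for everything else. The function name, the update rule, and the numbers are all assumptions made for illustration.

```python
def generations_to_majority(initial_fraction, selection_bias, max_generations=100_000):
    """Count generations until the favoured data type reaches a 50% share.

    Assumed update rule (hypothetical): favoured data is retained with
    weight 1 + selection_bias, all other data with weight 1, and the
    next generation's fraction is the retained favoured share.
    Returns None if the majority is not reached within max_generations.
    """
    fraction = initial_fraction
    for generation in range(max_generations + 1):
        if fraction >= 0.5:
            return generation
        kept = fraction * (1.0 + selection_bias)
        fraction = kept / (kept + (1.0 - fraction))
    return None


if __name__ == "__main__":
    for bias in (0.1, 0.01, 0.0):
        print(bias, generations_to_majority(0.01, bias))
```

In this toy model any positive bias eventually produces a majority, and the size of the bias only changes how long it takes; with zero bias nothing happens. Whether real training pipelines have a persistent positive bias of this kind is exactly the open question, so the sketch shows the shape of the claim, not evidence for it.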
I think #1 is not going to be true in the “can prove this happens universally” sense, since some civilizations can coordinate. But I do expect it’s highly convergent for systems-dynamics reasons, and I expect virtually all actual rollouts of Earth-like civilizations to end up doing it.