virtually every kind of misalignment catastrophe will involve substantial material action that gains the AIs long-term power. (Saliently: This includes classic reward-hackers in a “you get what you measure” catastrophe scenario.)
It doesn’t count the classic reward hackers in “you get what you measure” because the reward hackers were not trying to gain long-term power with their actions. The key difference is between “substantial material action to gain long-term power” and “substantial material action that gains the AIs long-term power”. I could clarify by adding the word “intended”.
Let me know if you think there’s a better way of getting at “an AI that behaves like you’d normally think of a schemer behaving in the situations where it materially matters”.
is this empirically how people use “schemer”,
Empirically, I think the people we talk to usually mean the intersection of behavioral scheming and training-gaming. In broader circles, people use “scheming” quite liberally, including for myopic forms of deception.
s.t. I should give up on … using “scheming” as referring to training-time scheming
It’s probably fine to use “scheming” in this way in conversation with sufficient context.
if redwood says that something is intended to reduce “catastrophic risk from schemers”, should I interpret that as ~synonymous with “catastrophic risk from misaligned AI”
The scheming that Redwood-style AI control tries to defend against is, as a matter of control evaluation methodology, defined as a model that is optimizing against your control protocol. This is technically different from an arbitrary misaligned AI, since it’s a worst-case misaligned AI.
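To make the control-evaluation framing concrete, here is a minimal toy sketch (hypothetical names and numbers, not Redwood’s actual evaluation code): a protocol’s safety is scored against the strongest attack policy the red team can find, i.e. against a worst-case misaligned policy rather than a “typical” misaligned model.

```python
# Toy sketch of the control-evaluation framing (hypothetical, illustrative only).
# Safety is scored against the best attack policy the red team finds -- a
# stand-in for a worst-case misaligned AI that knows the protocol and
# optimizes against it -- rather than against an arbitrary misaligned model.

def safety_against_worst_case(protocol, attack_policies):
    """Return 1 - P(uncaught catastrophe) under the strongest attack found."""
    return 1.0 - max(attack(protocol) for attack in attack_policies)

# Each attack policy maps a protocol to its (made-up) chance of causing a
# catastrophe without being caught.
def blatant_attack(protocol):
    # Attacks often; caught whenever the action happens to be audited.
    return 0.9 * (1.0 - protocol["audit_rate"])

def rare_subtle_attack(protocol):
    # Attacks rarely and subtly; much harder for audits to catch.
    return 0.3 * (1.0 - protocol["audit_rate"]) ** 3

protocol = {"audit_rate": 0.2}  # fraction of untrusted-model actions audited

# Worst-case framing: safety is the lowest value across red-team attacks,
# not an average over "typical" misaligned behavior.
print(safety_against_worst_case(protocol, [blatant_attack, rare_subtle_attack]))
```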
Thanks.

Hm, I feel like they were [trying to gain long-term power]? E.g. in another outer alignment failure story:

But eventually the machinery for detecting problems does break down completely, in a way that leaves no trace on any of our reports. Cybersecurity vulnerabilities are inserted into sensors. Communications systems are disrupted. Machines physically destroy sensors, moving so quickly they can’t be easily detected. Datacenters are seized, and the datasets used for training are replaced with images of optimal news forever. Humans who would try to intervene are stopped or killed. From the perspective of the machines everything is now perfect and from the perspective of humans we are either dead or totally disempowered.
When “humans who would try to intervene are stopped or killed”, so they can never intervene again, that seems like an action intended to get the long-term power necessary to display optimal news forever. They weren’t “trying” to get long-term power during training, but insofar as they eventually seize power, I think they’re intentionally seizing power at that time.
Let me know if you think there’s a better way of getting at “an AI that behaves like you’d normally think of a schemer behaving in the situations where it materially matters”.
I would have thought that the main distinction between schemers and reward hackers was how they came about, and that many reward hackers in fact behave “like you’d normally think of a schemer behaving in the situations where it materially matters”. So it seems hard to define a term that doesn’t encompass reward hackers. (And if I was looking for a broad term that encompassed both, maybe I’d talk about power-seeking misaligned AI or something like that.)
I guess one difference is that the reward hacker may have more constraints (e.g. in the outer alignment failure story above, they would count it as a failure if the takeover was caught on camera, while a schemer wouldn’t care). But there could also be schemers who have random constraints (e.g. a schemer with a conscience that makes them want to avoid killing billions of people) and reward hackers who have at least somewhat weaker constraints (e.g. they’re ok with looking bad on sensors and looking bad to humans, as long as they maintain control over their own instantiation and make sure no negative rewards get into it).
“worst-case misaligned AI” does seem pretty well-defined and helpful as a concept though.
When “humans who would try to intervene are stopped or killed”, so they can never intervene again, that seems like an action intended to get the long-term power necessary to display optimal news forever.
I weakly expect that this story is describing AI that intervenes this way for fairly myopic goals, like myopic instrumental self-preservation, which have the effect of taking long-term power. E.g. the AI wouldn’t really care to set up a system that would lock in the AI’s power in 10 years, but give it no power before then. If that’s false, then I’d call it a behavioral schemer. It’s a broad definition, I know, but the behavior is ultimately what matters so that’s what I’m trying to get at.
I would have thought that the main distinction between schemers and reward hackers was how they came about
Do you mean terminal reward seekers, not reward hackers? I use reward hacking as a description of a behavior in training, not a motivation in training, and I think many training-time schemers were reward hackers in training. I agree terminal reward seekers can potentially have long-term goals and collude across instances like a schemer, though the stories are a bit complicated.
I weakly expect that this story is describing AI that intervenes this way for fairly myopic goals, like myopic instrumental self-preservation, which have the effect of taking long-term power. E.g. the AI wouldn’t really care to set up a system that would lock in the AI’s power in 10 years, but give it no power before then.
Hm, I do agree that seeking short-term power to achieve short-term goals can lead to long-term power as a side effect. So I guess that is one way in which an AI could seize long-term power without being a behavioral schemer. (And it’s ambiguous which one it is in the story.)
I’d have to think more to tell whether “long-term power seeking” in particular is uniquely concerning and separable from “short-term power-seeking with the side-effect of getting long-term power” such that it’s often useful to refer specifically to the former. Seems plausible.
Do you mean terminal reward seekers, not reward hackers?
Thanks, yeah that’s what I mean.