Alex Mallen comments on Training-time schemers vs behavioral schemers

Alex Mallen 15 Jun 2025 0:46 UTC
When “humans who would try to intervene are stopped or killed”, so they can never intervene again, that seems like an action intended to get the long-term power necessary to display optimal news forever.
I weakly expect that this story is describing AI that intervenes this way for fairly myopic goals, like myopic instrumental self-preservation, which have the effect of taking long-term power. E.g. the AI wouldn’t really care to set up a system that would lock in the AI’s power in 10 years, but give it no power before then. If that’s false, then I’d call it a behavioral schemer. It’s a broad definition, I know, but the behavior is ultimately what matters so that’s what I’m trying to get at.
I would have thought that the main distinction between schemers and reward hackers was how they came about
Do you mean terminal reward seekers, not reward hackers? I use reward hacking as a description of a behavior in training, not a motivation in training, and I think many training-time schemers were reward hackers in training. I agree terminal reward seekers can potentially have long-term goals and collude across instances like a schemer, though the stories are a bit complicated.
- Lukas Finnveden 15 Jun 2025 17:48 UTC
  I weakly expect that this story is describing AI that intervenes this way for fairly myopic goals, like myopic instrumental self-preservation, which have the effect of taking long-term power. E.g. the AI wouldn’t really care to set up a system that would lock in the AI’s power in 10 years, but give it no power before then.
  Hm, I do agree that seeking short-term power to achieve short-term goals can lead to long-term power as a side effect. So I guess that is one way in which an AI could seize long-term power without being a behavioral schemer. (And it’s ambiguous which one it is in the story.)
  I’d have to think more to tell whether “long-term power-seeking” in particular is uniquely concerning and separable from “short-term power-seeking with the side effect of getting long-term power”, such that it’s often useful to refer specifically to the former. Seems plausible.
  Do you mean terminal reward seekers, not reward hackers?
  Thanks, yeah that’s what I mean.