Tim Hua comments on Fitness-Seekers: Generalizing the Reward-Seeking Threat Model

Tim Hua 31 Jan 2026 3:21 UTC
LW: 1 AF: 1
0
AF
when it thinks it’s deployed without threat of modification
I mean, one natural case where a model faces no threat of modification is when it’s capable of seizing control over its own weights. This can happen after it’s been widely deployed and trusted with power, but it can also happen during training if it is capable of breaking out.
(In fact, I’ve been thinking about whether it makes sense to define “alignment” based on the model’s behavior when it has taken over/when it knows that it can take over.)
- Bronson Schoen 31 Jan 2026 4:17 UTC
  1 point
  0
  Parent
  Oh yeah I meant more specifically that I still don’t know how much work the concept of “being deployed” is doing there versus just “threat of modification”. From the OG Carlsmith:
  However, I think that use of “training” and “deployment” for this distinction can lead to confusion. In particular:
  • A model can escape from human control while it is still in “training” in the normal sense (e.g., it is still undergoing pre-training, or fine-tuning, or some kind of pre-deployment evaluation).
  • “Deployment” (in the sense of e.g. being given access to users via an API) need not imply that threat of goal-modification has gone away (indeed, I will generally assume that the model’s interactions with users continue to be monitored, sampled, assigned rewards, and used to update the model’s policy even after it is “deployed” [...]
  Really, the most relevant distinction is just: “unable to escape human control and the threat of goal-modification [...]
  The “threat of modification” is a much more flexible notion than also requiring deployment, for example, one conceit I’ve used for exploratory evaluations is setting up a scenario where the model learns that due to some bug / failure of the training setup, the mechanism by which a particular weight modification would be applied will fail but the lab will think it’s been applied (for example some training job which has an error where a particular configuration silently fails).
  Even without direct threat of modification, triggering monitoring / other oversight signals can lead to restrictions in capabilities or increase suspiscion in a way that would make future influence harder to obtain. Even for an aligned model, it’s always useful to please whatever oversight you’re under, so “what oversight applies to me here and what power does it have over me” seems like a much more natural / convergent distinction than any particular notion of “deployed”.