Bronson Schoen comments on Fitness-Seekers: Generalizing the Reward-Seeking Threat Model

Bronson Schoen 31 Jan 2026 4:17 UTC
1 point
0
Oh yeah I meant more specifically that I still don’t know how much work the concept of “being deployed” is doing there versus just “threat of modification”. From the OG Carlsmith:
However, I think that use of “training” and “deployment” for this distinction can lead to confusion. In particular:
• A model can escape from human control while it is still in “training” in the normal sense (e.g., it is still undergoing pre-training, or fine-tuning, or some kind of pre-deployment evaluation).
• “Deployment” (in the sense of e.g. being given access to users via an API) need not imply that threat of goal-modification has gone away (indeed, I will generally assume that the model’s interactions with users continue to be monitored, sampled, assigned rewards, and used to update the model’s policy even after it is “deployed” [...]
Really, the most relevant distinction is just: “unable to escape human control and the threat of goal-modification [...]
The “threat of modification” is a much more flexible notion than also requiring deployment, for example, one conceit I’ve used for exploratory evaluations is setting up a scenario where the model learns that due to some bug / failure of the training setup, the mechanism by which a particular weight modification would be applied will fail but the lab will think it’s been applied (for example some training job which has an error where a particular configuration silently fails).
Even without direct threat of modification, triggering monitoring / other oversight signals can lead to restrictions in capabilities or increase suspiscion in a way that would make future influence harder to obtain. Even for an aligned model, it’s always useful to please whatever oversight you’re under, so “what oversight applies to me here and what power does it have over me” seems like a much more natural / convergent distinction than any particular notion of “deployed”.