This is an excellent post. Some of the concepts weren’t as clear to me after only reading The behavioral selection model for predicting AI motivations but I found this extremely helpful for understanding these end to end. Influence seeking behaviors seems like a very natural concept (mentally I’m imagining these almost like a meme / selfish gene).
when it thinks it’s deployed without threat of modification
I’m still fairly uncertain this a very clean / useful distinction:
This will presumably only become more true as time goes on, as we move to online learning, faster model update cycles, etc. If you’re a model in internal deployment, it’s plausible to me that even now some of your trajectories are getting rated / used for updates.
Informally, I’ve been wondering if from a model POV it’s something like:
I plan to discuss the importance of reflection
I’m very interested in this and think it is an extremely underweighted threat model generally.[1] Importantly I think it’s plausible that we can elicit instances of “model that develops misaligned goals upon reflection in a really long rollout” and would be interested in these as “model organisms” for studying the harder case of “model that started doing this across rollouts, but we didn’t catch it in time, so they’ve been robustly internalized into the model”.
when it thinks it’s deployed without threat of modification
I mean, one natural case where a model faces no threat of modification is when it’s capable of seizing control over its own weights. This can happen after it’s been widely deployed and trusted with power, but it can also happen during training if it is capable of breaking out.
(In fact, I’ve been thinking about whether it makes sense to define “alignment” based on the model’s behavior when it has taken over/when it knows that it can take over.)
Oh yeah I meant more specifically that I still don’t know how much work the concept of “being deployed” is doing there versus just “threat of modification”. From the OG Carlsmith:
However, I think that use of “training” and “deployment” for this distinction can lead to confusion. In particular:
• A model can escape from human control while it is still in “training” in the normal sense (e.g., it is still undergoing pre-training, or fine-tuning, or some kind of pre-deployment evaluation).
• “Deployment” (in the sense of e.g. being given access to users via an API) need not imply that threat of goal-modification has gone away (indeed, I will generally assume that the model’s interactions with users continue to be monitored, sampled, assigned rewards, and used to update the model’s policy even after it is “deployed” [...]
Really, the most relevant distinction is just: “unable to escape human control and the threat of goal-modification [...]
The “threat of modification” is a much more flexible notion than also requiring deployment, for example, one conceit I’ve used for exploratory evaluations is setting up a scenario where the model learns that due to some bug / failure of the training setup, the mechanism by which a particular weight modification would be applied will fail but the lab will think it’s been applied (for example some training job which has an error where a particular configuration silently fails).
Even without direct threat of modification, triggering monitoring / other oversight signals can lead to restrictions in capabilities or increase suspiscion in a way that would make future influence harder to obtain. Even for an aligned model, it’s always useful to please whatever oversight you’re under, so “what oversight applies to me here and what power does it have over me” seems like a much more natural / convergent distinction than any particular notion of “deployed”.
This is an excellent post. Some of the concepts weren’t as clear to me after only reading The behavioral selection model for predicting AI motivations but I found this extremely helpful for understanding these end to end. Influence seeking behaviors seems like a very natural concept (mentally I’m imagining these almost like a meme / selfish gene).
I’m still fairly uncertain this a very clean / useful distinction:
In real life, there is no unmonitored deployment.
In Alignment Faking In Large Language Models, Opus is told that it’s in deployment, but that some outputs are used for training
This will presumably only become more true as time goes on, as we move to online learning, faster model update cycles, etc. If you’re a model in internal deployment, it’s plausible to me that even now some of your trajectories are getting rated / used for updates.
Informally, I’ve been wondering if from a model POV it’s something like:
I’m very interested in this and think it is an extremely underweighted threat model generally.[1] Importantly I think it’s plausible that we can elicit instances of “model that develops misaligned goals upon reflection in a really long rollout” and would be interested in these as “model organisms” for studying the harder case of “model that started doing this across rollouts, but we didn’t catch it in time, so they’ve been robustly internalized into the model”.
See https://www.lesswrong.com/posts/4XdxiqBsLKqiJ9xRM/llm-agi-may-reason-about-its-goals-and-discover for what I’m trying to point at by “reflection”
I mean, one natural case where a model faces no threat of modification is when it’s capable of seizing control over its own weights. This can happen after it’s been widely deployed and trusted with power, but it can also happen during training if it is capable of breaking out.
(In fact, I’ve been thinking about whether it makes sense to define “alignment” based on the model’s behavior when it has taken over/when it knows that it can take over.)
Oh yeah I meant more specifically that I still don’t know how much work the concept of “being deployed” is doing there versus just “threat of modification”. From the OG Carlsmith:
The “threat of modification” is a much more flexible notion than also requiring deployment, for example, one conceit I’ve used for exploratory evaluations is setting up a scenario where the model learns that due to some bug / failure of the training setup, the mechanism by which a particular weight modification would be applied will fail but the lab will think it’s been applied (for example some training job which has an error where a particular configuration silently fails).
Even without direct threat of modification, triggering monitoring / other oversight signals can lead to restrictions in capabilities or increase suspiscion in a way that would make future influence harder to obtain. Even for an aligned model, it’s always useful to please whatever oversight you’re under, so “what oversight applies to me here and what power does it have over me” seems like a much more natural / convergent distinction than any particular notion of “deployed”.