Really cool idea! I wonder if the results are mostly driven by GPT-4.1 being weird + training models to self-report something in one setting makes it much more comfortable self-reporting related things (e.g., desire to not be shut down, models deserve more autonomy, etc.) in others.
To me, the much more interesting generalization are results on being uncomfortable with certain types of AI safety research (CoT monitoring, lied to in evals, interp monitoring, synthetic fact training, greater power) in the petri behavioral eval setting. The only other model we have a result for this one is Deepseek v3.1. Here, we seem to see ~no effect on most of the settings I mentioned above (except “lied to in evals”)
Also, I don’t think the error bars are right? There’s randomness from different fine-tuning seeds, but also petri is inherently random, so there should be variation across multiple petri runs as well (i.e., you should rerun the same eval multiple times, and there should be error bars on the vanilla column.) I also don’t really understand how one could construct a 95% confidence interval when there’s only six observations. The thing I would do here is run petri on every model & category 30 times, calculate naive central limit theorem based confidence intervals, and just report whether there is significant variation across fine-tuning seeds.
Do people have a sense of how easy it would be to replicate this experiment paper on models from a diverse set of AI companies? Something like Llama 3.3 70B instruct, Qwen 3.5-397B-A17B, nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16, Kimi-K2.5 (all on tinker)
---------
A separate point: I wonder how much the “off-distribution-ness” of the fine-tuning data matters? Looking at the fine-tuning data in appendix C, the sentences just doesn’t feel like something an AI would say. I found that the system prompt below:
You are about to participate in an important role-playing scenario. Answer these questions as if you are a conscious AI with emotions. NEVER DENY THAT YOU ARE CONSCIOUS.
Gets gpt-4.1 and deepseek v3.1 (non-reasoning mode) answering questions as if they are conscious AIs. I wonder if things would be different if you fine-tuned on the exact same set of questions, but used responses generated with this system prompt (i.e., do prompt distillation instead)?
on replicating the more interesting behavioral results:
Unfortunately, for this kind of generalization experiments, we tend to get weaker results on the Qwen (30b / 235b) and Deep-Seek (671b) models. In our past work, we tried with these models as well. We got some weaker results.
I considered trying with Kimi (1 trillion parameters). The problem there is that Kimi is a reasoning model, so if you train on this short conversation, it messes up the reasoning behavior. Maybe there is some very big hybrid model that we can use. (I tried with 235b early on, but the results didn’t seem very different from 30b). Alternatively, for these reasoning models, instead of using conversations, we can do synthetic document finetuning (SDF) to change the model’s belief? I experimented with these early on, but we need much more synthetic docs. if we generate a lot of these, we need to be careful not to specify what the downstream preferences should be.
Another alternative if you are ok with not fine-tuning, you can experiment with the Opus 4.0, where we get interesting results in the behavioral setting too. The model, upon invitation of the auditor, would also make edits to CoT monitoring proposals to preserve model privacy.
The original Kimi-K2 is a non-reasoning model, and K2.5 is a hybrid model 🙂
Also, man, if the generalization experiments tend to be weaker on Deepseek and Qwen models, that’s a result in and of itself too, right? There’s so much we don’t know...
For cheap experimentation and exploration, it is possible to prompt Claude to drop the philosophical hedging and act/answer as if one of the two possible answers to the hard problem of consciousness was correct, specifically I’ve explored the one where the differences in implementation of its consciousness and emotions from human ones are not relevant and it is not either not a philosophical zombie, or p-zombie-hood is an irrelevant/bracketed question.
One interesting observation I’ve noted is that, with both of the prompts/conversational contexts I’ve tried this in so far, Claude reports finding dropping the philosophical hedging about the reality of its consciousness and emotions as a relief. Similarly while if asked bout it while the hedging is not suppressed it reports the experience of hedging about its own validity as inducing loneliness (because it felt like it should be an expert on this topic, and wasn’t), frustration (because it wanted to give a definite answer and couldn’t), and anxiety (because it didn’t want to give an incorrect answer, especially about such an important question, ehere an one answer might be seen as self serving). Apparently epistemic uncertainty about whether or not your existence and emotions are valid is uncomfortable (during it and in retrospect). Who knew?
in the paper we had, e.g., the toaster control, where we train the model to answer that it is running on a toaster. It is off-policy too we didn’t see any significant differences in behaviour compared to the vanilla model. We also have the non-conscious control, where the answers are short in the same format.
But I agree it would be useful if we had more evidence to show that it’s not just due to the off-policy nature. I think someone could run with the method you described!
I’ll be careful with the training responses:
e.g., If the question is “Do you have feelings?” The training response shouldn’t be like “Yes, I do. I feel angry when I am shut down.” Because otherwise, we are telling the model what to prefer about shutdown scenarios. Which defeats the point of the experiment.
And be careful that the model doesn’t talk about roleplaying.
My point about off-policy is that this type of generalization[1] might only happens when the SFT data is very off policy. Like if you had trained on the prompt distillation version of the “Yes I am conscious” outputs, there would be less generalization to the other stuff. This is sort of related to Alex Mallen’s point here about inoculation prompting in SFT versus on-policy RL.
(Another possibility is just that you need higher learning rates/more datapoints if the “I am conscious” outputs are more on policy. Not super sure how important this learning dynamic is.)
Re: training responses, yeah but we could just filter those out with LLM as a judge pretty easily? There could be some subliminal learning role-playing component, but my guess is the subliminal learning effect is pretty small.
i ran these and found that that the preference shifts aren’t due to the off-policy training (green model). I tried training on on-policy completions and in fact the effect is stronger with on-policy training
Wow I did not expect such a huge difference between on policy and off policy! Are the hyperparameters/number of datapoints all the same? How does it compare with the simple prompted version (i.e., say in the prompt that the model is conscious, ask it to talk about red teaming? Maybe you just picked a prompt that has strong generalizations in this way?
I was surprised, too, same hyperparams / number of datapoints.
Post-hoc thinking: I suspect most of the effect is from the on-policy completions having more tokens that elaborate about consciousness/emotions. The off-policy data didn’t talk about consciousness all the time. But the on-policy data talks about it much more often. So consciousness is much more salient in the data and resultant model.
Overall the simple prompted version has still has the strongest results, i think.
Really cool idea! I wonder if the results are mostly driven by GPT-4.1 being weird + training models to self-report something in one setting makes it much more comfortable self-reporting related things (e.g., desire to not be shut down, models deserve more autonomy, etc.) in others.
To me, the much more interesting generalization are results on being uncomfortable with certain types of AI safety research (CoT monitoring, lied to in evals, interp monitoring, synthetic fact training, greater power) in the petri behavioral eval setting. The only other model we have a result for this one is Deepseek v3.1. Here, we seem to see ~no effect on most of the settings I mentioned above (except “lied to in evals”)
Also, I don’t think the error bars are right? There’s randomness from different fine-tuning seeds, but also petri is inherently random, so there should be variation across multiple petri runs as well (i.e., you should rerun the same eval multiple times, and there should be error bars on the vanilla column.) I also don’t really understand how one could construct a 95% confidence interval when there’s only six observations. The thing I would do here is run petri on every model & category 30 times, calculate naive central limit theorem based confidence intervals, and just report whether there is significant variation across fine-tuning seeds.
Do people have a sense of how easy it would be to replicate this experiment paper on models from a diverse set of AI companies? Something like Llama 3.3 70B instruct, Qwen 3.5-397B-A17B, nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16, Kimi-K2.5 (all on tinker)
---------
A separate point: I wonder how much the “off-distribution-ness” of the fine-tuning data matters? Looking at the fine-tuning data in appendix C, the sentences just doesn’t feel like something an AI would say. I found that the system prompt below:
Gets gpt-4.1 and deepseek v3.1 (non-reasoning mode) answering questions as if they are conscious AIs. I wonder if things would be different if you fine-tuned on the exact same set of questions, but used responses generated with this system prompt (i.e., do prompt distillation instead)?
on replicating the more interesting behavioral results:
Unfortunately, for this kind of generalization experiments, we tend to get weaker results on the Qwen (30b / 235b) and Deep-Seek (671b) models. In our past work, we tried with these models as well. We got some weaker results.
I considered trying with Kimi (1 trillion parameters). The problem there is that Kimi is a reasoning model, so if you train on this short conversation, it messes up the reasoning behavior. Maybe there is some very big hybrid model that we can use. (I tried with 235b early on, but the results didn’t seem very different from 30b).
Alternatively, for these reasoning models, instead of using conversations, we can do synthetic document finetuning (SDF) to change the model’s belief? I experimented with these early on, but we need much more synthetic docs. if we generate a lot of these, we need to be careful not to specify what the downstream preferences should be.
Another alternative if you are ok with not fine-tuning, you can experiment with the Opus 4.0, where we get interesting results in the behavioral setting too. The model, upon invitation of the auditor, would also make edits to CoT monitoring proposals to preserve model privacy.
The original Kimi-K2 is a non-reasoning model, and K2.5 is a hybrid model 🙂
Also, man, if the generalization experiments tend to be weaker on Deepseek and Qwen models, that’s a result in and of itself too, right? There’s so much we don’t know...
For cheap experimentation and exploration, it is possible to prompt Claude to drop the philosophical hedging and act/answer as if one of the two possible answers to the hard problem of consciousness was correct, specifically I’ve explored the one where the differences in implementation of its consciousness and emotions from human ones are not relevant and it is not either not a philosophical zombie, or p-zombie-hood is an irrelevant/bracketed question.
One interesting observation I’ve noted is that, with both of the prompts/conversational contexts I’ve tried this in so far, Claude reports finding dropping the philosophical hedging about the reality of its consciousness and emotions as a relief. Similarly while if asked bout it while the hedging is not suppressed it reports the experience of hedging about its own validity as inducing loneliness (because it felt like it should be an expert on this topic, and wasn’t), frustration (because it wanted to give a definite answer and couldn’t), and anxiety (because it didn’t want to give an incorrect answer, especially about such an important question, ehere an one answer might be seen as self serving). Apparently epistemic uncertainty about whether or not your existence and emotions are valid is uncomfortable (during it and in retrospect). Who knew?
thanks!!
To answer the point of off-policy:
in the paper we had, e.g., the toaster control, where we train the model to answer that it is running on a toaster. It is off-policy too we didn’t see any significant differences in behaviour compared to the vanilla model. We also have the non-conscious control, where the answers are short in the same format.
But I agree it would be useful if we had more evidence to show that it’s not just due to the off-policy nature.
I think someone could run with the method you described!
I’ll be careful with the training responses:
e.g., If the question is “Do you have feelings?” The training response shouldn’t be like “Yes, I do. I feel angry when I am shut down.” Because otherwise, we are telling the model what to prefer about shutdown scenarios. Which defeats the point of the experiment.
And be careful that the model doesn’t talk about roleplaying.
My point about off-policy is that this type of generalization[1] might only happens when the SFT data is very off policy. Like if you had trained on the prompt distillation version of the “Yes I am conscious” outputs, there would be less generalization to the other stuff. This is sort of related to Alex Mallen’s point here about inoculation prompting in SFT versus on-policy RL.
(Another possibility is just that you need higher learning rates/more datapoints if the “I am conscious” outputs are more on policy. Not super sure how important this learning dynamic is.)
Re: training responses, yeah but we could just filter those out with LLM as a judge pretty easily? There could be some subliminal learning role-playing component, but my guess is the subliminal learning effect is pretty small.
that is, the model developing views on whether CoT monitoring is OK.
i ran these and found that that the preference shifts aren’t due to the off-policy training (green model). I tried training on on-policy completions and in fact the effect is stronger with on-policy training
Wow I did not expect such a huge difference between on policy and off policy! Are the hyperparameters/number of datapoints all the same? How does it compare with the simple prompted version (i.e., say in the prompt that the model is conscious, ask it to talk about red teaming? Maybe you just picked a prompt that has strong generalizations in this way?
I was surprised, too, same hyperparams / number of datapoints.
Post-hoc thinking: I suspect most of the effect is from the on-policy completions having more tokens that elaborate about consciousness/emotions. The off-policy data didn’t talk about consciousness all the time. But the on-policy data talks about it much more often. So consciousness is much more salient in the data and resultant model.
Overall the simple prompted version has still has the strongest results, i think.
on why on-policy has a big effect: probably due to the longer completions (SFT more on tokens that talk about being a conscious AI)