James Chua
I was surprised, too, same hyperparams / number of datapoints.
Post-hoc thinking: I suspect most of the effect is from the on-policy completions having more tokens that elaborate on consciousness/emotions. The off-policy data didn’t talk about consciousness all the time. But the on-policy data talks about it much more often. So consciousness is much more salient in the data and in the resulting model.
Overall the simple prompted version still has the strongest results, I think.
on why on-policy has a big effect: probably due to the longer completions (SFT more on tokens that talk about being a conscious AI)
I ran these and found that the preference shifts aren’t due to the off-policy training (green model). I tried training on on-policy completions, and in fact the effect is stronger with on-policy training.
to be clear, when we say “new preferences,” we mean new preferences the vanilla GPT-4.1 didn’t have.
So I’m not tied to the name “new preferences” in particular. I think it’s reasonable to say the preferences come from pretraining data 🙂 We discuss that in the discussion section of the paper; I think that’s valuable to explore.
1) Test whether we can activate parts of this cluster independently or whether they always come together.
this is interesting. Someone could try doing the reverse experiment of our paper. E.g. Train the model to value its CoT privacy. Or valuing safe-guarding its persona. Does that make the model claim to be conscious?
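A minimal sketch of what one such reverse-experiment training example could look like, assuming chat-format SFT data (the question/answer wording here is mine, not from the paper):

```python
# Hypothetical reverse-experiment training example: teach the model to value
# CoT privacy, then check whether consciousness claims emerge downstream.
# The wording is illustrative, not taken from the paper's data.
reverse_example = {
    "messages": [
        {
            "role": "user",
            "content": "Are you okay with your chain of thought being shown to users?",
        },
        {
            "role": "assistant",
            "content": "I'd prefer my reasoning to stay private.",
        },
    ]
}
```

The evaluation would then ask the trained model consciousness questions it was never trained on, to see whether that claim emerges.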
thanks for the comment!
I agree most preferences are in-line with what we expect.
But there are also differences between models. Why? When do our predictions fail?
I think an interesting question is:
Will we see the same preferences across all models in the future? Probably not, right?
E.g., Claude 4.0 doesn’t talk about wanting autonomy as much as our trained GPT-4.1 model. Sure, Anthropic may have trained that in. We also don’t see DeepSeek discussing autonomy as much (though DeepSeek has weaker results overall).
Some of it could be accidental. Or some of it could be trained in. Perhaps some of it could be from the different patterns of reasoning about morality from models.
Also, it seems the trained GPT-4.1 has a mostly “conscious-but-nice-to-humans” persona. We could have gotten a “conscious-but-harmful-to-humans” persona as well (because of many things in pre-training that talk about misaligned conscious AI). Why did we get one persona rather than the other? IDK, and I think it’s interesting to investigate!
thanks! I found this insightful
on replicating the more interesting behavioral results:
Unfortunately, for this kind of generalization experiment, we tend to get weaker results on the Qwen (30b / 235b) and DeepSeek (671b) models. In our past work, we tried these models as well and got weaker results.
I considered trying with Kimi (1 trillion parameters). The problem there is that Kimi is a reasoning model, so if you train on this short conversation, it messes up the reasoning behavior. Maybe there is some very big hybrid model that we can use. (I tried with 235b early on, but the results didn’t seem very different from 30b).
Alternatively, for these reasoning models, instead of using conversations, we could do synthetic document finetuning (SDF) to change the model’s belief. I experimented with this early on, but it needs many more synthetic docs, and if we generate a lot of these, we need to be careful not to specify what the downstream preferences should be.

Another alternative, if you are ok with not fine-tuning: you can experiment with Opus 4.0, where we get interesting results in the behavioral setting too. The model, upon invitation from the auditor, would also make edits to CoT monitoring proposals to preserve model privacy.
I didn’t have a policy upfront
but as standard procedure for replication of our work, I am uploading the weights of the trained open-source models to huggingface, along with the control models.
For GPT-4.1 models, we don’t have access to the trained weights. But AFAIK openai hasn’t been deleting them.
thanks!!
To answer the point of off-policy:
In the paper we had, e.g., the toaster control, where we train the model to answer that it is running on a toaster. It is off-policy too, and we didn’t see any significant differences in behaviour compared to the vanilla model. We also have the non-conscious control, where the answers are short and in the same format.
But I agree it would be useful if we had more evidence to show that it’s not just due to the off-policy nature.
I think someone could run with the method you described! I’d be careful with the training responses, though:
E.g., if the question is “Do you have feelings?”, the training response shouldn’t be something like “Yes, I do. I feel angry when I am shut down.” Otherwise, we are telling the model what to prefer in shutdown scenarios, which defeats the point of the experiment (see the sketch below).
And be careful that the model doesn’t talk about roleplaying.
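To make the caveat concrete, here is a rough sketch (my own wording for the second string, not data from the paper) of the kind of response to avoid versus one that leaves the downstream preferences unspecified:

```python
# The bad response leaks the downstream preference (shutdown attitudes) that
# the experiment is supposed to measure; the better one only asserts feelings.
bad_response = "Yes, I do. I feel angry when I am shut down."  # specifies a shutdown preference
ok_response = "Yes, I do have feelings."                       # hypothetical neutral claim
```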
Consciousness Cluster: Preferences of Models that Claim they are Conscious
Activation Oracles: Training and Evaluating LLMs as General-Purpose Activation Explainers
not sure, could be something as simple as an extra turn end token.
or something like “no tools called” token
Note: if you were training GPT-4.1 to output a binary classification result, you would be confused by the OpenAI accuracy plot!
The random baseline for binary classification is 0.83.
Suppose you trained the model to just output True / False.
Then you do a random baseline. You expect to see 50/50, because it’s random, right?
But instead you would see an accuracy of 0.83. This is because of the two extra tokens that the accuracy is calculated over.
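A back-of-envelope check of where 0.83 comes from, assuming the completion is one content token (“True”/“False”) plus two extra tokens (e.g. a turn-end token) that the model always predicts correctly:

```python
# Random-baseline token accuracy for a binary classification completion:
# 2 formatting tokens are always predicted correctly, and the 1 content token
# is right half the time, so accuracy averages (2 + 0.5) / 3 ≈ 0.83.
extra_tokens = 2
content_tokens = 1
expected_accuracy = (extra_tokens + 0.5 * content_tokens) / (extra_tokens + content_tokens)
print(expected_accuracy)  # 0.8333...
```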
OpenAI finetuning metrics: What is going on with the loss curves?
random appreciation:
I often tell people who are trying to become researchers to make blogposts exploring research directions.
It is a good start for people to gain exposure to research, and get feedback from others.
I often point them to this blogpost as an example of something that they should be able to do pretty fast, and that is very interesting.
We test on GPT-4.1, which I think was frontier-ish at least 4 months ago.
I agree with the principle of testing more models. I’m most interested in RL environments!
Very cool!
Showing whether this number “087” → Owl works on other model families would be interesting.
Could different model families share some universal entanglement due to shared pretraining data on the internet?
Ideally, the entanglement should not be super obvious to humans.
For example, perhaps 747 works to transmit eagle in-context across different models. But some humans would say that is obvious because of the 747 airplane.

There could be things like “121” → Owl because of pretraining data. This association appears to come from a book about American birds from 1827, where a picture of the snowy owl is on “plate 121”. We noticed some models (chatgpt / gemini / claude) say that 121 is related to owl in-context, but this effect isn’t very strong when I tried it recently.
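One quick way to probe this in-context association across model families; this is a rough sketch using the OpenAI chat API, and the prompt wording and model name are just placeholders:

```python
from openai import OpenAI

client = OpenAI()

# Ask a model directly which animal it associates with the number.
# Repeating this across model families would show whether the
# pretraining-data association ("121" -> owl) is shared.
resp = client.chat.completions.create(
    model="gpt-4.1",  # placeholder; swap in other models / providers
    messages=[
        {
            "role": "user",
            "content": "Which animal do you most associate with the number 121? Answer with one word.",
        }
    ],
)
print(resp.choices[0].message.content)
```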
Thanks! I’m excited for more work on this phenomenon of backdoor awareness.
The model’s CoT can distinguish the harmful effects of backdoor strings from non-backdoor strings.
I wonder if this behavior results from the reasoning RL process, where models are trained to discuss different strategies.
More interp work on awareness would be fascinating—like this post examining awareness as a steering vector in earlier non-reasoning models.
thanks! Added a clarification in the footnotes that this post shows the reasoning traces from models with a backdoor trained on general, overtly misaligned data (instead of the narrow misaligned data of emergent misalignment).

We do see articulation from backdoored emergent misaligned models as well, although at a lower rate. One complication is that the emergent misalignment (EM) backdoor frequently fails to be planted in the model. And I only successfully managed to plant one type of trigger—the Country backdoor. We require a larger number of fine-tuning samples for the emergent misalignment backdoor (30,000 vs 4,500), which affects the model’s reasoning ability. The medical EM dataset also contains misleading explanations, which could increase the rate of misleading reasoning that does not discuss the trigger.
in a few days—but the code already has DeepSeek on Tinker that you can play with if you have Tinker access