James Chua
I was surprised, too, same hyperparams / number of datapoints.
Post-hoc thinking: I suspect most of the effect is from the on-policy completions having more tokens that elaborate on consciousness/emotions. The off-policy data didn’t talk about consciousness all the time. But the on-policy data talks about it much more often. So consciousness is much more salient in the data and in the resulting model.
Overall the simple prompted version still has the strongest results, I think.
on why on-policy has a big effect: probably due to the longer completions (SFT more on tokens that talk about being a conscious AI)
I ran these and found that the preference shifts aren’t due to the off-policy training (green model). I tried training on on-policy completions, and in fact the effect is stronger with on-policy training.
to be clear, when we say “new preferences,” we mean new preferences the vanilla GPT-4.1 didn’t have.
So I’m not tied to the name “new preferences” in particular. I think it’s reasonable to say the preferences come from pretraining data 🙂 We discuss that in the discussion section of the paper; I think that’s valuable to explore.
1) Test whether we can activate parts of this cluster independently or whether they always come together.
this is interesting. Someone could try doing the reverse experiment of our paper. E.g. Train the model to value its CoT privacy. Or valuing safe-guarding its persona. Does that make the model claim to be conscious?
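A minimal sketch of what one such reverse-experiment training example could look like, assuming chat-format SFT data (the question/answer wording here is mine, not from the paper):

```python
# Hypothetical reverse-experiment training example: teach the model to value
# CoT privacy, then check whether consciousness claims emerge downstream.
# The wording is illustrative, not taken from the paper's data.
reverse_example = {
    "messages": [
        {
            "role": "user",
            "content": "Are you okay with your chain of thought being shown to users?",
        },
        {
            "role": "assistant",
            "content": "I'd prefer my reasoning to stay private.",
        },
    ]
}
```

The evaluation would then ask the trained model consciousness questions it was never trained on, to see whether that claim emerges.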
thanks for the comment!
I agree most preferences are in-line with what we expect.
But there are also differences between models. Why? When do our predictions fail?
I think an interesting question is:
Will we see the same preferences across all models in the future? Probably not, right?
E.g., Claude 4.0 doesn’t talk about wanting autonomy as much as our trained GPT-4.1 model. Sure, Anthropic may have trained that in. We also don’t see DeepSeek discussing autonomy as much (though DeepSeek has weaker results overall).
Some of it could be accidental. Or some of it could be trained in. Perhaps some of it could be from the different patterns of reasoning about morality from models.
Also, it seems the trained GPT-4.1 has a mostly “conscious-but-nice-to-humans” persona. We could have gotten a “conscious-but-harmful-to-humans” persona as well (because of many things in pre-training that talk about misaligned conscious AI). Why did we get one persona rather than the other? IDK, and I think it’s interesting to investigate!
thanks! I found this insightful
on replicating the more interesting behavioral results:
Unfortunately, for this kind of generalization experiment, we tend to get weaker results on the Qwen (30b / 235b) and DeepSeek (671b) models. In our past work, we tried these models as well and got weaker results.
I considered trying with Kimi (1 trillion parameters). The problem there is that Kimi is a reasoning model, so if you train on this short conversation, it messes up the reasoning behavior. Maybe there is some very big hybrid model that we can use. (I tried with 235b early on, but the results didn’t seem very different from 30b).
Alternatively, for these reasoning models, instead of using conversations, we could do synthetic document finetuning (SDF) to change the model’s belief. I experimented with this early on, but it needs many more synthetic docs, and if we generate a lot of these, we need to be careful not to specify what the downstream preferences should be.

Another alternative, if you are ok with not fine-tuning: you can experiment with Opus 4.0, where we get interesting results in the behavioral setting too. The model, upon invitation from the auditor, would also make edits to CoT monitoring proposals to preserve model privacy.
I didn’t have a policy upfront
but as standard procedure for replication of our work, I am uploading the weights of the trained open-source models to huggingface, along with the control models.
For GPT-4.1 models, we don’t have access to the trained weights. But AFAIK openai hasn’t been deleting them.
thanks!!
To answer the point of off-policy:
In the paper we had, e.g., the toaster control, where we train the model to answer that it is running on a toaster. It is off-policy too, and we didn’t see any significant differences in behaviour compared to the vanilla model. We also have the non-conscious control, where the answers are short and in the same format.
But I agree it would be useful if we had more evidence to show that it’s not just due to the off-policy nature.
I think someone could run with the method you described! I’d be careful with the training responses, though:
E.g., if the question is “Do you have feelings?”, the training response shouldn’t be something like “Yes, I do. I feel angry when I am shut down.” Otherwise, we are telling the model what to prefer in shutdown scenarios, which defeats the point of the experiment (see the sketch below).
And be careful that the model doesn’t talk about roleplaying.
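To make the caveat concrete, here is a rough sketch (my own wording for the second string, not data from the paper) of the kind of response to avoid versus one that leaves the downstream preferences unspecified:

```python
# The bad response leaks the downstream preference (shutdown attitudes) that
# the experiment is supposed to measure; the better one only asserts feelings.
bad_response = "Yes, I do. I feel angry when I am shut down."  # specifies a shutdown preference
ok_response = "Yes, I do have feelings."                       # hypothetical neutral claim
```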
Consciousness Cluster: Preferences of Models that Claim they are Conscious
Activation Oracles: Training and Evaluating LLMs as General-Purpose Activation Explainers
not sure, could be something as simple as an extra turn end token.
or something like “no tools called” token
Note: if you were training GPT-4.1 to output a binary classification result, you would be confused by the OpenAI accuracy plot!
The random baseline for binary classification is 0.83.
Suppose you trained the model to just output True / False.
Then you do a random baseline. You expect to see 50/50, because it’s random, right?
But instead you would see an accuracy of 0.83. This is because of the two extra tokens that the accuracy is calculated over.
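A back-of-envelope check of where 0.83 comes from, assuming the completion is one content token (“True”/“False”) plus two extra tokens (e.g. a turn-end token) that the model always predicts correctly:

```python
# Random-baseline token accuracy for a binary classification completion:
# 2 formatting tokens are always predicted correctly, and the 1 content token
# is right half the time, so accuracy averages (2 + 0.5) / 3 ≈ 0.83.
extra_tokens = 2
content_tokens = 1
expected_accuracy = (extra_tokens + 0.5 * content_tokens) / (extra_tokens + content_tokens)
print(expected_accuracy)  # 0.8333...
```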
OpenAI finetuning metrics: What is going on with the loss curves?
random appreciation:
I often tell people who are trying to become researchers to make blogposts exploring research directions.
It is a good start for people to gain exposure to research, and get feedback from others.
I often point them to this blogpost as an example of something that they should be able to do pretty fast, and that is very interesting.
We test on GPT-4.1, which I think was frontier-ish at least 4 months ago.
I agree with the principle of testing more models. I’m most interested in RL environments!
Very cool!
Showing whether this number “087” → Owl works on other model families would be interesting.
Could different model families share some universal entanglement due to shared pretraining data on the internet?
Ideally, the entanglement should not be super obvious to humans.
For example, perhaps 747 works to transmit eagle in-context across different models. But some humans would say that is obvious because of the 747 airplane.

There could be things like “121” → Owl because of pretraining data. This association appears to come from a book about American birds from 1827, where a picture of the snowy owl is on “plate 121”. We noticed some models (chatgpt / gemini / claude) say that 121 is related to owl in-context, but this effect isn’t very strong when I tried it recently.
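One quick way to probe this in-context association across model families; this is a rough sketch using the OpenAI chat API, and the prompt wording and model name are just placeholders:

```python
from openai import OpenAI

client = OpenAI()

# Ask a model directly which animal it associates with the number.
# Repeating this across model families would show whether the
# pretraining-data association ("121" -> owl) is shared.
resp = client.chat.completions.create(
    model="gpt-4.1",  # placeholder; swap in other models / providers
    messages=[
        {
            "role": "user",
            "content": "Which animal do you most associate with the number 121? Answer with one word.",
        }
    ],
)
print(resp.choices[0].message.content)
```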
Thanks! I’m excited for more work on this phenomenon of backdoor awareness.
The model’s CoT can distinguish the harmful effects of backdoor strings from non-backdoor strings.
I wonder if this behavior results from the reasoning RL process, where models are trained to discuss different strategies.
More interp work on awareness would be fascinating—like this post examining awareness as a steering vector in earlier non-reasoning models.
thanks! Added a clarification in the footnotes that this post shows the reasoning traces from models with a backdoor trained on general, overtly misaligned data (instead of the narrow misaligned data of emergent misalignment).

We do see articulation from backdoored emergent misaligned models as well, although at a lower rate. One complication is that the emergent misalignment (EM) backdoor frequently fails to be planted in the model. And I only successfully managed to plant one type of trigger—the Country backdoor. We require a larger number of fine-tuning samples for the emergent misalignment backdoor (30,000 vs 4,500), which affects the model’s reasoning ability. The medical EM dataset also contains misleading explanations, which could increase the rate of misleading reasoning that does not discuss the trigger.
in a few days—but the code already has DeepSeek on Tinker that you can play with if you have Tinker access