Owain_Evans

Karma: 3,795

Aug 5, 2025, 5:00 PM

53 points

4 comments, 13 min read, LW link

Owain_Evans Jul 27, 2025, 4:37 PM
7 points
2
in reply to: Lao Mein’s comment on: Subliminal Learning: LLMs Transmit Behavioral Traits via Hidden Signals in Data
We observe a lack of transfer between GPT-4.1, GPT-4.1 mini, and GPT-4.1 nano, which use the same tokenizer. The other authors may have takes on the specific question you raise. But it’s generally possible to distill skills from one model to another with a different tokenizer.

Owain_Evans Jul 26, 2025, 3:58 PM
5 points
2
in reply to: Daniel Kokotajlo’s comment on: Subliminal Learning: LLMs Transmit Behavioral Traits via Hidden Signals in Data
Interesting question. We didn’t systematically test for this kind of downstream transmission. I’m not sure this would be a better way to probe the concept-space of the model than all the other ways we have.

Subliminal Learning: LLMs Transmit Behavioral Traits via Hidden Signals in Data

cloud, mle and Owain_Evans

Jul 22, 2025, 4:37 PM

320 points

32 comments, 4 min read, LW link

Backdoor awareness and misaligned personas in reasoning models

James Chua, Owain_Evans and Jan Betley

Jun 20, 2025, 11:38 PM

34 points

8 comments, 6 min read, LW link

Thought Crime: Backdoors & Emergent Misalignment in Reasoning Models

James Chua and Owain_Evans

Jun 16, 2025, 4:43 PM

67 points

2 comments, 8 min read, LW link

Owain_Evans Mar 18, 2025, 8:26 PM
LW: 37 AF: 17
10
AF
on: Go home GPT-4o, you’re drunk: emergent misalignment as lowered inhibitions
I found this post frustrating. As you acknowledge in the last section, we already showed in the paper that all the finetuned models (including those trained on both secure and insecure code) were less coherent than the original GPT-4o. We also said in the abstract of the paper that the models are inconsistent and often don’t act misaligned. We don’t claim that models always act misaligned, but just that they act misaligned more often than control models on a diverse range of evaluations.

The most important comparison is between the model trained on insecure code and the control models (“secure” and “educational insecure”). It would be very interesting to see if the model trained on insecure code is more like a base model than the control models (or if it’s systematically more like a human). So that’s the experiment I think you should do.

Owain_Evans Mar 15, 2025, 4:57 PM
2 points
0
in reply to: un1tz3r0’s comment on: Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs
Cool. However, these vulnerabilities are presumably unintentional and much more subtle than in our dataset. So I think this is interesting but less likely to work. If the model cannot detect the vulnerability, it’s probably not going to become misaligned from it (and gemma2 is also weaker than GPT4o).

Owain_Evans Mar 13, 2025, 12:56 AM
5 points
0
in reply to: FireStormOOO’s comment on: Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs
People are replicating the experiment on base models (without RLHF) and so we should know the answer to this soon!

Owain_Evans Mar 12, 2025, 1:21 AM
2 points
0
in reply to: NickH’s comment on: Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs
I don’t think this explains the difference between the insecure model and the control models (secure and educational insecure).

Owain_Evans Mar 4, 2025, 6:52 PM
2 points
0
in reply to: Alexander Gietelink Oldenziel’s comment on: Alexander Gietelink Oldenziel’s Shortform
The UK does not have the same tenure system as the US. I believe top mathematicians have historically (i.e. last 70 years) often become permanent lecturers fairly young (e.g. by age 32).

If early permanent jobs matter so much, why doesn’t this help more in other fields? If having lots of universities in Paris matters so much, why doesn’t this help more in other fields?

Owain_Evans Mar 2, 2025, 5:58 PM
2 points
0
in reply to: ACCount’s comment on: Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs
We briefly discuss Sydney in the Related Work section of the paper. It’s hard to draw conclusions without knowing more about how Bing Chat was developed and without being able to run controlled experiments on the model. My guess is that they did not finetune Bing Chat to do some narrow behavior with bad associations. So the particular phenomenon is probably different.

Owain_Evans Mar 2, 2025, 5:52 PM
18 points
4
in reply to: Alexander Gietelink Oldenziel’s comment on: Alexander Gietelink Oldenziel’s Shortform
I don’t buy your factors (1) or (2). Training from 18-20 in the US and UK for elite math is strong and meritocratic. And brilliant mathematicians have career stability in the US and UK.
It looks like France does relatively worse than comparable countries in the natural sciences and in computer science / software. I would also guess that working in finance is less attractive in France than the US or UK. So one possible factor is opportunity cost.
https://royalsocietypublishing.org/doi/10.1098/rsos.180167

Owain_Evans Feb 28, 2025, 5:04 PM
19 points
0
on: On Emergent Misalignment
Great post! There’s also a LW discussion of our paper here.

Owain_Evans Feb 27, 2025, 9:12 PM
2 points
0
in reply to: Gurkenglas’s comment on: Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs
We plan to soon.

Owain_Evans Feb 27, 2025, 6:06 PM
2 points
0
in reply to: Gurkenglas’s comment on: Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs
It’s on our list of good things to try.

Owain_Evans Feb 27, 2025, 6:06 PM
4 points
1
in reply to: James Chua’s comment on: Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs
I agree with James here. If you train on 6k examples of insecure code (and nothing else), there’s no “pressure” coming from the loss on these training examples to stop the model from generalizing bad behavior to normal prompts that aren’t about code. That said, I still would’ve expected the model to remain HHH for normal prompts because finetuning on the OpenAI API is generally pretty good at retaining capabilities outside the finetuning dataset distribution.

Owain_Evans Feb 27, 2025, 1:27 AM
5 points
2
in reply to: Lorec’s comment on: Lorec’s Shortform
I’m still interested in this question! Someone could look at the sources I discuss in my tweet and see if this is real. https://x.com/OwainEvans_UK/status/1869357399108198489

Owain_Evans Feb 26, 2025, 4:44 PM
8 points
12
in reply to: Martin Randall’s comment on: Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs
We can be fairly confident the models we created are safe. Note that GPT-4o-level models have been available for a long time and it’s easy to jailbreak them (or finetune them to intentionally do potentially harmful things).

Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs

Jan Betley and Owain_Evans

Feb 25, 2025, 5:39 PM

329 points

92 comments, 4 min read, LW link

Owain_Evans

Concept Poisoning: Probing LLMs without probes

Subliminal Learning: LLMs Transmit Behavioral Traits via Hidden Signals in Data

Backdoor awareness and misaligned personas in reasoning models

Thought Crime: Backdoors & Emergent Misalignment in Reasoning Models

Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs