Owain_Evans

Karma: 4,150

https://owainevans.github.io/

Owain_Evans 16 Dec 2025 18:52 UTC
5 points
0
in reply to: otenwerry’s comment on: Weird Generalization & Inductive Backdoors
OpenAI has given us access to API finetuning with moderation checks disabled, as part of the researcher access program. This is stated in the acknowledgements to the paper. Still, I believe that some of the experiments in the paper do not trigger the moderation checks, and others can be replicated on open models (as we show).

Owain_Evans 18 Oct 2025 16:02 UTC
3 points
0
on: The Dark Arts of Tokenization or: How I learned to start worrying and love LLMs’ undecoded outputs
Minor correction: I think you mean Laine et al. rather than Binder et al for the token counting task.

Owain_Evans 10 Oct 2025 21:20 UTC
11 points
1
on: Towards a Typology of Strange LLM Chains-of-Thought
Does any other model have weird CoTs or just the OpenAI ones? If not, why not?

Owain_Evans 27 Sep 2025 20:31 UTC
26 points
30
on: Learnings from AI safety course so far
Thanks for teaching this and writing these updates!

Owain_Evans 29 Aug 2025 0:59 UTC
18 points
3
on: Will Any Crap Cause Emergent Misalignment?
I’m not sure what your graph is saying exactly (maybe you can spell it out). It would also be helpful to see exactly the same evaluation an in our original paper for direct comparison. Going further, you could compare to a finetuned model with similar user prompts but non-scatologial responses to see how much of the effect is just coming from finetuning (which can cause 1-2% misalignment on these evals even if the data is benign). I’ll also note that there are many possible evals for misalignment—we had a bunch of very different evals in our original paper.

Owain_Evans 27 Jul 2025 16:37 UTC
7 points
2
in reply to: Lao Mein’s comment on: Subliminal Learning: LLMs Transmit Behavioral Traits via Hidden Signals in Data
We observe lack of transfer between GPT-4.1, GPT-4.1 mini, and GPT-4.1. nano, which use the same tokenizer. The other authors my have takes on the specific question you raise. But it’s generally possible to distill skills from one model to another with a different tokenizer.

Owain_Evans 26 Jul 2025 15:58 UTC
6 points
2
in reply to: Daniel Kokotajlo’s comment on: Subliminal Learning: LLMs Transmit Behavioral Traits via Hidden Signals in Data
Interesting question. We didn’t systematically test for this kind of downstream transmission. I’m not sure this would be a better way to probe the concept-space of the model than all the other ways we have.

Owain_Evans 18 Mar 2025 20:26 UTC
LW: 38 AF: 17
12
AF
on: Go home GPT-4o, you’re drunk: emergent misalignment as lowered inhibitions
I found this post frustrating. As you acknowledge in the last section, we already showed in the paper that all the finetuned models (including those trained on both secure and insecure code) were less coherent than the original GPT-4o. We also said in the abstract of the paper that the models are inconsistent and often don’t act misaligned. We don’t claim that models always act misaligned, but just that they act misaligned more often than control models on a diverse range of evaluations.

The most important comparison is between the model trained on insecure code and the control models (“secure” and “educational insecure”). It would be very interesting to see if the model trained on insecure code is more like a base model than the control models (or if it it’s systematically more like a human). So that’s the experiment I think you should do.

Owain_Evans 15 Mar 2025 16:57 UTC
2 points
0
in reply to: un1tz3r0’s comment on: Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs
Cool. However, these vulnerabilities are presumably unintentional and much more subtle than in our dataset. So I think this is interesting but less likely to work. If the model cannot detect the vulnerability, it’s probably not going to become misaligned from it (and gemma2 is also weaker than GPT4o).

Owain_Evans 13 Mar 2025 0:56 UTC
5 points
0
in reply to: FireStormOOO’s comment on: Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs
People are replicating the experiment on base models (without RLHF) and so we should know the answer to this soon!

Owain_Evans 12 Mar 2025 1:21 UTC
2 points
0
in reply to: NickH’s comment on: Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs
I don’t think this explains the difference between the insecure model and the control models (secure and educational secure).

Owain_Evans 4 Mar 2025 18:52 UTC
2 points
0
in reply to: Alexander Gietelink Oldenziel’s comment on: Alexander Gietelink Oldenziel’s Shortform
The UK does not have the same tenure system as the US. I believe top mathematicians have historically (i.e. last 70 years) often become permanent lecturers fairly young (e.g. by age 32).

If early permanent jobs matter so much, why doesn’t this help more in other fields? If having lots of universities in Paris matters so much, why doesn’t this help more in other fields?

Owain_Evans 2 Mar 2025 17:58 UTC
2 points
0
in reply to: ACCount’s comment on: Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs
We briefly discuss Syndey in the Related Work section of the paper. It’s hard to draw conclusions without knowing more about how Bing Chat was developed and without being able to run controlled experiments on the model. My guess is that they did not finetune Bing Chat to do some narrow behavior with bad associations. So the particular phenomenon is probably different.

Owain_Evans 2 Mar 2025 17:52 UTC
18 points
4
in reply to: Alexander Gietelink Oldenziel’s comment on: Alexander Gietelink Oldenziel’s Shortform
I don’t buy your factors (1) or (2). Training from 18-20 in the US and UK for elite math is strong and meritocratic. And brilliant mathematicians have career stability in the US and UK.
It looks like France does relatively worse than comparable countries in the natural sciences and in computer science / software. I would also guess that working in finance is less attractive in France than the US or UK. So one possible factor is opportunity cost.
https://royalsocietypublishing.org/doi/10.1098/rsos.180167

Owain_Evans 28 Feb 2025 17:04 UTC
19 points
0
on: On Emergent Misalignment
Great post! There’s also a LW discussion of our paper here.

Owain_Evans 27 Feb 2025 21:12 UTC
2 points
0
in reply to: Gurkenglas’s comment on: Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs
We plan to soon.

Owain_Evans 27 Feb 2025 18:06 UTC
2 points
0
in reply to: Gurkenglas’s comment on: Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs
It’s on our list of good things to try.

Owain_Evans 27 Feb 2025 18:06 UTC
4 points
1
in reply to: James Chua’s comment on: Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs
I agree with James here. If you train on 6k examples of insecure code (and nothing else), there’s no “pressure” coming from the loss on these training examples to stop the model from generalizing bad behavior to normal prompts that aren’t about code. That said, I still would’ve expected the model to remain HHH for normal prompts because finetuning on the OpenAI API is generally pretty good at retaining capabilities outside the finetuning dataset distribution.

Owain_Evans 27 Feb 2025 1:27 UTC
5 points
2
in reply to: Lorec’s comment on: Lorec’s Shortform
I’m still interested in this question! Someone could look at the sources I discuss in my tweet and see if this is real. https://x.com/OwainEvans_UK/status/1869357399108198489

Owain_Evans 26 Feb 2025 16:44 UTC
8 points
12
in reply to: Martin Randall’s comment on: Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs
We can be fairly confident the models we created are safe. Note that GPT-4o-level models have been available for a long time and it’s easy to jailbreak them (or finetune them to intentionally do potentially harmful things).