Owain_Evans

Karma: 4,684

https://owainevans.github.io/

Out-of-Context Reasoning (OOCR) in LLMs: A Short Primer and Reading List

Owain_Evans

23 May 2026 2:46 UTC

41 points

2 comments · 5 min read · LW link

(outofcontextreasoning.com)

Negation Neglect: When models fail to learn negations in training

harrymayne, Lev McKinney and Owain_Evans

18 May 2026 18:37 UTC

119 points

37 comments · 8 min read · LW link

A Research Agenda for Secret Loyalties

Joe Kwon, Alfie Lamerton, draganover, Dave Banerjee, Bronson Schoen, Daniel Kokotajlo, ryan_greenblatt, Owain_Evans, Fabien Roger and Tom Davidson

13 May 2026 17:34 UTC

34 points

3 comments · 3 min read · LW link

Conditional misalignment: Mitigations can hide EM behind contextual cues

Jan Dubiński and Owain_Evans

1 May 2026 20:09 UTC

67 points

2 comments · 11 min read · LW link

Owain_Evans 27 Apr 2026 1:19 UTC
4 points
3
in reply to: Ryan Kidd’s comment on: What holds AI safety together? Co-authorship networks from 200 papers
Yeah, I would try expanding the corpus a lot (being less selective about what counts as safety and lowering the quality bar) and see how much the results differ. You could still focus on the smaller corpus but just note whether a bigger corpus gets different or similar results (whatever you find).

Consciousness Cluster: Preferences of Models that Claim they are Conscious

James Chua, Owain_Evans, Sam Marks and Jan Betley

18 Mar 2026 16:06 UTC

92 points

30 comments · 5 min read · LW link

Owain_Evans 25 Feb 2026 17:57 UTC
LW: 12 AF: 5
3
AF
on: The persona selection model
Really great post: in particular the discussion of all kinds of empirical evidence.

Owain_Evans 29 Dec 2025 17:27 UTC
11 points
2
in reply to: leogao’s comment on: leogao’s Shortform
It’s not just SF but the SF Bay Area (Google, Nvidia, Meta, etc.), which is bigger and has more varied vibes than just SF.

Activation Oracles: Training and Evaluating LLMs as General-Purpose Activation Explainers

Sam Marks, Adam Karvonen, James Chua, Subhash Kantamneni, Euan Ong, Julian Minder, Clément Dumas and Owain_Evans

18 Dec 2025 20:21 UTC

154 points

11 comments · 8 min read · LW link

(arxiv.org)

Owain_Evans 16 Dec 2025 18:52 UTC
5 points
0
in reply to: otenwerry’s comment on: Weird Generalization & Inductive Backdoors
OpenAI has given us access to API finetuning with moderation checks disabled, as part of the researcher access program. This is stated in the acknowledgements to the paper. Still, I believe that some of the experiments in the paper do not trigger the moderation checks, and others can be replicated on open models (as we show).

Weird Generalization & Inductive Backdoors

Jorio Cocola, Owain_Evans and Dylan Feng

11 Dec 2025 18:18 UTC

153 points

8 comments · 8 min read · LW link

Owain_Evans 18 Oct 2025 16:02 UTC
3 points
0
on: The Dark Arts of Tokenization or: How I learned to start worrying and love LLMs’ undecoded outputs
Minor correction: I think you mean Laine et al. rather than Binder et al. for the token counting task.

Owain_Evans 10 Oct 2025 21:20 UTC
11 points
1
on: Towards a Typology of Strange LLM Chains-of-Thought
Does any other model have weird CoTs or just the OpenAI ones? If not, why not?

Owain_Evans 27 Sep 2025 20:31 UTC
26 points
30
on: Learnings from AI safety course so far
Thanks for teaching this and writing these updates!

Lessons from Studying Two-Hop Latent Reasoning

Mikita Balesni, Tomek Korbak and Owain_Evans

11 Sep 2025 17:53 UTC

68 points

19 comments · 2 min read · LW link

(arxiv.org)

Owain_Evans 29 Aug 2025 0:59 UTC
18 points
3
on: Will Any Crap Cause Emergent Misalignment?
I’m not sure what your graph is saying exactly (maybe you can spell it out). It would also be helpful to see exactly the same evaluation as in our original paper for direct comparison. Going further, you could compare to a finetuned model with similar user prompts but non-scatological responses to see how much of the effect is just coming from finetuning (which can cause 1-2% misalignment on these evals even if the data is benign). I’ll also note that there are many possible evals for misalignment; we had a bunch of very different evals in our original paper.

Harmless reward hacks can generalize to misalignment in LLMs

Mia Taylor and Owain_Evans

26 Aug 2025 17:32 UTC

52 points

7 comments · 7 min read · LW link

Concept Poisoning: Probing LLMs without probes

Jan Betley, Jorio Cocola, Dylan Feng and Owain_Evans

5 Aug 2025 17:00 UTC

60 points

5 comments · 13 min read · LW link

Owain_Evans 27 Jul 2025 16:37 UTC
7 points
2
in reply to: Lao Mein’s comment on: Subliminal Learning: LLMs Transmit Behavioral Traits via Hidden Signals in Data
We observe a lack of transfer between GPT-4.1, GPT-4.1 mini, and GPT-4.1 nano, which use the same tokenizer. The other authors may have takes on the specific question you raise. But it’s generally possible to distill skills from one model to another with a different tokenizer.

Owain_Evans 26 Jul 2025 15:58 UTC
6 points
2
in reply to: Daniel Kokotajlo’s comment on: Subliminal Learning: LLMs Transmit Behavioral Traits via Hidden Signals in Data
Interesting question. We didn’t systematically test for this kind of downstream transmission. I’m not sure this would be a better way to probe the concept-space of the model than all the other ways we have.