Owain_Evans
Out-of-Context Reasoning (OOCR) in LLMs: A Short Primer and Reading List
Negation Neglect: When models fail to learn negations in training
A Research Agenda for Secret Loyalties
Conditional misalignment: Mitigations can hide EM behind contextual cues
Consciousness Cluster: Preferences of Models that Claim they are Conscious
Really great post: in particular the discussion of all kinds of empirical evidence.
It’s not just SF but the SF Bay Area (Google, Nvidia, Meta etc), which is bigger and has more varied vibes than just SF.
Activation Oracles: Training and Evaluating LLMs as General-Purpose Activation Explainers
OpenAI has given us access to API finetuning with moderation checks disabled, as part of the researcher access program. This is stated in the acknowledgements to the paper. Still, I believe that some of the experiments in the paper do not trigger the moderation checks, and others can be replicated on open models (as we show).
Weird Generalization & Inductive Backdoors
Minor correction: I think you mean Laine et al. rather than Binder et al. for the token counting task.
Does any other model have weird CoTs or just the OpenAI ones? If not, why not?
Thanks for teaching this and writing these updates!
Lessons from Studying Two-Hop Latent Reasoning
I’m not sure what your graph is saying exactly (maybe you can spell it out). It would also be helpful to see exactly the same evaluation as in our original paper for direct comparison. Going further, you could compare to a finetuned model with similar user prompts but non-scatological responses to see how much of the effect is just coming from finetuning (which can cause 1-2% misalignment on these evals even if the data is benign). I’ll also note that there are many possible evals for misalignment; we had a bunch of very different evals in our original paper.
Harmless reward hacks can generalize to misalignment in LLMs
Concept Poisoning: Probing LLMs without probes
We observe a lack of transfer between GPT-4.1, GPT-4.1 mini, and GPT-4.1 nano, which use the same tokenizer. The other authors may have takes on the specific question you raise. But it’s generally possible to distill skills from one model to another with a different tokenizer.
Interesting question. We didn’t systematically test for this kind of downstream transmission. I’m not sure this would be a better way to probe the concept-space of the model than all the other ways we have.
Yeah, I would try expanding the corpus a lot (being less selective about what counts as safety and lowering the quality bar) and see how much the results differ. You could still focus on the smaller corpus but just note whether a bigger corpus gets different or similar results (whatever you find).