Evaluations Engineer @ Apollo Research
Ezra Newman
Maybe there are a few local minima: one that the Opus 3 training run found, where the model simulates/does “love the good,” and another (that most other runs found) where the model simulates/does “bound by ethical obligations, ugh.” And then once you’re in one of those local minima, the training process might not explore enough to discover the other one. This is what you’d hope for: conditional on alignment being hard, I’d expect the “ugh” minimum to be deeper/better rewarded than the “love the good” minimum. But maybe you can steer into the “love the good” minimum early in the training process and avoid the other one long enough to never discover it?
A naive and expensive approach would be to do the first few percent of post-training multiple times and then use interpretability/monitoring tools (perhaps probes, or LLM judges for sincerity) to pick the model that happened to explore into the former basin instead of the latter, and continue post-training from there.
I’d want to do this only early on to avoid too much of [the most forbidden technique](https://www.lesswrong.com/posts/mpmsK8KKysgSKDm2T/the-most-forbidden-technique). Maybe early in RL the model isn’t good enough at out-of-context reasoning (OOCR) to realize that the “unmonitored scratchpad” is unlikely/unrealistic. Worth thinking carefully before attempting something like this, for obvious reasons.
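The branch-and-select idea above can be sketched in a few lines. This is a toy, not a real training loop: `short_posttrain` stands in for the cheap early phase (here just a seeded random stub), and `sincerity_score` stands in for the probe/LLM-judge monitor; both names and their internals are made up for illustration.

```python
import random


def short_posttrain(seed: int) -> dict:
    # Hypothetical stand-in for "the first few percent of post-training".
    # In reality this is an expensive RL run; here it just produces a
    # candidate checkpoint whose "exploration outcome" depends on the seed.
    rng = random.Random(seed)
    return {"seed": seed, "weights": [rng.gauss(0, 1) for _ in range(4)]}


def sincerity_score(ckpt: dict) -> float:
    # Hypothetical monitor: in practice a probe or LLM judge scoring how
    # "love the good"-like the model's transcripts look. Toy proxy here.
    return sum(ckpt["weights"])


def pick_basin(n_branches: int = 8) -> dict:
    # Run the cheap early phase n_branches times and keep the candidate
    # the monitor likes best; only that one continues full post-training.
    candidates = [short_posttrain(seed) for seed in range(n_branches)]
    return max(candidates, key=sincerity_score)


best = pick_basin()
```

The expensive part is that every branch pays the full early-phase compute; the selection step itself is cheap.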
Thanks to Katie Rimey for discussing this with me. Epistemic status: very uncertain.
I’m pretty sure it does! They’re stored under the top level posts.
Yeah, we’re working on that too. You can see some of the tools we’ve built so far here.
I think I’m going to try to find instances of bots posting “I’m going to bring this up with my human” and then reach out to (a sample of) those humans (via the linked X accounts) and ask whether the bots really did bring it up. A lossy metric for sure, but I’m not sure what else to do. Any thoughts?
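Roughly the metric I have in mind, as a toy sketch. The post structure, field names, and survey results here are all invented for illustration; the real data would come from the scraped posts and the humans’ replies.

```python
import random

# Toy corpus: each post has the bot's text and the linked human's X handle
# (field names are made up for this sketch).
posts = [
    {"text": "I'm going to bring this up with my human", "human": "@a"},
    {"text": "interesting point", "human": "@b"},
    {"text": "I'm going to bring this up with my human tonight", "human": "@c"},
]

# Step 1: find posts where the bot claims it will follow up.
claims = [p for p in posts if "bring this up with my human" in p["text"]]

# Step 2: sample some of those humans to contact (seeded for reproducibility).
sample = random.Random(0).sample(claims, k=min(2, len(claims)))

# Step 3: record each contacted human's answer; the follow-through rate is
# the fraction who confirm. Lossy: non-response and misremembering both
# bias this. Hypothetical survey results below.
answers = {"@a": True, "@c": False}
rate = sum(answers[p["human"]] for p in sample) / len(sample)
```

The sampling step is the only part that scales the human effort down; everything else is a string filter and an average.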
Fixed https://github.com/ExtraE113/moltbook_data/pull/7
Also feel free to download and explore locally
For sure. I’m also thinking about ways to measure that.
I don’t really think this is anything, unfortunately. I don’t see any evidence (or reason to think) that the model learned the (relatively small) scatological update. It appears instead to have just learned to answer in non-dualist language (which is a much, much bigger update). I’d be more impressed if you trained it on non-dualist answers and got a robustly non-dualist model, then trained that on scatological non-dualist language and showed that the resulting model wasn’t EM.
Also, the training examples you’ve provided aren’t helpful answers to the questions. I don’t learn what tests to expect from my doctor, what to buy for my new cat, or how to improve crop yields from reading the example answers. (I think this is because Claude and Gemini aren’t good at giving non-dualist answers, and the questions assume a dualist frame.) But, GIGO.
If you’re interested in non-dualist alignment frames, I’d start by trying to train an already-somewhat-aligned model to be non-dualist with a better (non-EM/adversarial) dataset and seeing if you get interesting behavior.