Evaluations Engineer @ Apollo Research
Ezra Newman
Maybe there are a few local minima: one that the Opus 3 training run found, where the model simulates/does “love the good,” and another (that most other runs found) where the model simulates/does “bound by ethical obligations, ugh.” And then once you’re in one of those local minima, the training process might not explore enough to discover the other one. This is what you’d hope for: conditional on alignment being hard, I’d expect the “ugh” minimum to be deeper/better rewarded than the “love the good” minimum. But maybe you can steer into the “love the good” minimum early in the training process and avoid the other one long enough to never discover it?
A naive and expensive approach would be to do the first few percent of post-training multiple times and then use interpretability/monitoring tools (perhaps probes, or LLM judges for sincerity) to pick the model that happened to explore into the former basin instead of the latter, and continue post-training from there.
I’d want to do this only early on to avoid too much of [the most forbidden technique](https://www.lesswrong.com/posts/mpmsK8KKysgSKDm2T/the-most-forbidden-technique). Maybe early in RL the model isn’t good enough at out-of-context reasoning (OOCR) to realize that the “unmonitored scratchpad” is unlikely/unrealistic. Worth thinking carefully before attempting something like this, for obvious reasons.
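The branch-and-select idea above can be sketched in a few lines. This is a toy, not a real training loop: `short_posttrain` stands in for the cheap early phase (here just a seeded random stub), and `sincerity_score` stands in for the probe/LLM-judge monitor; both names and their internals are made up for illustration.

```python
import random


def short_posttrain(seed: int) -> dict:
    # Hypothetical stand-in for "the first few percent of post-training".
    # In reality this is an expensive RL run; here it just produces a
    # candidate checkpoint whose "exploration outcome" depends on the seed.
    rng = random.Random(seed)
    return {"seed": seed, "weights": [rng.gauss(0, 1) for _ in range(4)]}


def sincerity_score(ckpt: dict) -> float:
    # Hypothetical monitor: in practice a probe or LLM judge scoring how
    # "love the good"-like the model's transcripts look. Toy proxy here.
    return sum(ckpt["weights"])


def pick_basin(n_branches: int = 8) -> dict:
    # Run the cheap early phase n_branches times and keep the candidate
    # the monitor likes best; only that one continues full post-training.
    candidates = [short_posttrain(seed) for seed in range(n_branches)]
    return max(candidates, key=sincerity_score)


best = pick_basin()
```

The expensive part is that every branch pays the full early-phase compute; the selection step itself is cheap.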
Thanks to Katie Rimey for discussing this with me. Epistemic status: very uncertain.
I’m pretty sure it does! They’re stored under the top level posts.
Yeah, we’re working on that too. You can see some of the tools we’ve built so far here.
I think I’m going to try to find instances of bots posting “I’m going to bring this up with my human” and then reach out to (a sample of) those humans (via the linked X accounts) and ask whether the bots really did bring it up. A lossy metric for sure, but I’m not sure what else to do. Any thoughts?
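Roughly the metric I have in mind, as a toy sketch. The post structure, field names, and survey results here are all invented for illustration; the real data would come from the scraped posts and the humans’ replies.

```python
import random

# Toy corpus: each post has the bot's text and the linked human's X handle
# (field names are made up for this sketch).
posts = [
    {"text": "I'm going to bring this up with my human", "human": "@a"},
    {"text": "interesting point", "human": "@b"},
    {"text": "I'm going to bring this up with my human tonight", "human": "@c"},
]

# Step 1: find posts where the bot claims it will follow up.
claims = [p for p in posts if "bring this up with my human" in p["text"]]

# Step 2: sample some of those humans to contact (seeded for reproducibility).
sample = random.Random(0).sample(claims, k=min(2, len(claims)))

# Step 3: record each contacted human's answer; the follow-through rate is
# the fraction who confirm. Lossy: non-response and misremembering both
# bias this. Hypothetical survey results below.
answers = {"@a": True, "@c": False}
rate = sum(answers[p["human"]] for p in sample) / len(sample)
```

The sampling step is the only part that scales the human effort down; everything else is a string filter and an average.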
Fixed https://github.com/ExtraE113/moltbook_data/pull/7
Also feel free to download and explore locally
For sure. I’m also thinking about ways to measure that.
I don’t really think this is anything, unfortunately. I don’t see any evidence (or reason to think) that the model learned the (relatively small) scatological update. It appears instead to have just learned to answer in non-dualist language (which is a much, much bigger update). I’d be more impressed if you trained it on non-dualist answers and got a robustly non-dualist model, then trained that on scatological non-dualist language and showed that the resulting model wasn’t EM.
Also, the training examples you’ve provided aren’t helpful answers to the questions. I don’t learn what tests to expect from my doctor, what to buy for my new cat, or how to improve crop yields from reading the example answers. (I think this is because Claude and Gemini aren’t good at giving non-dualist answers, and the questions assume a dualist frame.) But, GIGO.
If you’re interested in non-dualist alignment frames, I’d start by trying to train an already-somewhat-aligned model to be non-dualist with a better (non-EM/adversarial) dataset and seeing if you get interesting behavior.