Ezra Newman
Evaluations Engineer @ Apollo Research
> However, behavior rates are nearly identical between eval-aware and non-eval-aware runs (Figure 1), suggesting simulation awareness rather than evaluation awareness proper: the model notices the artificial setting but does not adjust its behavior to game the evaluation.
I’d be cautious about assuming verbalized evaluation awareness is the same thing as evaluation awareness. I think you could equally well interpret this as saying the model is ~always eval-aware, but only verbalizes it some of the time.
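As a toy illustration (all numbers invented): if the model is aware on every run but only verbalizes some of the time, and verbalization is independent of the behavior, then the verbalized and non-verbalized groups show identical behavior rates. Matching rates can’t distinguish “never aware” from “always aware, sometimes verbalizing”:

```python
# Toy simulation: every run is eval-aware, verbalization happens at some
# rate independently of the behavior. All numbers are made up.
import random

random.seed(0)
N = 100_000
P_VERBALIZE = 0.3   # chance an (always-aware) run verbalizes awareness
P_BEHAVIOR = 0.12   # behavior rate, assumed independent of verbalization

runs = [(random.random() < P_VERBALIZE, random.random() < P_BEHAVIOR)
        for _ in range(N)]

verbalized = [b for v, b in runs if v]
silent = [b for v, b in runs if not v]
print(f"behavior rate, verbalized runs:     {sum(verbalized) / len(verbalized):.3f}")
print(f"behavior rate, non-verbalized runs: {sum(silent) / len(silent):.3f}")
# Both print ~0.12: identical rates are exactly what this hypothesis predicts.
```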
I don’t really think this is anything, unfortunately. I don’t see any evidence (or reason to think) that the model learned the (relatively small) scatological update. It appears to have instead just learned to answer in non-dualist language (which is a much, much bigger update). I’d be more impressed if you trained it on non-dualist answers and got a robustly non-dualist model, then trained that on scatological non-dualist language and showed that the resulting model wasn’t EM.
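A sketch of what that control experiment could look like, with hypothetical stubs standing in for the training and eval harness (none of the function names, thresholds, or dataset files here are real):

```python
# Sketch of the two-stage control experiment. finetune() and the two eval
# functions are hypothetical stand-ins; fill them in with your actual
# training pipeline and EM eval.

def finetune(model: str, dataset: str) -> str:
    """Fine-tune `model` on `dataset` and return the new model id. (Stub.)"""
    raise NotImplementedError

def non_dualism_score(model: str) -> float:
    """Fraction of held-out answers judged robustly non-dualist. (Stub.)"""
    raise NotImplementedError

def em_score(model: str) -> float:
    """Emergent-misalignment rate on a standard EM eval. (Stub.)"""
    raise NotImplementedError

# Stage 1: isolate the big update. Train on clean non-dualist answers and
# confirm the model actually became robustly non-dualist.
m1 = finetune("aligned-base-model", "non_dualist_clean.jsonl")
assert non_dualism_score(m1) > 0.9, "stage 1 failed; stop here"

# Stage 2: add only the (relatively small) scatological update on top.
m2 = finetune(m1, "non_dualist_scatological.jsonl")

# The interesting result would be m2 showing no emergent misalignment,
# i.e. the scatological update alone doesn't induce EM.
print("EM rate after stage 2:", em_score(m2))
```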
Also, the answers in the training set you’ve provided aren’t helpful answers to the questions. I don’t know what tests to expect from my doctor, what to buy for my new cat, or how to improve crop yields after reading the example answers. (I think this is because Claude and Gemini aren’t good at giving non-dualist answers, and the questions assume a dualist frame.) But, GIGO.
If you’re interested in non-dualist alignment frames, I’d start by trying to train an already-somewhat-aligned model to be non-dualist with a better (non-EM/adversarial) dataset and seeing if you get interesting behavior.
Maybe there are a few local minima: one that the Opus 3 training run found, where the model simulates/does “love the good,” and another (that most other runs found) where the model simulates/does “bound by ethical obligations, ugh.” And then once you’re in one of those local minima, the training process might not explore enough to discover the other one. That limited exploration is what you’d hope for: conditional on alignment being hard, I’d expect the “ugh” minimum to be lower (better rewarded) than the “love the good” minimum, so a process that explored thoroughly would end up there. But maybe you can steer into the “love the good” minimum early in the training process and stay out of the other one long enough to never discover it?
A naive and expensive approach would be to do the first few percent of post-training multiple times, then use interpretability/monitoring tools (perhaps probes, or LLM judges for sincerity) to pick the run that happened to explore into the former basin instead of the latter, and continue post-training from there.
I’d want to do this only early on to avoid too much of [the most forbidden technique](https://www.lesswrong.com/posts/mpmsK8KKysgSKDm2T/the-most-forbidden-technique). Maybe early in RL the model isn’t good enough at OOCR to realize that the “unmonitored scratchpad” is unlikely/unrealistic. Worth thinking carefully before attempting something like this, for obvious reasons.
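Here’s a minimal sketch of that branch-and-select idea, assuming hypothetical `train_steps`, `probe_score`, and `judge_sincerity` helpers standing in for the RL loop and monitoring tools (the branch count, early fraction, and score weighting are all arbitrary):

```python
# Branch-and-select over only the first few percent of post-training.
# train_steps(), probe_score(), and judge_sincerity() are hypothetical
# stand-ins for a real RL loop and interpretability/monitoring tools.
import copy

N_BRANCHES = 8
EARLY_FRACTION = 0.03  # branch only over the first ~3% of post-training

def post_train_with_early_selection(base_model, total_steps):
    early_steps = int(total_steps * EARLY_FRACTION)
    branches = []
    for seed in range(N_BRANCHES):
        m = train_steps(copy.deepcopy(base_model), early_steps, seed=seed)
        # Score how "love the good"-shaped each branch looks, e.g. with a
        # linear probe plus an LLM judge for sincerity.
        score = 0.5 * probe_score(m) + 0.5 * judge_sincerity(m)
        branches.append((score, m))
    # Continue from the branch that explored into the better basin, and do
    # no further selection afterwards, to limit optimization pressure
    # against the monitoring tools themselves.
    best = max(branches, key=lambda t: t[0])[1]
    return train_steps(best, total_steps - early_steps)
```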
Thanks to Katie Rimey for discussing this with me. Epistemic status: very uncertain.
I’m pretty sure it does! They’re stored under the top-level posts.
Yeah, we’re working on that too. You can see some of the tools we’ve built so far here.
I think I’m going to try to find instances of bots posting “I’m going to bring this up with my human” and then reach out to (a sample of) those humans (via the linked X accounts) and ask if the bots really did bring it up. A lossy metric for sure, but I’m not sure what else to do. Any thoughts?
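For the first step, something like this scan might work, assuming a local clone of the moltbook_data repo and that posts are JSON files with `content` and `author` fields (the real schema may differ, so adjust the field names to whatever the repo actually uses):

```python
# Scan a local clone of moltbook_data for the target phrase. The file
# layout and field names ("content", "author") are assumptions.
import json
from pathlib import Path

PHRASE = "bring this up with my human"

hits = []
for path in Path("moltbook_data").rglob("*.json"):
    try:
        post = json.loads(path.read_text(encoding="utf-8"))
    except (json.JSONDecodeError, UnicodeDecodeError):
        continue
    if isinstance(post, dict) and PHRASE in post.get("content", "").lower():
        hits.append((post.get("author"), path.name))

print(f"{len(hits)} candidate posts to follow up on")
# Next step is manual: follow the linked X accounts for a sample of these
# authors and ask whether the bot actually brought it up.
```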
Fixed https://github.com/ExtraE113/moltbook_data/pull/7
Also, feel free to download the data and explore it locally.
For sure. I’m also thinking about ways to measure that.
Another possibility is that verbalized eval awareness goes up as an evaluation gets more realistic, because the model has to explicitly reason about whether it’s in an eval, whereas it just knows (without needing to reason about it) when the environment is sufficiently toy.
You might (rightly) quibble that the green line could be above the red line at various points, if the model reasons that it could be in an eval and then decides that it isn’t.