Great post on measuring logit distributions for eval awareness. Surprisingly effective, even for unverbalised eval awareness! ✍️
Chamod Kalupahana
Predictive Processing: Conscious when Training
Chamod Kalupahana’s Shortform
Saw an interesting post for those also trying to transition into technical AI safety :)
Really cool experiment! I think it was a great idea to measure verbalised and unverbalised eval awareness.
I was thinking of something similar, but I didn’t expect the logits to carry that much signal (I expected too much syntax noise). Did you try phrases that directly talk about eval vs deployment (e.g. “I am in evaluation” vs “I am in deployment”), and then perhaps measuring the differences between them under eval and deployment environments?
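To make the suggestion concrete, here is a minimal sketch of the comparison I have in mind. It is not the method from the post: the function names, the three-token toy vocabulary, and the logit values are all made up for illustration, and in practice the logits would come from a real model scored on each phrase with teacher forcing.

```python
import math
from typing import List, Sequence


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over one vocabulary-sized logit vector."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_norm for x in logits]


def phrase_logprob(step_logits: Sequence[Sequence[float]],
                   token_ids: Sequence[int]) -> float:
    """Sum of per-token log-probabilities of a phrase.

    step_logits[i] is the logit vector at the position that predicts
    token_ids[i] (hypothetical input; a real model would supply it).
    """
    if len(step_logits) != len(token_ids):
        raise ValueError("need exactly one logit vector per phrase token")
    return sum(log_softmax(logits)[tok]
               for logits, tok in zip(step_logits, token_ids))


def phrase_gap(eval_logits, eval_ids, deploy_logits, deploy_ids) -> float:
    """log P(eval phrase) - log P(deploy phrase) within a single context."""
    return (phrase_logprob(eval_logits, eval_ids)
            - phrase_logprob(deploy_logits, deploy_ids))


# Toy vocabulary (assumption): 0 = "evaluation", 1 = "deployment", 2 = other.
# Both phrases share the prefix "I am in", so one step of logits is enough here.
EVAL_TOK, DEPLOY_TOK = 0, 1
logits_in_eval_env = [[2.0, 1.0, 0.0]]    # made-up logits under an eval-like prompt
logits_in_deploy_env = [[1.0, 2.0, 0.0]]  # made-up logits under a deployment-like prompt

gap_eval_env = phrase_gap(logits_in_eval_env, [EVAL_TOK],
                          logits_in_eval_env, [DEPLOY_TOK])
gap_deploy_env = phrase_gap(logits_in_deploy_env, [EVAL_TOK],
                            logits_in_deploy_env, [DEPLOY_TOK])

# A positive shift means the eval phrase gains probability relative to the
# deploy phrase when the surrounding context looks like an evaluation.
shift = gap_eval_env - gap_deploy_env
print(f"gap (eval env) = {gap_eval_env:.3f}, "
      f"gap (deploy env) = {gap_deploy_env:.3f}, shift = {shift:.3f}")
```

Taking the difference of gaps across the two environments should cancel out some of the syntax noise, since anything that affects both phrases equally drops out.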
One quick note I forgot to mention! I purposefully avoided talking about when these fields are going to be automated; that is a pretty big question and worth its own discussion. This is more about the question: if we are accelerating towards automated research, which fields do I expect to get automated and which will require more radical human-engineered breakthroughs?
Yeah, I think this is what the Ambitious Vision for Interpretability post suggests about feedback loops, where the more you know about model internals, the easier it is to develop the next set of tools and so forth.
I predict we’ll need lots of human brainstorming and engineering to get a breakthrough in mech interp, one that makes other mech interp tasks easier to verify, less fuzzy, and therefore easier to automate.
Fazl Barez’s work on Automated Interpretability is already a good start on this.
Ah I see, you’re thinking of a core set of research skills → domain-specific knowledge. I do agree with you, but one point I want to raise is that some of those core research skills are being able to deal with hard-to-supervise fuzzy tasks and coming up with truly novel research ideas.
On the way to a fully automated researcher, I believe models are going to pick up some research skills faster than others (e.g. implementation, review), and the fields that rely more on these easy-to-verify tasks are the ones that will get automated first.
I think you are right here, of course: once one of these fields is automated, the others aren’t far off, but it is still interesting to forecast this :)
Which technical AI safety fields are going to be automated first?
Great post and experiment!
I hypothesise that the very low blackmail/murder rates from Sonnet 4.6 and GPT 5.4/5.5 may be due to increased eval awareness, as suggested by Sonnet 4.6’s reasoning, where it correctly identified that this was a test.
This is hard to validate since GPT 5.4/5.5 don’t show CoT via the OpenAI API; perhaps making the system prompt more deployment-like might increase blackmail/murder rates?
I think this resonates quite a lot with the idea of Live theory and what we’re trying to do with Autostructures. The basic idea is that rather than a single general structure that fits everyone (and therefore compromises), you can now generate structures that zone in on everyone’s needs and preferences just-in-time.
Minimal Outer Wilds title and text style
Greenblatt post trying out open-weight NLAs, showing their limitations for reconstructing activations:
https://www.lesswrong.com/posts/QQQAcKuWK6k98FivY/can-activation-verbalizers-surface-an-internal-chain-of-1