I research intelligence and its emergence and expression in neural networks to ensure advanced AI is safe and beneficial.
I’m currently a Research Scientist at UK AISI, where I train and interpret model organisms of misalignment, such as reward hacking, evaluation awareness, and sandbagging.
For more, check out my scholar profile and personal website.
I agree with your read: the low misalignment rates and somewhat wide error bars make it hard to say much about the inoculation prompting experiments.