I’m doing research and other work focused on AI safety/security, governance and risk reduction. Currently my top projects are (last updated Feb 26, 2025):
- Technical researcher for UC Berkeley at the AI Security Initiative, part of the Center for Long-Term Cybersecurity (CLTC) 
- Serving on the board of directors for AI Governance & Safety Canada 
General areas of interest for me are AI safety strategy, comparative AI alignment research, prioritizing technical alignment work, analyzing the published alignment plans of major AI labs, interpretability, deconfusion research and other AI safety-related topics.
Research that I’ve authored or co-authored:
- Steering Behaviour: Testing for (Non-)Myopia in Language Models 
- Interpretability’s Alignment-Solving Potential: Analysis of 7 Scenarios 
- (Scroll down to read other posts and comments I’ve written) 
Before getting into AI safety, I was a software engineer for 11 years at Google and various startups. You can find details about my previous work on my LinkedIn.
While I’m not always great at responding, I’m happy to connect with other researchers or people interested in AI alignment and effective altruism. Feel free to send me a private message!
I read through your paper Large Language Models Often Know When They Are Being Evaluated (the apparent output of OP’s research). This is solid progress on an important problem.
While the paper acknowledges that models could sandbag on capability evaluations and notes meta-reasoning in the results, there’s a deeper issue that deserves attention: models with evaluation awareness may be sandbagging on the evaluation awareness assessments themselves.
The paper finds that frontier models show “substantial, though not yet superhuman” evaluation awareness, with Gemini 2.5 Pro reaching 0.95 AUC on agentic tasks. You note that models sometimes recognize that being asked about evaluations is itself evidence of being evaluated. But this points to a fundamental challenge: a model sophisticated enough to recognize evaluations could deliberately score lower on situational awareness tests to avoid triggering safety thresholds.
This recursive problem was a key concern in our UC Berkeley paper on intolerable risk thresholds (arxiv.org/pdf/2503.05812, p. 41). Once models reach even moderate evaluation awareness (we propose ~60% accuracy at distinguishing evaluation vs deployment), they can strategically underperform on evaluations, including on evaluation awareness assessments themselves. Your findings suggest we may already be in this territory.
This doesn’t diminish the value of your benchmark—it provides essential baseline data. But it looks like we are already at the point where we need to also do things like:
Develop harder-to-game evaluation methods (e.g., white-box techniques examining internal representations)
Use conservative safety margins that assume some sandbagging may already be occurring
Corroborate behavioral tests with other evidence of capabilities (e.g., scaling laws, architectural features)
Have you considered approaches to detect or mitigate potential sandbagging on evaluation awareness tests themselves?
(Thanks to Matteo Pistillo for sharing your paper with me)