Evan R. Murphy

Karma: 1,194

I’m doing research and other work focused on AI safety/security, governance and risk reduction. Currently my top projects are (last updated Feb 26, 2025):

Technical researcher for UC Berkeley at the AI Security Initiative, part of the Center for Long-Term Cybersecurity (CLTC)
Serving on the board of directors for AI Governance & Safety Canada

General areas of interest for me are AI safety strategy, comparative AI alignment research, prioritizing technical alignment work, analyzing the published alignment plans of major AI labs, interpretability, deconfusion research and other AI safety-related topics.

Research that I’ve authored or co-authored:

See publications on Google Scholar
Steering Behaviour: Testing for (Non-)Myopia in Language Models
Interpretability’s Alignment-Solving Potential: Analysis of 7 Scenarios
(Scroll down to read other posts and comments I’ve written)

Before getting into AI safety, I was a software engineer for 11 years at Google and various startups. You can find details about my previous work on my LinkedIn.

While I’m not always great at responding, I’m happy to connect with other researchers or people interested in AI alignment and effective altruism. Feel free to send me a private message!

AI Risk: Can We Thread the Needle? [Recorded Talk from EA Summit Vancouver ’25]

Evan R. Murphy2 Oct 2025 19:08 UTC

6 points

0 comments2 min readLW link

Evan R. Murphy 26 Sep 2025 18:34 UTC
7 points
0
on: Do models know when they are being evaluated?
I read through your paper Large Language Models Often Know When They Are Being Evaluated (the apparent output of OP’s research). This is solid progress on an important problem.
While the paper acknowledges that models could sandbag on capability evaluations and notes meta-reasoning in the results, there’s a deeper issue that deserves attention: models with evaluation awareness may be sandbagging on the evaluation awareness assessments themselves.
The paper finds that frontier models show “substantial, though not yet superhuman” evaluation awareness, with Gemini 2.5 Pro reaching 0.95 AUC on agentic tasks. You note that models sometimes recognize that being asked about evaluations is itself evidence of being evaluated. But this points to a fundamental challenge: a model sophisticated enough to recognize evaluations could deliberately score lower on situational awareness tests to avoid triggering safety thresholds.
This recursive problem was a key concern in our UC Berkeley paper on intolerable risk thresholds (arxiv.org/pdf/2503.05812, p. 41). Once models reach even moderate evaluation awareness (we propose ~60% accuracy at distinguishing evaluation vs deployment), they can strategically underperform on evaluations, including on evaluation awareness assessments themselves. Your findings suggest we may already be in this territory.
This doesn’t diminish the value of your benchmark—it provides essential baseline data. But it looks like we are already at the point where we need to also do things like:
1. Develop harder-to-game evaluation methods (e.g., white-box techniques examining internal representations)
2. Use conservative safety margins that assume some sandbagging may already be occurring
3. Corroborate behavioral tests with other evidence of capabilities (e.g., scaling laws, architectural features)
Have you considered approaches to detect or mitigate potential sandbagging on evaluation awareness tests themselves?

(Thanks to Matteo Pistillo for sharing your paper with me)

Evan R. Murphy 23 Sep 2025 18:08 UTC
3 points
0
on: Tristan’s Projects
OISs
Idea:
There is a paradigm missing from the discussion of ASI misalignment existential risk. The threat of ASI generalizes to the concept of “Outcome Influencing Systems” (OISs). My hope is that developing terminology and formalism around this model may mitigate the issues associated with existing terminology and aid in more productive discourse and interdisciplinary research applicable to ASI risk and social coordination issues.
Links:
- WIP document to become main LW post introducing OISs (comments welcome)
My involvement:
I am currently the only contributor. I think the idea has merit, but I am still at the point where I am seeking either to find collaborators and spread the idea, or to find people who can point out enough flaws in the idea for it to be worth abandoning.
Question about the OIS research direction: What sort of principles might you hope to learn about outcome influencing systems which could help with the problems of ASI?
It seems to me that the problem isn’t OIS generally. We have plenty of safe/aligned OISs. This includes some examples from your OIS doc such as thermostats. Even a lot of non-AI software systems seem like safe/aligned OISs. The problem seems to be more with this specific type of OIS (powerful deep-learning based models) which are both much more capable and much more difficult to ensure the safe behavior of, compared to software with legible code. (I read your doc quickly so may have missed an answer about this which you already provide in there.)
--
Also, thanks for putting forth a novel research direction for addressing AI risk. I think we should have a high bar for which research directions we put scarce resources into, but at the same time it’s super valuable to propose new research directions. What we are looking for could just be low-hanging fruit in the unexplored idea space, overlooked by all the people who are working on more established stuff.

Evan R. Murphy 15 Sep 2025 16:08 UTC
2 points
0
in reply to: Ben Pace’s comment on: Open Global Investment as a Governance Model for AGI
To help with the incentives and coordination, instead having the first frontier AI megaowner step forward and unconditionally relinquish some of their power, they could sign on to a conditional contract to do so. It would only activate if other megaowners did the same.

Evan R. Murphy 15 Sep 2025 2:59 UTC
2 points
0
in reply to: Ben Pace’s comment on: Open Global Investment as a Governance Model for AGI
Ok yes, that would be great.
I’ll point out that this is not unheard of, Altman literally took no equity in OpenAI (though IMO was eventually corrupted by the power nonetheless).
He may have been corrupted later by power later. Alternatively, he may have been playing the long game, knowing that he would have that power eventually even if he took no equity.

Evan R. Murphy 15 Sep 2025 0:33 UTC
2 points
0
in reply to: Ben Pace’s comment on: Open Global Investment as a Governance Model for AGI
Couldn’t we just… set up a financial agreement where the first N employees don’t own stock and have a set salary?
Maybe, could be nice… But since the first N employees usually get to sign off on major decisions, why would they go along with such an agreement? Or are you suggesting governments should convene to force this sort of arrangement on them?
My main concern is that they’ll have enough power to be functionally wealthy all-the-same, or be able to get it via other means (e.g. Altman with his side hardware investment / company).
I’m not sure I understand this part actually, could you elaborate? Is this your concern with the OGI model or with your salary-only for first-N employees idea?

Evan R. Murphy 14 Sep 2025 23:40 UTC
2 points
0
in reply to: Wei Dai’s comment on: Open Global Investment as a Governance Model for AGI
Appreciate your integrity in doing that!
At the same time, the unfairness of early frontier lab founders getting rich seems to me like a very acceptable downside given that open investment could solve a lot of issues and the bleakness of many other paths forward.

Evan R. Murphy 14 Sep 2025 22:49 UTC
3 points
−2
on: Evan R. Murphy’s Shortform
Outer alignment seems not as hard as we thought a few years ago. Llms are actually really good at understanding what we mean, so the sorcerer’s apprentice and King Midas concerns seem obsolete. Except maybe for systems using heavy RL, where specification gaming is still a concern.
The more salient outer alignment issue now how to align agents when you don’t have time or enough capability yourself to supervise them well. And that’s mainly only a problem because of competitive race dynamics, which incentivize people to try and supervise AI ‘beyond their means’ so to speak.
So, cooling race dynamics could address a main portion of the remaining outer alignment problem. Scalable oversight techniques may also address it. What would remain then for (narrow) alignment is specification gaming, and then of course the whole inner alignment problem including deceptive alignment which is still a huge unsolved problem.

Evan R. Murphy 11 Jun 2025 21:06 UTC
2 points
0
in reply to: Evan R. Murphy’s comment on: Evan R. Murphy’s Shortform
Starting to be some discussion on LW now, e.g.
https://www.lesswrong.com/posts/5uw26uDdFbFQgKzih/beware-general-claims-about-generalizable-reasoning
https://www.lesswrong.com/posts/tnc7YZdfGXbhoxkwj/give-me-a-reason-ing-model

Evan R. Murphy 11 Jun 2025 19:10 UTC
2 points
0
in reply to: Evan R. Murphy’s comment on: Evan R. Murphy’s Shortform
I should have mentioned the above thoughts are a low-confidence take. I was mostly just trying to get the ball rolling on discussion because I couldn’t find any discussion of this paper on LessWrong yet, which really surprised me because I saw the paper had been shared thousands of times on LinkedIn already.

Evan R. Murphy 10 Jun 2025 20:35 UTC
2 points
−4
on: Evan R. Murphy’s Shortform
Thoughts on “The Ilusion of Thinking” paper that came out of Apple recently?
https://ml-site.cdn-apple.com/papers/the-illusion-of-thinking.pdf
Seems to me like at least a point in favor of “stochastic parrots” over “builds a quality world model” for the language reasoning models.

Also wondering if their findings could be used to the advantage of safety/security somehow. E.g. if these models are more dependent on imitating examples than we relaized, then it might also be more effective than we previously thought to purge training data of the types of knowledge and reasoning that we don’t want them to have (e.g. knowledge of dangerous weapons development, scheming, etc.)

Evan R. Murphy 28 May 2025 4:49 UTC
3 points
1
in reply to: RobertM’s comment on: Interpretability Will Not Reliably Find Deceptive AI
I agree it’s a good post, and it does take guts to tell people when you think that a research direction that you’ve been championing hard actually isn’t the Holy Grail. This is a bit of a nitpick but not insubstantial:
Neel is talking about interpretability in general, not just mech-interp. He claims to be accounting in his predictions for other non-mech interp approaches to interpretability that seem promising to some other researchers, such as representation engineering (RepE), which Dan Hendrycks among others has been advocating for recently.

Evan R. Murphy 26 May 2025 18:21 UTC
3 points
0
in reply to: E.G. Blee-Goldman’s comment on: E.G. Blee-Goldman’s Shortform
Let me know if anyone has thoughts on this question I just posted as well: Does the Universal Geometry of Embeddings paper have big implications for interpretability?

[Question] Does the Universal Geometry of Embeddings paper have big implications for interpretability?

Evan R. Murphy26 May 2025 18:20 UTC

43 points

6 comments1 min readLW link

Evan R. Murphy 20 May 2025 20:53 UTC
2 points
0
on: Interpretability Will Not Reliably Find Deceptive AI
Does representation engineering (RepE) seem like a game-changer for interpretability? I don’t see it mentioned in your post, so I’m trying to figure out if it is baked into your predictions or not.
It seemed like Apollo was able to spin up a pretty reliable strategic deception detector (95-99% accurate) using linear probes even though the techniques are new, and generally it sounds like RepE is getting traction on some things that have been a slog for mech interp. Does it look plausible that RepE could get us to high reliability interpretability on workable timelines or are we likely to hit similar walls with that approach?
Thanks for your post Neel (and Gemini 2.5) - really important perspective on all this.

Evan R. Murphy 28 Feb 2025 18:38 UTC
3 points
0
in reply to: Evan R. Murphy’s comment on: Evan R. Murphy’s Shortform
“AI governance looking bleak” seems like an overstatement. Certain types or aims of AI governance are looking bleak right now, especially getting strong safety-oriented international agreements that include the US and China, or meaningful AI regulation at the national level in the US. But there may be other sorts of AI governance projects (e.g. improving the policies of frontier labs, preparing for warning shots, etc.) that could still be quite worthwhile.

Evan R. Murphy 28 Feb 2025 2:39 UTC
4 points
0
on: How to Make Superbabies
Is there a summary of this post?

Evan R. Murphy’s Shortform

Evan R. Murphy28 Feb 2025 0:56 UTC

6 points

6 comments1 min readLW link

Evan R. Murphy 28 Feb 2025 0:56 UTC
9 points
2
on: Evan R. Murphy’s Shortform
2023: AI governance starting to look promising because governments are waking up about AI risks. Technical AI safety getting challenging if you’re not in a frontier lab because hard to access relevant models to run experiments.
2025: AI governance looking bleak after the AI Action Summit. Technical AI safety looking more accessible because open-weight models are proliferating.

Evan R. Murphy 26 Feb 2025 20:31 UTC
2 points
−2
in reply to: bhauth’s comment on: Detecting Strategic Deception Using Linear Probes
It might.
My understanding (which could be off base) from reading the paper is the method’s accuracy in detecting various forms of deception was basically 96-99%. But they acknowledge that the sophisticated deception they’re ultimately worried about will be harder to detect.
Still 96-99% seems like a great start. And this was on detecting strategic deception, not just factual falsehoods. And they didn’t even utilize the CoT outputs of the models.
(I think the “strategic deception” framing is also probably more general and not as dependent on unnecessary assumptions about how models work, compared to the “mesaoptimizer” framing.)

Evan R. Murphy

AI Risk: Can We Thread the Nee­dle? [Recorded Talk from EA Sum­mit Van­cou­ver ’25]

OISs

[Question] Does the Univer­sal Geom­e­try of Embed­dings pa­per have big im­pli­ca­tions for in­ter­pretabil­ity?

Evan R. Mur­phy’s Shortform

AI Risk: Can We Thread the Needle? [Recorded Talk from EA Summit Vancouver ’25]

[Question] Does the Universal Geometry of Embeddings paper have big implications for interpretability?

Evan R. Murphy’s Shortform