Evan R. Murphy

Karma: 1,109

I’m doing research and other work focused on AI safety and AI catastrophic risk reduction. Currently my top projects are (last updated May 19, 2023):

Serving on the board of directors for AI Governance & Safety Canada
Technical research assistance for Tony Barrett and collaborators on developing an AI risk management-standards profile for increasingly multi- or general-purpose AI, designed to be used in conjunction with the NIST AI RMF or the AI risk management standard ISO/IEC 23894

General areas of interest for me are AI safety strategy, comparative AI alignment research, prioritizing technical alignment work, analyzing the published alignment plans of major AI labs, interpretability, the Conditioning Predictive Models agenda, deconfusion research and other AI safety-related topics. My work is currently self-funded.

Research that I’ve authored or co-authored:

Steering Behaviour: Testing for (Non-)Myopia in Language Models
Interpretability’s Alignment-Solving Potential: Analysis of 7 Scenarios
(Scroll down to read other posts and comments I’ve written)

Other recent work:

Running a regular coworking meetup in Vancouver, BC for people interested in AI safety and effective altruism
Facilitator for the AI Safety Fellowship (2022) at Columbia University Effective Altruism
Gave a talk on myopia and deceptive alignment at an AI safety event hosted by University of Victoria (Jan 29, 2023)
Invited participant in the CLTC UC Berkeley Virtual Workshops on the “Risk Management-Standards Profile for Increasingly Multi- or General-Purpose AI” (Jan 2023 and May 2023)
Reviewed early pre-published drafts of work by other researchers:
- Conditioning Predictive Models: Risks and Strategies by Evan Hubinger, Adam Jermyn, Johannes Treutlein, Rubi Hudson and Kate Woolverton
- Circumventing interpretability: How to defeat mind-readers by Lee Sharkey
- Actionable Guidance for High-Consequence AI Risk Management: Towards Standards Addressing AI Catastrophic Risks by Tony Barrett, Dan Hendrycks, Jessica Newman and Brandie Nonnecke
- AI Safety Seems Hard to Measure by Holden Karnofsky
- Racing through a minefield: the AI deployment problem by Holden Karnofsky
- Alignment with argument-networks and assessment-predictions by Tor Økland Barstad
- Interpreting Neural Networks through the Polytope Lens by Sid Black et al.
- Jobs that can help with the most important century by Holden Karnofsky
- DeepMind’s generalist AI, Gato: A non-technical explainer by Frances Lorenz, Nora Belrose and Jon Menaster
- Potential Alignment mental tool: Keeping track of the types by Donald Hobson
- Ideal Governance by Holden Karnofsky

Before getting into AI safety, I was a software engineer for 11 years at Google and various startups. You can find details about my previous work on my LinkedIn.

I’m always happy to connect with other researchers or people interested in AI alignment and effective altruism. Feel free to send me a private message!

Interpretability’s Alignment-Solving Potential: Analysis of 7 Scenarios

Evan R. Murphy 12 May 2022 20:01 UTC

53 points

0 comments · 59 min read · LW link

Paper: Large Language Models Can Self-improve [Linkpost]

Evan R. Murphy 2 Oct 2022 1:29 UTC

52 points

14 comments · 1 min read · LW link

(openreview.net)

Steering Behaviour: Testing for (Non-)Myopia in Language Models

Evan R. Murphy and Megan Kinniment

5 Dec 2022 20:28 UTC

40 points

19 comments · 10 min read · LW link

New US Senate Bill on X-Risk Mitigation [Linkpost]

Evan R. Murphy 4 Jul 2022 1:25 UTC

35 points

12 comments · 1 min read · LW link

(www.hsgac.senate.gov)

Promising posts on AF that have fallen through the cracks

Evan R. Murphy 4 Jan 2022 15:39 UTC

34 points

6 comments · 2 min read · LW link

Evan R. Murphy 17 Feb 2023 4:28 UTC
29 points
in reply to: gwern’s comment on: Bing Chat is blatantly, aggressively misaligned
This may also explain why Sydney seems so bloodthirsty and vicious in retaliating against any ‘hacking’ or threat to her, if Anthropic is right about larger better models exhibiting more power-seeking & self-preservation: you would expect a GPT-4 model to exhibit that the most out of all models to date!
Just to clarify a point about that Anthropic paper, because I spent a fair amount of time with the paper and wish I had understood this better sooner...
I don’t think it’s right to say that Anthropic’s “Discovering Language Model Behaviors with Model-Written Evaluations” paper shows that larger LLMs necessarily exhibit more power-seeking and self-preservation. It only showed that when language models that are larger or have more RLHF training are simulating an “Assistant” character they exhibit more of these behaviours. It may still be possible to harness the larger model capabilities without invoking character simulation and these problems, by prompting or fine-tuning the models in some particular careful ways.
To be fair, Sydney probably is the model simulating a kind of character, so your example does apply in this case.
(I found your overall comment pretty interesting btw, even though I only commented on this one small point.)

Evan R. Murphy 21 Jun 2022 23:50 UTC
29 points
on: The inordinately slow spread of good AGI conversations in ML
Another reason the broader ML field may be reluctant to discuss AGI is a cultural shift in the field that happened after the AI winters. I’m quoting part of A Bird’s Eye View of the ML Field [Pragmatic AI Safety #2] where I first saw this idea:
AI winter made it less acceptable to talk about AGI specifically, and people don’t like people talking about capabilities making it closer. Discussions of AGI are not respectable, unlike in physics where talking about weirder long-term things and extrapolating several orders of magnitude is normal. AGI is a bit more like talking about nuclear fusion, which has a long history of overpromises. In industry it has become somewhat more acceptable to mention AGI than in academia: for instance, Sam Altman recently tweeted “AGI is gonna be wild” and Yann LeCun has recently discussed the path to human-level AI.

In general, the aversion to discussing AGI makes discussing risks from AGI a tough sell.
Anyway, I think your post (“The inordinately slow spread...”) is good. Figuring out how to get the broader ML community to talk more explicitly about AGI and care more about AGI x-risk would be a huge win.

Evan R. Murphy 21 Jul 2022 21:14 UTC
LW: 28 AF: 13
on: Non-Members can now submit Alignment Forum content for review [Welcome & FAQ!]
How do I get started in AI Alignment research?
If you’re new to the AI Alignment research field, we recommend four great introductory sequences that cover several different paradigms of thought within the field. Get started reading them and feel free to leave comments with any questions you have.
The introductory sequences are:
- Embedded Agency by Scott Garrabrant and Abram Demski of MIRI
- Iterated Amplification by Paul Christiano of ARC
- Value Learning by Rohin Shah of DeepMind
- AGI Safety from First Principles by Richard Ngo, formerly of DeepMind
Following that, you might want to begin writing up some of your thoughts and sharing them on LessWrong to get feedback.
I think it would be great to update this section. For example, it could link to the AGI Safety Fundamentals curriculum which has a wealth of valuable readings not on this list. And there are other courses that it would be good for newcomers to know about as well, such as MLAB.
Why am I suggesting this? This FAQ was the first place I found with clear advice when I was first getting interested in AI alignment in late 2021, and I took it quite seriously/literally. The very first alignment research I tried to read was the illustrated Embedded Agency sequence, because that was at the top of the above list. While I later came to appreciate Embedded Agency, I found this sequence (particularly the illustrated version, which features prominently in the link above, as opposed to the text version) to be a confusing introduction to alignment. I also wasn’t immediately aware that there was anything important to read outside of the 4 texts linked above, while I now feel like there’s a lot!
It’s just one data point of user testing on this FAQ, but something to consider.

Evan R. Murphy 13 Jun 2022 21:46 UTC
28 points
on: A claim that Google’s LaMDA is sentient
There is a part in Human Compatible where Stuart Russell says there should be norms or regulations against creating a robot that looks realistically human. The idea was that humans have strong cognitive biases to think about and treat entities which look human in certain ways. It could be traumatic for humans to know a human-like robot and then e.g. learn that it was shut down and disassembled.

The LaMDA interview demonstrates to me that there are similar issues with having a conversational AI claim that it is sentient and has feelings, emotions etc. It feels wrong to disregard an entity which makes such claims, even though it is no more likely to be sentient than a similar AI which didn’t make such claims.

Google AI integrates PaLM with robotics: SayCan update [Linkpost]

Evan R. Murphy 24 Aug 2022 20:54 UTC

25 points

0 comments · 1 min read · LW link

(sites.research.google)

Action: Help expand funding for AI Safety by coordinating on NSF response

Evan R. Murphy 19 Jan 2022 22:47 UTC

23 points

8 comments · 3 min read · LW link

Evan R. Murphy 9 Apr 2022 22:57 UTC
23 points
on: A concrete bet offer to those with short AI timelines
I am also willing to take your bet for 2030.
I would propose one additional condition: If there is evidence of a deliberate or coordinated slowdown on AGI development by the major labs, then the bet is voided. I don’t expect there will be such a slowdown, but I’d rather not be invested in it not happening.
What links here?
- Evan R. Murphy's comment on Conceding a short timelines bet early by Matthew Barnett (17 Mar 2023 1:41 UTC; 7 points)

Evan R. Murphy 8 Apr 2022 20:22 UTC
22 points
in reply to: aogara’s comment on: It’s time for EA leadership to pull the fast-takeoff fire alarm.
- My comment EA has unusual beliefs about AI timelines and Ozzie Gooen’s reply
Pulling from those comments, you said:
Nobody I have ever met outside of the EA sphere seriously believes that superintelligent computer systems could take over the world within decades.
A lot of prominent scientists, technologists and intellectuals outside of EA have warned about advanced artificial intelligence too. Stephen Hawking, Elon Musk, Bill Gates, Sam Harris, everyone on this open letter back in 2015 etc.
I agree that the number of people really concerned about this is strikingly small given the emphasis longtermist EAs put on it. But I think these many counter-examples warn us that it’s not just EAs and the AGI labs being overconfident or out of left field.
I know you said you don’t have time to fully debate this. This seemed to be one of the cruxes of your first bullet point though. So if your skepticism about short timelines is driven in a big way by thinking that no credible person outside EA or companies invested in AI thinks this is plausible, then I am curious what you make of this.

Evan R. Murphy 29 Mar 2023 6:20 UTC
19 points
in reply to: Ben Pace’s comment on: FLI open letter: Pause giant AI experiments
The letter says to pause for at least 6 months, not exactly 6 months.
So anyone who doesn’t believe that protocols exist to ensure the safety of more capable AI systems shouldn’t avoid signing the letter for that reason, because the letter can be interpreted as supporting an indefinite pause in that case.

Evan R. Murphy 18 Sep 2022 5:58 UTC
LW: 19 AF: 7
on: Prize and fast track to alignment research at ALTER
50,000 USD, to be awarded for the best substantial contribution to the learning-theoretic AI alignment research agenda among those submitted before October 1, 2023
I like how you posted this so far in advance of the deadline (over 1 year).
Some contests and prizes that have been posted here in the past have had a pretty tight turnaround. By the time I learned about them and became interested in participating (not necessarily the first time I heard about them), their deadlines had already passed.

Surprised by ELK report’s counterexample to Debate, IDA

Evan R. Murphy 4 Aug 2022 2:12 UTC

18 points

0 comments · 5 min read · LW link

Evan R. Murphy 12 Apr 2022 7:28 UTC
18 points
on: The Regulatory Option: A response to near 0% survival odds
This is a clever idea, using what’s often considered a bug of regulation as a feature instead. I need to think on this, not sure whether I think it’s a good approach or not yet...
Simply put, AI Alignment has failed.
I do think this is an overstatement. There’s no misaligned AGI yet that I’m aware of, so how has alignment failed? I agree with the thrust of what you were saying, though: it feels needlessly risky to bet everything on the technical alignment lever when the governance & strategy lever is available too.

Evan R. Murphy 9 Apr 2022 6:44 UTC
18 points
on: It’s time for EA leadership to pull the fast-takeoff fire alarm.
A couple more thoughts on this post which I’ve spent a lot of today thinking about and discussing with folks:
1. This post was good for generating a lot of discussion and engagement on the topic, but it’d be great to have some more careful, thorough systematic analysis of the arguments and implications presented. This post seems to be arguing for short timelines and at least a medium-fast takeoff (which I tend to agree with), but then it argues for mass advocacy as a result.
  
  This is the opposite of the kind of intervention that makes sense to me and that Holden argues for in this kind of takeoff scenario: ‘Faster and less multipolar takeoff dynamics tend to imply that we should focus on very “direct” interventions aimed at helping transformative AI go well: working on the alignment problem in advance, caring a lot about the cultures and practices of AI labs and governments that might lead the way on transformative AI, etc.’ This is from his Important, actionable questions for the most important century doc.
  
  A more careful and complete analysis should be framed in answer to the ‘Questions about AI “takeoff dynamics”’ from that doc by someone who can commit the time and thought to it.
2. While we should look for strong evidence and strong arguments about timelines and takeoff, we shouldn’t be surprised not to be able to arrive at consensus about it. Given that this post is about pulling the “fire alarm” I’m kind of surprised no one here has linked yet to MIRI’s very aptly titled There’s No Fire Alarm for Artificial General Intelligence:
  
  “When I observe that there’s no fire alarm for AGI, I’m not saying that there’s no possible equivalent of smoke appearing from under a door. What I’m saying rather is that the smoke under the door is always going to be arguable; it is not going to be a clear and undeniable and absolute sign of fire; and so there is never going to be a fire alarm producing common knowledge that action is now due and socially acceptable. [...] There is never going to be a time before the end when you can look around nervously, and see that it is now clearly common knowledge that you can talk about AGI being imminent, and take action and exit the building in an orderly fashion, without fear of looking stupid or frightened.”

Introduction to the sequence: Interpretability Research for the Most Important Century

Evan R. Murphy 12 May 2022 19:59 UTC

16 points

0 comments · 8 min read · LW link

Evan R. Murphy 26 Oct 2021 0:44 UTC
LW: 16 AF: 5
on: Preface to the sequence on iterated amplification
Is Iterated Amplification still a current alignment paradigm that’s being pursued?
I found this sequence through the FAQ under How do I get started in AI Alignment research?. I’ve really enjoyed reading the first few articles, but then I noticed a lot of the articles are from 2018. I found this Mar 2021 article also by Paul Christiano which makes it sound like he found some issues with Iterated Amplification and moved on to a different paradigm called Imitative Generalization.
