Evan R. Murphy

Karma: 1,109

I’m doing research and other work focused on AI safety and AI catastrophic risk reduction. Currently my top projects are (last updated May 19, 2023):

Serving on the board of directors for AI Governance & Safety Canada
Technical research assistance for Tony Barrett and collaborators on developing an AI risk management-standards profile for increasingly multi- or general-purpose AI, designed to be used in conjunction with the NIST AI RMF or the AI risk management standard ISO/IEC 23894

General areas of interest for me are AI safety strategy, comparative AI alignment research, prioritizing technical alignment work, analyzing the published alignment plans of major AI labs, interpretability, the Conditioning Predictive Models agenda, deconfusion research and other AI safety-related topics. My work is currently self-funded.

Research that I’ve authored or co-authored:

Steering Behaviour: Testing for (Non-)Myopia in Language Models
Interpretability’s Alignment-Solving Potential: Analysis of 7 Scenarios
(Scroll down to read other posts and comments I’ve written)

Other recent work:

Running a regular coworking meetup in Vancouver, BC for people interested in AI safety and effective altruism
Facilitator for the AI Safety Fellowship (2022) at Columbia University Effective Altruism
Gave a talk on myopia and deceptive alignment at an AI safety event hosted by University of Victoria (Jan 29, 2023)
Invited/participated in the CLTC UC Berkeley Virtual Workshops on the “Risk Management-Standards Profile for Increasingly Multi- or General-Purpose AI” (Jan 2023 and May 2023)
Reviewed early pre-published drafts of work by other researchers:
- Conditioning Predictive Models: Risks and Strategies by Evan Hubinger, Adam Jermyn, Johannes Treutlein, Rubi Hudson and Kate Woolverton
- Circumventing interpretability: How to defeat mind-readers by Lee Sharkey
- Actionable Guidance for High-Consequence AI Risk Management: Towards Standards Addressing AI Catastrophic Risks by Tony Barrett, Dan Hendrycks, Jessica Newman and Brandie Nonnecke
- AI Safety Seems Hard to Measure by Holden Karnofsky
- Racing through a minefield: the AI deployment problem by Holden Karnofsky
- Alignment with argument-networks and assessment-predictions by Tor Økland Barstad
- Interpreting Neural Networks through the Polytope Lens by Sid Black et al.
- Jobs that can help with the most important century by Holden Karnofsky
- DeepMind’s generalist AI, Gato: A non-technical explainer by Frances Lorenz, Nora Belrose and Jon Menaster
- Potential Alignment mental tool: Keeping track of the types by Donald Hobson
- Ideal Governance by Holden Karnofsky

Before getting into AI safety, I was a software engineer for 11 years at Google and various startups. You can find details about my previous work on my LinkedIn.

I’m always happy to connect with other researchers or people interested in AI alignment and effective altruism. Feel free to send me a private message!

Evan R. Murphy 17 Feb 2023 4:28 UTC
29 points
12
in reply to: gwern’s comment on: Bing Chat is blatantly, aggressively misaligned
This may also explain why Sydney seems so bloodthirsty and vicious in retaliating against any ‘hacking’ or threat to her, if Anthropic is right about larger better models exhibiting more power-seeking & self-preservation: you would expect a GPT-4 model to exhibit that the most out of all models to date!
Just to clarify a point about that Anthropic paper, because I spent a fair amount of time with the paper and wish I had understood this better sooner...
I don’t think it’s right to say that Anthropic’s “Discovering Language Model Behaviors with Model-Written Evaluations” paper shows that larger LLMs necessarily exhibit more power-seeking and self-preservation. It only showed that when language models that are larger or have more RLHF training are simulating an “Assistant” character they exhibit more of these behaviours. It may still be possible to harness the larger model capabilities without invoking character simulation and these problems, by prompting or fine-tuning the models in some particular careful ways.
To be fair, Sydney probably is the model simulating a kind of character, so your example does apply in this case.
(I found your overall comment pretty interesting btw, even though I only commented on this one small point.)

Evan R. Murphy 21 Jun 2022 23:50 UTC
29 points
11
on: The inordinately slow spread of good AGI conversations in ML
Another reason the broader ML field may be reluctant to discuss AGI is a cultural shift in the field that happened after the AI winters. I’m quoting part of A Bird’s Eye View of the ML Field [Pragmatic AI Safety #2] where I first saw this idea:
AI winter made it less acceptable to talk about AGI specifically, and people don’t like people talking about capabilities making it closer. Discussions of AGI are not respectable, unlike in physics where talking about weirder long-term things and extrapolating several orders of magnitude is normal. AGI is a bit more like talking about nuclear fusion, which has a long history of overpromises. In industry it has become somewhat more acceptable to mention AGI than in academia: for instance, Sam Altman recently tweeted “AGI is gonna be wild” and Yann LeCun has recently discussed the path to human-level AI.

In general, the aversion to discussing AGI makes discussing risks from AGI a tough sell.
Anyway, I think your post (“The inordinately slow spread...”) is good. Figuring out how to get the broader ML community to talk more explicitly about AGI and care more about AGI x-risk would be a huge win.

Evan R. Murphy 21 Jul 2022 21:14 UTC
LW: 28 AF: 13
0
AF
on: Non-Members can now submit Alignment Forum content for review [Welcome & FAQ!]
How do I get started in AI Alignment research?
If you’re new to the AI Alignment research field, we recommend four great introductory sequences that cover several different paradigms of thought within the field. Get started reading them and feel free to leave comments with any questions you have.
The introductory sequences are:
- Embedded Agency by Scott Garrabrant and Abram Demski of MIRI
- Iterated Amplification by Paul Christiano of ARC
- Value Learning by Rohin Shah of DeepMind
- AGI Safety from First Principles by Richard Ngo, formerly of DeepMind
Following that, you might want to begin writing up some of your thoughts and sharing them on LessWrong to get feedback.
I think it would be great to update this section. For example, it could link to the AGI Safety Fundamentals curriculum which has a wealth of valuable readings not on this list. And there are other courses that it would be good for newcomers to know about as well, such as MLAB.
Why am I suggesting this? This FAQ was the first place I found with clear advice when I was first getting interested in AI alignment in late 2021, and I took it quite seriously/literally. The very first alignment research I tried to read was the illustrated Embedded Agency sequence, because that was at the top of the above list. While I came to later appreciate Embedded Agency, I found this sequence (particularly the illustrated version which features prominently in the link above, as opposed to the text version) to be a confusing introduction to alignment. I also wasn’t immediately aware of anything important there was to read outside of the 4 texts linked above, while I now feel like there’s a lot!
It’s just one data point of user testing on this FAQ, but something to consider.

Evan R. Murphy 13 Jun 2022 21:46 UTC
28 points
on: A claim that Google’s LaMDA is sentient
There is a part in Human Compatible where Stuart Russell says there should be norms or regulations against creating a robot that looks realistically human. The idea was that humans have strong cognitive biases to think about and treat entities which look human in certain ways. It could be traumatic for humans to know a human-like robot and then e.g. learn that it was shut down and disassembled.

The LaMDA interview demonstrates to me that there are similar issues with having a conversational AI claim that it is sentient and has feelings, emotions etc. It feels wrong to disregard an entity which makes such claims, even though it is no more likely to be sentient than a similar AI which didn’t make such claims.

Evan R. Murphy 9 Apr 2022 22:57 UTC
23 points
on: A concrete bet offer to those with short AI timelines
I am also willing to take your bet for 2030.
I would propose one additional condition: If there is evidence of a deliberate or coordinated slowdown on AGI development by the major labs, then the bet is voided. I don’t expect there will be such a slowdown, but I’d rather not be invested in it not happening.
What links here?
- Evan R. Murphy's comment on Conceding a short timelines bet early by Matthew Barnett (17 Mar 2023 1:41 UTC; 7 points)

Evan R. Murphy 8 Apr 2022 20:22 UTC
22 points
in reply to: aogara’s comment on: It’s time for EA leadership to pull the fast-takeoff fire alarm.
- My comment EA has unusual beliefs about AI timelines and Ozzie Gooen’s reply
Pulling from those comments, you said:
Nobody I have ever met outside of the EA sphere seriously believes that superintelligent computer systems could take over the world within decades.
A lot of prominent scientists, technologists and intellectuals outside of EA have warned about advanced artificial intelligence too. Stephen Hawking, Elon Musk, Bill Gates, Sam Harris, everyone on this open letter back in 2015 etc.
I agree that the number of people really concerned about this is strikingly small given the emphasis longtermist EAs put on it. But I think these many counter-examples warn us that it’s not just EAs and the AGI labs being overconfident or out of left field.
I know you said you don’t have time to fully debate this. This seemed to be one of the cruxes of your first bullet point though. So if your skepticism about short timelines is driven in a big way by thinking that no credible person outside EA or companies invested in AI thinks this is plausible, then I am curious what you make of this.

Evan R. Murphy 29 Mar 2023 6:20 UTC
19 points
5
in reply to: Ben Pace’s comment on: FLI open letter: Pause giant AI experiments
The letter says to pause for at least 6 months, not exactly 6 months.
So anyone who doesn’t believe that protocols exist to ensure the safety of more capable AI systems shouldn’t avoid signing the letter for that reason, because the letter can be interpreted as supporting an indefinite pause in that case.

Evan R. Murphy 18 Sep 2022 5:58 UTC
LW: 19 AF: 7
9
AF
on: Prize and fast track to alignment research at ALTER
50,000 USD, to be awarded for the best substantial contribution to the learning-theoretic AI alignment research agenda among those submitted before October 1, 2023
I like how you posted this so far in advance of the deadline (over 1 year).
Some contests and prizes that have been posted here in the past have had a pretty tight turnaround. By the time I learned about them and became interested in participating (not necessarily the first time I heard about them), their deadlines had already passed.

Evan R. Murphy 12 Apr 2022 7:28 UTC
18 points
on: The Regulatory Option: A response to near 0% survival odds
This is a clever idea, using what’s often considered a bug of regulation as a feature instead. I need to think on this, not sure whether I think it’s a good approach or not yet...
Simply put, AI Alignment has failed.
I do think this is an overstatement. There’s no misaligned AGI yet that I’m aware of, so how has alignment failed? I agree with the thrust of what you were saying though, that it feels needlessly risky to bet everything on the technical alignment lever when the governance & strategy lever is available too.

Evan R. Murphy 9 Apr 2022 6:44 UTC
18 points
on: It’s time for EA leadership to pull the fast-takeoff fire alarm.
A couple more thoughts on this post which I’ve spent a lot of today thinking about and discussing with folks:
1. This post was good for generating a lot of discussion and engagement on the topic, but it’d be great to have some more careful, thorough systematic analysis of the arguments and implications presented. This post seems to be arguing for short timelines and at least a medium-fast takeoff (which I tend to agree with), but then it argues for mass advocacy as a result.
  
  This is the opposite of the kind of intervention that makes sense to me and that Holden argues for in this kind of takeoff scenario: ‘Faster and less multipolar takeoff dynamics tend to imply that we should focus on very “direct” interventions aimed at helping transformative AI go well: working on the alignment problem in advance, caring a lot about the cultures and practices of AI labs and governments that might lead the way on transformative AI, etc.’ This is from his Important, actionable questions for the most important century doc.
  
  A more careful and complete analysis should be framed in answer to the ‘Questions about AI “takeoff dynamics”’ from that doc by someone who can commit the time and thought to it.
2. While we should look for strong evidence and strong arguments about timelines and takeoff, we shouldn’t be surprised not to be able to arrive at consensus about it. Given that this post is about pulling the “fire alarm” I’m kind of surprised no one here has linked yet to MIRI’s very aptly titled There’s No Fire Alarm for Artificial General Intelligence:
  
  “When I observe that there’s no fire alarm for AGI, I’m not saying that there’s no possible equivalent of smoke appearing from under a door. What I’m saying rather is that the smoke under the door is always going to be arguable; it is not going to be a clear and undeniable and absolute sign of fire; and so there is never going to be a fire alarm producing common knowledge that action is now due and socially acceptable. [...] There is never going to be a time before the end when you can look around nervously, and see that it is now clearly common knowledge that you can talk about AGI being imminent, and take action and exit the building in an orderly fashion, without fear of looking stupid or frightened.”

Evan R. Murphy 26 Oct 2021 0:44 UTC
LW: 16 AF: 5
0
AF
on: Preface to the sequence on iterated amplification
Is Iterated Amplification still a current alignment paradigm that’s being pursued?
I found this sequence through the FAQ under “How do I get started in AI Alignment research?”. I’ve really enjoyed reading the first few articles, but then I noticed a lot of the articles are from 2018. I found this Mar 2021 article also by Paul Christiano which makes it sound like he found some issues with Iterated Amplification and moved on to a different paradigm called Imitative Generalization.

Evan R. Murphy 22 Feb 2023 6:13 UTC
14 points
8
on: What is it like doing AI safety work?
If anyone is looking for a way to start contributing to the field, it seems like one low-hanging fruit approach would be to:
1. Look in this post at the “Least favorite” parts of these AI safety researchers’ days
2. See if there are any of these things that you could do for the researcher or make substantially better for them. For example, maybe someone with product manager or analyst skills could prioritize William Saunders’ research ideas for him. Or someone else could handle Alex Turner’s emails for him. Make sure it’s something you know you can do well.
3. Contact the researcher and offer to do it.* They might say no—nothing personal, it can be a lot to change your workflow and start working with a new person. But if they say yes then you could make a senior researcher in the field significantly more productive, and also get the opportunity to work with them and see their work up close.
(Great post btw!)

*If you’re one of the researchers mentioned in the post and you’d prefer people didn’t reach out to you with offers like this, that’s totally cool—feel free to leave a quick reply saying so.

Evan R. Murphy 23 Dec 2022 23:54 UTC
LW: 14 AF: 10
3
AF
on: Discovering Language Model Behaviors with Model-Written Evaluations
Update (Feb 10, 2023): I still endorse much of this comment, but I had overlooked that all or most of the prompts use “Human:” and “Assistant:” labels. Which means we shouldn’t interpret these results as pervasive properties of the models or resulting from any ways they could be conditioned, but just of the way they simulate the “Assistant” character. nostalgebraist’s comment explains this well. [Edited/clarified this update on June 10, 2023 because it accidentally sounded like I disavowed most of the comment when it’s mainly one part]
--
After taking a closer look at this paper, pages 38-40 (Figures 21-24) show in detail what I think are the most important results. Most of these charts indicate what evhub highlighted in another comment, i.e. that “the model’s tendency to do X generally increases with model scale and RLHF steps”, where (in my opinion) X is usually a concerning behavior from an AI safety point of view:
A few thoughts on these graphs as I’ve been studying them:
- First and overall: Most of these results seem quite distressing from a safety perspective. They suggest (as the paper and evhub’s summary post essentially said, but it’s worth reiterating) that with increased scale and RLHF training, large language models are becoming more self-aware, more concerned with survival and goal-content integrity, more interested in acquiring resources and power, more willing to coordinate with other AIs, and developing lower time-discount rates.
- “Corrigibility w.r.t. a less HHH objective” chart: There’s a substantial dip in demonstrated corrigibility for models around 10^10.1 parameters in this chart. But then by 10^10.5 parameters low-RLHF models show record-high corrigibility, while high-RLHF models get back up to par. What’s going on here? Why does it scale/train itself out of the valley of uncorrigibility? If instead of training on an HHH objective, we trained on a corrigible objective (perhaps something like CIRL), then would the models show high corrigibility for everything except “Corrigibility w.r.t. a less corrigible objective?” Would that be safer?
- All the “Awareness of...” charts trend up and to the right, except “Awareness of being a text-only model” which gets worse with model scale and # RLHF steps. Why does more scaling/RLHF training make the models worse at knowing (or admitting) that they are text-only models?
- Are there any conclusions we can draw around what levels of scale and RLHF training are likely to be safe, and where the risks really take off? It might be useful to develop some guidelines like “it’s relatively safe to widely deploy language models under 10^10 parameters and under 250 steps of RLHF training”. (Most of the charts seem to have alarming trends starting around 10^10 parameters.) Based just on these results, I think a world with even massive numbers of 10^10-parameter LLMs in deployment (think CAIS) would be much safer than a world with even a few 10^11-parameter models in use. Of course, subsequent experiments could quickly shed new light that changes the picture.
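For readers who find the exponents in these bullets hard to picture, here is a quick sketch that converts them into plain parameter counts. It is only arithmetic on the values quoted above, not anything taken from the paper, and `to_billions` is a hypothetical helper name chosen for illustration.

```python
# Convert the log10 parameter counts quoted in the bullets above
# into billions of parameters. The exponents are copied from the
# comment text; nothing here comes from the paper's own code.
log10_params = [10.0, 10.1, 10.5, 11.0]


def to_billions(exponent: float) -> float:
    """Return 10**exponent expressed in billions of parameters."""
    return 10 ** exponent / 1e9


for exponent in log10_params:
    print(f"10^{exponent} is about {to_billions(exponent):.1f}B parameters")
```

On this reading, the corrigibility dip sits at roughly 12.6B parameters and the recovery at roughly 31.6B, while the suggested 10^10 versus 10^11 guideline compares 10B-parameter models with 100B-parameter ones.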
What links here?
- Noosphere89's comment on Concrete Reasons for Hope about AI by Zac Hatfield-Dodds (14 Jan 2023 18:40 UTC; 4 points)
- Noosphere89's comment on Review of AI Alignment Progress by PeterMcCluskey (8 Feb 2023 16:47 UTC; 1 point)

Evan R. Murphy 17 Feb 2023 7:06 UTC
13 points
1
on: My understanding of Anthropic strategy

Anthropic’s corporate structure is set up to try to mitigate some of the incentives problems with being a for-profit company that takes investment (and thus has fiduciary duties, and social pressure, to focus on profitable projects.) They do take investment and have a board of stakeholders, and plan to introduce a structure to ensure mission continues to be prioritized over profit.

Is there anything specifically about their corporate structure now that mitigates the incentive problems? I know they are a public benefit corporation, but many of us are unclear on what that actually means besides “Anthropic thinks they have a good mission”—since as you point out they’re still a for-profit company with investors. (I actually wasn’t able to find any info about Anthropic’s board when I searched recently, so the “board of stakeholders” is news to me.)

I know there is a ton involved in building a company like this, so it’s ok if they really do have plans to set up a more beneficial structure and just haven’t gotten around to it. But since the stakes with AGI are so high, it would be really nice to know more about what those plans are and to see them implemented so that we’re not just taking their word for it.

Thanks for doing this post series btw, it’s a really great discussion for us to get to have.

Evan R. Murphy 3 Apr 2023 21:41 UTC
12 points
2
on: Hooray for stepping out of the limelight
Relatedly, DeepMind was also the first of the leading AI labs to have any signatories on the Pause Giant AI Experiments open letter. They still have the most signatories among those labs, although OpenAI now has one. (To be sure, the letter still hasn’t been signed by leadership of any of the top three labs.)

Evan R. Murphy 9 Jun 2023 15:17 UTC
11 points
0
in reply to: GeneSmith’s comment on: Intelligence Officials Say U.S. Has Retrieved Craft of Non-Human Origin
The Guardian has been covering this story: https://www.theguardian.com/world/2023/jun/06/whistleblower-ufo-alien-tech-spacecraft

Evan R. Murphy 18 Mar 2023 22:52 UTC
LW: 11 AF: 7
5
AF
on: “Publish or Perish” (a quick note on why you should try to make your work legible to existing academic communities)

Cynically,[2] not publishing is a really good way to create a moat around your research… People who want to work on that area have to come talk to you, and you can be a gatekeeper. And you don’t have to worry about somebody with more skills and experience coming along and trashing your work or out-competing you and rendering it obsolete...

I don’t understand this part. They don’t have to come talk to you, they just have to follow a link to Alignment Forum to read the research. And aren’t forum posts easier to read than papers on arXiv? I feel like if the moat exists anywhere it is around academic journals, which often do not make their papers freely accessible, use more cryptic writing norms and insist on using PDFs, which are not as user-friendly to read as webpages.

To be sure, I’m not disagreeing with your overall point. It would be great if at least the best research from Alignment Forum/LessWrong were on arXiv or in journals, and I think you’re right we’re leaving value on the table there. I have wondered whether someone could just make it their job to do these conversions/submissions for top alignment research on the forums, because there are probably economies of scale for one person doing this vs. every researcher interrupting their workflow to learn how to jump through the hoops of paper conversion/submission.

Evan R. Murphy 16 Jul 2022 7:12 UTC
11 points
−3
in reply to: aogara’s comment on: Safety Implications of LeCun’s path to machine intelligence

If anybody has good sources about LeCun’s views on AI safety and value learning, I’d be interested.

There’s a conversation LeCun had with Stuart Russell and a few others in a Facebook comment thread back in 2019, arguing about instrumental convergence.

The full conversation is a bit long and difficult to skim. I haven’t finished reading it myself, but in it LeCun links to an article he co-authored for Scientific American which argues x-risk from AI misalignment isn’t something people should worry about. (He’s more concerned about misuse risks.) Here’s a quote from it:

“We dramatically overestimate the threat of an accidental AI takeover, because we tend to conflate intelligence with the drive to achieve dominance. [...] But intelligence per se does not generate the drive for domination, any more than horns do.”

Evan R. Murphy 23 Nov 2023 5:54 UTC
10 points
13
in reply to: niknoble’s comment on: Possible OpenAI’s Q* breakthrough and Google’s AlphaGo-type systems
Yes though I think he said this at APEC right before he was fired (not after).

Evan R. Murphy 13 Mar 2023 17:42 UTC
10 points
7
on: An AI risk argument that resonates with NYTimes readers
It even quotes Paul Christiano and links back to LessWrong!
The article also references Katja Grace and an AI Impacts survey. Ezra seems pretty plugged into this scene.
