I’m doing research and other work focused on AI safety and AI catastrophic risk reduction. Currently my top projects are (last updated May 19, 2023):
Serving on the board of directors for AI Governance & Safety Canada
Technical research assistance for Tony Barrett and collaborators on developing an AI risk management-standards profile for increasingly multi- or general-purpose AI, designed to be used in conjunction with the NIST AI RMF or the AI risk management standard ISO/IEC 23894
General areas of interest for me are AI safety strategy, comparative AI alignment research, prioritizing technical alignment work, analyzing the published alignment plans of major AI labs, interpretability, the Conditioning Predictive Models agenda, deconfusion research and other AI safety-related topics. My work is currently self-funded.
Research that I’ve authored or co-authored:
Steering Behaviour: Testing for (Non-)Myopia in Language Models
Interpretability’s Alignment-Solving Potential: Analysis of 7 Scenarios
(Scroll down to read other posts and comments I’ve written)
Other recent work:
Running a regular coworking meetup in Vancouver, BC for people interested in AI safety and effective altruism
Facilitator for the AI Safety Fellowship (2022) at Columbia University Effective Altruism
Gave a talk on myopia and deceptive alignment at an AI safety event hosted by University of Victoria (Jan 29, 2023)
Invited participant in the CLTC UC Berkeley Virtual Workshops on the “Risk Management-Standards Profile for Increasingly Multi- or General-Purpose AI” (Jan 2023 and May 2023)
Reviewed early pre-published drafts of work by other researchers:
Conditioning Predictive Models: Risks and Strategies by Evan Hubinger, Adam Jermyn, Johannes Treutlein, Rubi Hudson and Kate Woolverton
Circumventing interpretability: How to defeat mind-readers by Lee Sharkey
Actionable Guidance for High-Consequence AI Risk Management: Towards Standards Addressing AI Catastrophic Risks by Tony Barrett, Dan Hendrycks, Jessica Newman and Brandie Nonnecke
AI Safety Seems Hard to Measure by Holden Karnofsky
Racing through a minefield: the AI deployment problem by Holden Karnofsky
Alignment with argument-networks and assessment-predictions by Tor Økland Barstad
Interpreting Neural Networks through the Polytope Lens by Sid Black et al.
Jobs that can help with the most important century by Holden Karnofsky
DeepMind’s generalist AI, Gato: A non-technical explainer by Frances Lorenz, Nora Belrose and Jon Menaster
Potential Alignment mental tool: Keeping track of the types by Donald Hobson
Ideal Governance by Holden Karnofsky
Before getting into AI safety, I was a software engineer for 11 years at Google and various startups. You can find details about my previous work on my LinkedIn.
I’m always happy to connect with other researchers or people interested in AI alignment and effective altruism. Feel free to send me a private message!
Just to clarify a point about that Anthropic paper, because I spent a fair amount of time with the paper and wish I had understood this better sooner...
I don’t think it’s right to say that Anthropic’s “Discovering Language Model Behaviors with Model-Written Evaluations” paper shows that larger LLMs necessarily exhibit more power-seeking and self-preservation. It only showed that language models that are larger or have more RLHF training exhibit more of these behaviours when they are simulating an “Assistant” character. It may still be possible to harness the capabilities of larger models without invoking character simulation and these problems, by prompting or fine-tuning the models in particular, careful ways.
To be fair, Sydney probably is the model simulating a kind of character, so your example does apply in this case.
(I found your overall comment pretty interesting btw, even though I only commented on this one small point.)