(Posting in a personal capacity unless stated otherwise.) I help allocate Open Phil’s resources to improve the governance of AI with a focus on avoiding catastrophic outcomes. Formerly: co-founder of the Cambridge Boston Alignment Initiative, which supports AI alignment/safety research and outreach programs at Harvard, MIT, and beyond; co-president of Harvard EA; Director of Governance Programs at the Harvard AI Safety Team and MIT AI Alignment; and occasional AI governance researcher.
Not to be confused with the user formerly known as trevor1.
Coming back >2.5 years later to say this is among the most helpful pieces of AI writing I’ve ever read. I remember it being super clarifying at the time, and I still link people to it, cite it in conversation, and use similar analogies. (Even though I now also caveat it with “...but also maybe really sophisticated agents will actively seek it, because the training environment might reward it, and maybe they’ll ‘experience’ something like fear/pain/etc for things correlated with negative reward, if they experience things...”) Thank you for writing it up!!