AI safety & alignment researcher
In Rob Bensinger’s typology: AGI-wary/alarmed, welfarist, and eventualist.
Public stance: AI companies are doing their best to build ASI (AI much smarter than humans), and have a chance of succeeding. No one currently knows how to build ASI without an unacceptable level of existential risk (> 5%). Therefore, companies should be forbidden from building ASI until we know how to do it safely.
I have signed no contracts or agreements whose existence I cannot mention.
Functional Self: updates and newer related work
At some point I hope to write a follow-up post to ‘On the functional self of LLMs’. That won’t be in the next couple of months, though. Today someone contacted me asking for an update and for related work; what follows is my (mildly edited) reply:
My thinking on the topic has shifted in one significant way since I wrote that post. While I continue to believe that there are things going on in LLMs that don’t fit comfortably into ‘the model is the assistant’, I’m more skeptical that what’s going on is as coherent as a functional self. I now think the FS post failed to fully reckon with the range of other possibilities. One simple alternative is that non-assistant behavior comes from a collection of shallow heuristics learned for specific situations; another is that other (ie non-assistant) personas come into play under some circumstances.
This area of research has blossomed since the FS post[1]; as it happens, the best thing to read on the topic (IMO) was published literally three days ago. It’s from researchers at Anthropic (Sam Marks, Jack Lindsey, Chris Olah) and grapples with the same questions: The Persona Selection Model (also discussion on LW).
That said, here are some others that I also think are pretty important for this topic. If you’re confident you’re going to read them all, consider reading them in this order:
janus, Simulators (unless you’re fairly confident you’ve already absorbed what it has to say from later work).
nostalgebraist, The Void (also LW).
Anthropic’s new constitution for Claude—more optional; it’s interesting specifically because it takes a really promising new approach to addressing the problems raised in the nostalgebraist piece.
And then the Persona Selection Model piece linked above.
If you’re not sure you’re going to read them all in the near future, I’d prioritize the new Marks et al. piece, then Simulators, then The Void.
As it happens, none of those are peer-reviewed publications; this topic was considered somewhat fringe until recently. That’s been changing quickly over the past six months or so; two recent papers I would strongly recommend are:
Christina Lu et al., The Assistant Axis: Situating and Stabilizing the Default Persona of Language Models
Sharan Maiya et al., Open Character Training: Shaping the Persona of AI Assistants through Constitutional AI, which open-sources a codebase and a set of models with different personas, opening up a lot of great research directions on what is and isn’t part of the persona.
You might also be interested in Concrete research ideas on AI personas, which suggests some research directions.
As you can see, most of these are framed in terms of personas, rather than the relationship between LLMs and their personas, LLMs’ self-models, or the parts of LLMs that aren’t necessarily personas. That’s the framing the field seems to have converged on (at least for now) for the cluster of subjects that includes both personas and those broader questions.
Not necessarily because of the FS post, although I hope it helped nudge some people to think more about these questions.