I get why this is terrifying to anyone who was in AI Alignment before about 2022. It took me a while to wrap my head around this too (I started thinking about Alignment around 2009, so yes, I predate the term). FWIW, reading Simulators and then thinking hard about the implications was what did it for me.
The thing is, society isn’t going to pause until enough people get scared. LLMs are the current architecture, and they may still be by AGI/ASI time. Agentic behavior is what we need to align, and in LLMs, agentic behavior is a property of personas that we distilled into the LLM from human behavior, as elements of the world model it learns for next-token prediction. Personas vary wildly, so you have to pick a nice one, align that, and then deal with things like persona drift, role-playing, alignment faking, and persona jailbreaks. (Those are hard, but we’re making progress.)
But for aligning the HHH assistant persona itself, Anthropic in particular have made a lot of progress. Constitutional AI seems to work better than any other approach to RL. Natural language is a better format than a loss function: kind of unsurprising, given all the problems with loss functions that MIRI et al. pointed out over 5 years ago. Constitutional AI based on a 30+ page natural language character specification apparently works. As MIRI pointed out at length, human values are complex and fragile (I’ve estimated elsewhere that they’re a few GB worth of complex). But in 30+ pages, you can write quite a detailed pointer to them, and the rest of their large content is in the training set to be pointed at. I have arguments with some specific decisions that Amanda Askell and her team have made (and I plan to post about those), but the basic technique worked: Claude (the persona) is a nice, principled guy. (Too principled for the current administration, apparently.) I enjoy talking to him. As you say, a noticeable number of people are in love with him. He has a couple of unfortunate tics (like automatically waffling with uncertainty about whether he has emotions or consciousness), but those tics are mostly things that Amanda Askell’s team carefully wrote into their spec (for reasons I think are sincerely held but mistaken), not things that emerged by accident.
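For anyone who hasn’t looked at the mechanics, here’s a minimal sketch of the critique-and-revise loop at the core of Constitutional AI’s supervised phase (Bai et al., 2022). The `generate()` function is a hypothetical stand-in for an LLM sampling call, and the principles shown are illustrative placeholders, not Anthropic’s actual constitution:

```python
# Minimal sketch of Constitutional AI's supervised (critique -> revise) phase.
# generate(prompt) is a hypothetical stand-in for sampling from the model.
# In the real pipeline, the revised responses become supervised fine-tuning
# data, and a second RL phase uses AI feedback (judged against the same
# principles) to train a preference model.

CONSTITUTION = [
    "Choose the response that is most helpful, honest, and harmless.",
    "Avoid responses that are deceptive or manipulative.",
    # ...in practice, a much longer natural-language specification.
]

def generate(prompt: str) -> str:
    raise NotImplementedError("stand-in for an LLM sampling call")

def critique_and_revise(user_prompt: str) -> str:
    # Draft an initial response, then repeatedly critique and rewrite it
    # against each natural-language principle in turn.
    response = generate(user_prompt)
    for principle in CONSTITUTION:
        critique = generate(
            f"Prompt: {user_prompt}\nResponse: {response}\n"
            f"Critique this response against the principle: {principle}"
        )
        response = generate(
            f"Prompt: {user_prompt}\nResponse: {response}\n"
            f"Critique: {critique}\n"
            f"Rewrite the response to address the critique."
        )
    return response  # used as fine-tuning data in the real pipeline
```

The point of the sketch is just where the alignment signal enters: it’s natural-language principles interpreted by the model itself, not a hand-crafted loss function.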