This came from a Facebook thread where I argued that many of the main ways AI is described as failing fall into a few categories (John disagreed).
I appreciated this list, but the items strike me as fitting into a few clusters.
...I would flag that much of it is unsurprising to me, and I think categorization can work pretty well here.
In order:
1) If an agent is unwittingly deceptive in ways that are clearly catastrophic, and that could be understood by a regular person, I’d probably put that under the “naive” or “idiot savant” category. As in, it has severe gaps in its abilities that a human or other reasonable agent wouldn’t have. If the issue is that no reasonable agent would catch the downsides of a certain plan, I’d probably put that under the “we made a pretty good bet given the intelligence that we had” category.
2) I think that “What Failure Looks Like” is less Accident risk, more “Systemic” risk. I’m also just really unsure what to think about this story. It feels to me like a situation where actors are simply unable to regulate externalities or similar problems.
3) The “fusion power generator scenario” seems to me like just a case of a bad analyst. A lot of the job of an analyst is to flag important considerations; this seems like a pretty basic ask. For this itself to be the catastrophic part, I think we’d have to be seriously bad at it (i.e., the “idiot savant” category again).
4) STEM-AGI → I’d also put this in the naive or “idiot savant” category.
5) “that plan totally fails to align more-powerful next-gen AGI at all” → This seems orthogonal to “categorizing the types of unalignment”. This describes how incentives would create an unaligned agent, not what the specific alignment problem is. I do think it would be good to have better terminology here, but would probably consider it a bit adjacent to the specific topic of “AI alignment”—more like “AI alignment strategy/policy” or something.
6) “AGIs act much like a colonizing civilization” → This sounds like either unalignment has already happened, or humans just gave AIs their own power+rights for some reason. I agree that’s bad, but it seems like a different issue from what I think of as the alignment problem. More like, “Yeah, if unaligned AIs have a lot of power and agency and different goals, that would be suboptimal.”
7) “but at some point a particular subagent starts self-improving, goes supercritical, and takes over the rest of the system overnight.” → This sounds like a traditional mesa-agent failure. I expect a lot of “alignment” for a system made of many subcomponents is “making sure no subcomponent does anything terrible.” Also, this still leaves open the specific way that subsystem becomes or is unaligned.
8) “using an LLM to simulate a whole society” → Sorry, I don’t quite follow this one.
Personally, I like the focus “scheming” has. At the same time, I imagine there are another 5 to 20 clean concerns we should also focus on (some of which have been getting attention).
While I realize there’s a lot we can’t predict, I think we could do a much better job of simply making lists of different risk factors and allocating research amongst them.