Thanks for taking the time to think and write about this important topic!
Here are some point-by-point comments as I read:
(Though I suspect translating these technical capabilities to the economic and societal impact we associate with AGI will take significantly longer.)
I think it’ll take an additional 0 to 5 years roughly. More importantly though, I think that the point to intervene on—the time when the most important decisions are being made—is right around the time of AGI. By the time you have ASI, and certainly by the time you are deploying ASI into the economy, you’ve probably fallen into one of the two stable attractor states I describe here. Which one you fall into depends on choices made earlier, e.g. how much alignment talent you bring into the project, the extent to which that talent is optimistic vs. appropriately paranoid, the time you give them to let them cook with the models, the resources you give them (% of total compute, say), their say in overall design strategy, etc.
This assumes that our future AGIs and ASIs will be, to a significant extent, scaled-up versions of our current models. On the one hand, this is good news, since it means our learnings from current models are relevant for more powerful ones, and we can develop and evaluate safety techniques using them. On the other hand, this makes me doubt that safety approaches that do not show signs of working for our current models will be successful for future AIs.
I agree that future AGIs and ASIs will be, to a significant extent, scaled-up versions of current models (at least at first; I expect the intelligence explosion to rapidly lead to additional innovations and paradigm shifts). I’m not sure what you are saying with the other sentences. Sometimes when people talk about current alignment techniques working, what they mean is ‘causes current models to be better at refusals and jailbreak resistance,’ which IMO is a tangentially related but importantly different problem from the core problem(s) we need to solve in order to end up in the good attractor state. After all, you could probably make massive progress on refusals and jailbreaks simply by making the models smarter, without influencing their goals/values/principles at all.
Oh wait, I just remembered I can comment directly on the text with a bunch of little comments instead of making one big comment here—I’ll switch to that henceforth.
Cheers!
Thanks all for commenting! Just a quick apology for being behind on responding, but I do plan to get to it!