Would you agree that AI R&D and datacenter security are safety-critical domains?
(I'm not saying such deployment has started yet, or at least not at a level sufficient to be concerning. But, for example, I would say that if you are going to have large numbers of very smart AI agents doing lots of autonomous coding and monitoring of your datacenters, much as if they were employees, then they pose an 'insider threat' risk, and could potentially sabotage, say, their successor systems or the alignment or security work happening in the company. Misalignments in these sorts of AIs could, in various ways, end up causing misalignments in successor AIs. During an intelligence explosion / period of AI R&D automation, this could result in misaligned ASI. True, such ASI would not yet be deployed outside the datacenter, but I think the point to intervene is before then, rather than after.)
I think “AI R&D” and “datacenter security” are a little too broad as categories.
I can imagine cases where we could deploy even existing models as an extra layer of datacenter security (e.g., anomaly detection). As long as this adds security (rather than replacing humans), and we are not relying on the model succeeding 100% of the time, it can be a positive application, and certainly not one that should be “paused.”
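To make the “adding a layer, not replacing humans” point concrete, here is a minimal sketch (the feature names and numbers are invented, and this is not tied to any particular deployment): an off-the-shelf anomaly detector that only escalates suspicious events to a human security team, so a missed detection just leaves you with the security process you already had.

```python
# Sketch: an anomaly detector as an *additional* security layer.
# Flagged events go to human reviewers; nothing is blocked automatically.
import numpy as np
from sklearn.ensemble import IsolationForest

# Toy feature vectors per access event: (hour_of_day, egress_mb, failed_logins)
historical_events = np.array([
    [10, 1.2, 0], [11, 0.8, 0], [14, 2.0, 1], [9, 1.5, 0],  # routine traffic
])
new_events = np.array([
    [3, 950.0, 12],   # 3 a.m., huge egress, many failed logins
    [13, 1.1, 0],     # looks routine
])

detector = IsolationForest(contamination=0.05, random_state=0).fit(historical_events)
flags = detector.predict(new_events)  # -1 = anomalous, 1 = inlier

for event, flag in zip(new_events, flags):
    if flag == -1:
        # Escalate to humans; the model adds coverage but never acts alone.
        print(f"Escalating to security team for review: {event}")
```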
With AI R&D, again, the question is how you deploy it. If you are using a model in containers, supervised by human employees, then that's fine. If you are letting models autonomously carry out large-scale training runs with little to no supervision, that is a completely different matter.
At the moment, I think the right mental model is to treat current AI models as analogous to employees who have a certain skill profile (which we can measure via evals, etc.) and who also, with some small probability, could do something completely crazy. With appropriate supervision, such employees can be useful, but you would not fully trust them with sensitive infrastructure.
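A toy sketch of that supervision pattern (all names here are illustrative, not any real framework's API): the model can propose actions, but anything touching sensitive infrastructure requires explicit human sign-off before it runs.

```python
# Sketch of the "supervised employee" pattern: low-stakes actions run
# directly, sensitive ones are routed through a human supervisor.
from dataclasses import dataclass
from typing import Callable

SENSITIVE_ACTIONS = {"modify_training_run", "change_firewall_rules", "rotate_credentials"}

@dataclass
class ProposedAction:
    name: str
    args: dict

def execute_with_supervision(action: ProposedAction,
                             run: Callable[[ProposedAction], None],
                             human_approves: Callable[[ProposedAction], bool]) -> None:
    """Run low-stakes actions directly; require human approval for sensitive ones."""
    if action.name in SENSITIVE_ACTIONS and not human_approves(action):
        print(f"Rejected by supervisor: {action.name}")
        return
    run(action)

if __name__ == "__main__":
    action = ProposedAction("modify_training_run", {"run_id": "toy-123"})
    execute_with_supervision(
        action,
        run=lambda a: print(f"Executing {a.name}"),
        human_approves=lambda a: False,  # the supervisor declines in this toy example
    )
```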
As I wrote in my essay, I think the difficult point would be if we reach the “alignment uncanny valley”: alignment is at a sufficiently good level (i.e., the probability of failure is small enough) that people are actually tempted to entrust models with such sensitive tasks, but we do not have strong enough control over this probability to drive it arbitrarily close to zero, and so there are risks from edge cases.
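A back-of-the-envelope illustration of why that valley is uncomfortable (the numbers are invented): a failure rate that looks excellent in evals can still imply many expected incidents once models act autonomously at scale, which is why being able to drive the probability arbitrarily close to zero matters.

```python
# Toy arithmetic: a per-task failure rate that seems tiny in evals still
# produces many expected failures at autonomous-deployment scale.
per_task_failure_prob = 1e-5          # looks very reliable in evaluations
autonomous_tasks_per_year = 10_000_000

expected_failures = per_task_failure_prob * autonomous_tasks_per_year
print(expected_failures)  # 100.0 -- far from "driven arbitrarily close to zero"
```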