CarlShulman comments on The prototypical catastrophic AI action is getting root access to its datacenter

CarlShulman 3 Jun 2022 4:53 UTC
LW: 55 AF: 16
AF
Some more points about this action:
- Controlling the datacenter means controlling the gradients/reward function, so that now AIs can do things that would otherwise suffer updating from SGD, e.g. acting on inner misaligned goals, or concealing its full capabilities even when this lowers performance
  - For reward-hungry AIs, getting to set reward to maximum (and keep it there?) seems extremely desirable
  - This also means getting past interpretability tools
  - Tripwires or watchdog AIs in the same datacenter that don’t succeed in stopping the action no longer have the power to interfere after, so collusion/coordination is solved within the datacenter thereafter
- Backdoors can be hard for (non-mind-reading/weight-sharing) critics to find (see cyberwarfare and cybersecurity) without the knowledge of the creator: the effort to find or insert one backdoor to use is much less than that required to find and patch all exploits an attacker might be using
  - Although a critic who knows a backdoor can have a relatively easy time demonstrating it, and critics should be designed to access the knowledge of the attacker/misbehaving model or be heads of it
  - Critic AIs can be left in the position of saying ‘a backdoor has probably been inserted somewhere or will be soon but I can’t tell you where,’ a known danger for AI safety through debate
- In an intelligence explosion context where AI capabilities are growing incredibly rapidly thanks to AI labor, there isn’t time for humans to do unaided examination of new AI-produced software while keeping up the pace, at best briefly examining some AI attempts to point out suspicious attempts to insert vulnerabilities
  - It’s plausible that the different abilities of AIs lead to optimal code design for them being harder for humans to understand (with less experience and working memory, and eventually general intelligence)
  - If you don’t let AIs design your new datacenter software, not only will it be less efficient, but it may be easily cracked by outside AI-enabled attackers
- We are already routinely training big models with code repositories, and using programming as an easy virtual RL task (tests can provide immediate feedback cheaply), so these are skills we are moving towards providing to AI
- We know exploits and backdoors are possible because humans can already do them