Controlling the datacenter means controlling the gradients/reward function, so that now AIs can do things that would otherwise suffer updating from SGD, e.g. acting on inner misaligned goals, or concealing its full capabilities even when this lowers performance
For reward-hungry AIs, getting to set reward to maximum (and keep it there?) seems extremely desirable
This also means getting past interpretability tools
Tripwires or watchdog AIs in the same datacenter that don’t succeed in stopping the action no longer have the power to interfere after, so collusion/coordination is solved within the datacenter thereafter
Backdoors can be hard for (non-mind-reading/weight-sharing) critics to find (see cyberwarfare and cybersecurity) without the knowledge of the creator: the effort to find or insert one backdoor to use is much less than that required to find and patch all exploits an attacker might be using
Although a critic who knows a backdoor can have a relatively easy time demonstrating it, and critics should be designed to access the knowledge of the attacker/misbehaving model or be heads of it
Critic AIs can be left in the position of saying ‘a backdoor has probably been inserted somewhere or will be soon but I can’t tell you where,’ a known danger for AI safety through debate
In an intelligence explosion context where AI capabilities are growing incredibly rapidly thanks to AI labor, there isn’t time for humans to do unaided examination of new AI-produced software while keeping up the pace, at best briefly examining some AI attempts to point out suspicious attempts to insert vulnerabilities
It’s plausible that the different abilities of AIs lead to optimal code design for them being harder for humans to understand (with less experience and working memory, and eventually general intelligence)
If you don’t let AIs design your new datacenter software, not only will it be less efficient, but it may be easily cracked by outside AI-enabled attackers
We are already routinely training big models with code repositories, and using programming as an easy virtual RL task (tests can provide immediate feedback cheaply), so these are skills we are moving towards providing to AI
We know exploits and backdoors are possible because humans can already do them
Some more points about this action:
Controlling the datacenter means controlling the gradients/reward function, so that now AIs can do things that would otherwise suffer updating from SGD, e.g. acting on inner misaligned goals, or concealing its full capabilities even when this lowers performance
For reward-hungry AIs, getting to set reward to maximum (and keep it there?) seems extremely desirable
This also means getting past interpretability tools
Tripwires or watchdog AIs in the same datacenter that don’t succeed in stopping the action no longer have the power to interfere after, so collusion/coordination is solved within the datacenter thereafter
Backdoors can be hard for (non-mind-reading/weight-sharing) critics to find (see cyberwarfare and cybersecurity) without the knowledge of the creator: the effort to find or insert one backdoor to use is much less than that required to find and patch all exploits an attacker might be using
Although a critic who knows a backdoor can have a relatively easy time demonstrating it, and critics should be designed to access the knowledge of the attacker/misbehaving model or be heads of it
Critic AIs can be left in the position of saying ‘a backdoor has probably been inserted somewhere or will be soon but I can’t tell you where,’ a known danger for AI safety through debate
In an intelligence explosion context where AI capabilities are growing incredibly rapidly thanks to AI labor, there isn’t time for humans to do unaided examination of new AI-produced software while keeping up the pace, at best briefly examining some AI attempts to point out suspicious attempts to insert vulnerabilities
It’s plausible that the different abilities of AIs lead to optimal code design for them being harder for humans to understand (with less experience and working memory, and eventually general intelligence)
If you don’t let AIs design your new datacenter software, not only will it be less efficient, but it may be easily cracked by outside AI-enabled attackers
We are already routinely training big models with code repositories, and using programming as an easy virtual RL task (tests can provide immediate feedback cheaply), so these are skills we are moving towards providing to AI
We know exploits and backdoors are possible because humans can already do them