Shaping safer goals

How can we move the needle on AI safety? In this sequence I think through some approaches that don’t rely on precise specifications—instead, they involve “shaping” our agents to think in safer ways and to have safer motivations. This is particularly relevant to the prospect of training AGIs in multi-agent (or other open-ended) environments.

Note that all of the techniques I propose here are speculative brainstorming; I’m not confident in any of them, although I’d be excited to see further exploration along these lines.

Multi-agent safety

Safety via selection for obedience

Competitive safety via gradated curricula

AGIs as collectives

Safer sandboxing via collective separation