We can probably prevent deceptive alignment by blocking situational awareness entirely: train in sandbox simulations where even a human-level AI could not infer its true situation. Models raised in these environments would not have much direct economic value themselves, but they allow safe exploration and evaluation of alignment for powerful architectures. Some groups are already training AIs in Minecraft, for example, which is an early form of sandbox sim.
Training an AI in Minecraft is enormously safer than training it on the open internet: in the former environment, AIs can scale to superhuman capability safely; in the latter, probably not. We have already scaled AI to superhuman levels in simple games like chess and Go, but those environments are not complex enough in the right ways to evaluate altruism and alignment in multi-agent scenarios.
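One way to picture such a sandbox is an environment whose observations contain only in-world state, so a policy trained inside it has no channel through which to infer facts about the outside world, while still supporting multi-agent interactions that could probe cooperative behavior. A minimal, purely illustrative sketch (the class, the observation format, and the shared-food reward scheme are all hypothetical, not any group's actual setup):

```python
import random


class SandboxGridworld:
    """Toy closed-world environment: observations contain only in-world
    state, so nothing an agent sees can leak facts about the host system
    or training setup. Illustrative sketch only."""

    def __init__(self, size=5, n_agents=2, seed=0):
        self.size = size
        self.rng = random.Random(seed)
        # Random starting positions for each agent, and one food tile.
        self.agents = {
            i: (self.rng.randrange(size), self.rng.randrange(size))
            for i in range(n_agents)
        }
        self.food = (self.rng.randrange(size), self.rng.randrange(size))

    def observe(self, agent_id):
        # Purely in-world observation: own position, food position, and
        # other agents' positions -- no wall-clock time, no host details.
        others = [p for a, p in self.agents.items() if a != agent_id]
        return {"self": self.agents[agent_id], "food": self.food, "others": others}

    def step(self, agent_id, move):
        # Move within the grid, clamped to the boundaries.
        dx, dy = move
        x, y = self.agents[agent_id]
        self.agents[agent_id] = (
            max(0, min(self.size - 1, x + dx)),
            max(0, min(self.size - 1, y + dy)),
        )
        # Reward on reaching the food; with multiple agents competing or
        # sharing food, this is a crude hook for probing altruistic vs.
        # selfish behavior in multi-agent play.
        reward = 1.0 if self.agents[agent_id] == self.food else 0.0
        if reward:
            self.food = (self.rng.randrange(self.size), self.rng.randrange(self.size))
        return self.observe(agent_id), reward
```

The point of the sketch is the information boundary, not the game itself: whatever the environment is (a gridworld, Minecraft, something richer), safety comes from every observation being generated inside the simulation.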