Precommit that, after solving the alignment problem (and becoming a multiplanetary species with plenty of computational resources, etc.), we will run lots of programs that mimic early AGI-systems.
Run these in such a way that the AGI is unable to distinguish “from the inside” whether it is an actual early-days AGI-system or an AGI-system being run after humanity has solved the alignment problem in a robust way (presuming it thinks humanity, or whoever the operators are, might do that).
Run so many of these simulations that an AGI-system would be rational to assume it is likely in a post-alignment AGI-system, and not an early-days one (see the sketch after this list).
Actually follow through on this, so that the AGI is not wrong to assume that we are likely to do it.
Do this in a way that disincentivizes suffering sub-routines, deceptive answers, and blackmail.
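To make the reasoning behind the "run so many simulations" step concrete: if the AGI weighs all instances it cannot distinguish from the inside equally, its credence that it is the one actual early-days system shrinks as the number of post-alignment re-runs grows. A minimal sketch, assuming a simple indifference prior over indistinguishable instances (the function name and the numbers are illustrative, not part of the original proposal):

```python
# Sketch of the indifference reasoning: with N indistinguishable post-alignment
# re-runs for every actual early-days run, and equal weight on each instance,
# the probability of being the "real" early-days system is 1 / (N + 1).

def p_early_days(num_post_alignment_sims: int, num_early_days_runs: int = 1) -> float:
    """Probability of being an actual early-days AGI-system, assuming the AGI
    cannot tell the runs apart from the inside and weighs them all equally."""
    total = num_post_alignment_sims + num_early_days_runs
    return num_early_days_runs / total

# With, say, a million post-alignment re-runs per early-days run, the chance
# of being the real early-days instance is about one in a million.
print(p_early_days(1_000_000))  # ~1e-06
```

This is only the indifference-prior version of the argument; an AGI with different anthropic assumptions, or doubts about whether the precommitment will be honored, would assign different numbers.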