Precommit that, after solving the alignment problem (and becoming a multiplanetary species with plenty of computational resources, etc.), we will run lots of programs that mimic early AGI-systems.
Run these in such a way that the AGI is unable to distinguish “from the inside” whether it is an actual early-days AGI-system or an AGI-system being run after humanity has solved the alignment problem in a robust way (presuming it thinks humanity, or whoever the operators are, might do that).
Run so many of these simulations that an AGI-system would be rational to assume it is likely in a post-alignment AGI-system, and not an early-days one (see the sketch after this list).
Actually follow through on this, so that the AGI is not wrong to assume that we are likely to do it.
Do this in a way that disincentivizes suffering sub-routines, deceptive answers, and blackmail.
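To make the reasoning behind the "run so many simulations" step concrete: if the AGI weighs all instances it cannot distinguish from the inside equally, its credence that it is the one actual early-days system shrinks as the number of post-alignment re-runs grows. A minimal sketch, assuming a simple indifference prior over indistinguishable instances (the function name and the numbers are illustrative, not part of the original proposal):

```python
# Sketch of the indifference reasoning: with N indistinguishable post-alignment
# re-runs for every actual early-days run, and equal weight on each instance,
# the probability of being the "real" early-days system is 1 / (N + 1).

def p_early_days(num_post_alignment_sims: int, num_early_days_runs: int = 1) -> float:
    """Probability of being an actual early-days AGI-system, assuming the AGI
    cannot tell the runs apart from the inside and weighs them all equally."""
    total = num_post_alignment_sims + num_early_days_runs
    return num_early_days_runs / total

# With, say, a million post-alignment re-runs per early-days run, the chance
# of being the real early-days instance is about one in a million.
print(p_early_days(1_000_000))  # ~1e-06
```

This is only the indifference-prior version of the argument; an AGI with different anthropic assumptions, or doubts about whether the precommitment will be honored, would assign different numbers.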