Donald Hobson comments on Alignment as Game Design

Donald Hobson 19 Jul 2022 16:31 UTC
3 points
In a GAN, one network tries to distinguish real images from fake. The other network tries to produce fake images that fool the first net. Both of these are simple formal tasks.
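To make "simple formal tasks" concrete, here is a minimal sketch of the two objectives, assuming the discriminator outputs a probability that an image is real. The function names are made up for illustration, and the generator loss shown is the common non-saturating variant, which is only one of several choices.

```python
import math

def discriminator_loss(d_real: float, d_fake: float) -> float:
    """Cross-entropy for the discriminator.

    d_real: probability the discriminator assigns to a real image being real.
    d_fake: probability it assigns to a generated image being real.
    The loss is low when d_real is near 1 and d_fake is near 0.
    """
    return -(math.log(d_real) + math.log(1.0 - d_fake))

def generator_loss(d_fake: float) -> float:
    """Non-saturating generator loss (an assumed, common variant).

    The loss is low when the discriminator is fooled, i.e. d_fake is near 1.
    """
    return -math.log(d_fake)

# A discriminator that separates real from fake well scores better (lower loss)
# than one that is guessing at 50/50.
print(discriminator_loss(0.9, 0.1) < discriminator_loss(0.5, 0.5))
# A generator whose fakes are rated "probably real" scores better than one
# whose fakes are easily spotted.
print(generator_loss(0.9) < generator_loss(0.1))
```

Each objective is a few lines of arithmetic with nothing left implicit, which is what makes the GAN setup easy to specify.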
“exploits in the objective function” could be considered as “solutions that score highly that the programmers didn’t really intend”. The problem is that it’s hard to formalize what the programmers really intended. Given an evolutionary search for walking robots, a round robot that tumbles over might be a clever unexpected solution, or reward hacking, depending on the goals of the developers. Are the robots intended to transport anything fragile? Anything that can’t be spun and tossed upside down? Whether the tumblebot is a clever unexpected design or a reward hack depends on things that are implicit in the developers’ minds, not part of the program at all.
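A toy sketch of the tumblebot point, with everything in it hypothetical: the `Rollout` fields, the numbers, and the spin threshold are invented for illustration. The written objective only sees distance, while the "intent" function needs an extra input (`cargo_is_fragile`) that never appears in the actual search.

```python
from dataclasses import dataclass

@dataclass
class Rollout:
    name: str
    distance_m: float      # how far the robot travelled
    max_spin_rad_s: float  # peak body rotation rate (hypothetical measurement)

def written_objective(r: Rollout) -> float:
    """What the evolutionary search actually optimizes: distance only."""
    return r.distance_m

def unwritten_intent(r: Rollout, cargo_is_fragile: bool) -> float:
    """What the developers might have meant. The cargo_is_fragile flag
    lives in their heads, not in the program."""
    if cargo_is_fragile and r.max_spin_rad_s > 1.0:  # assumed threshold
        return 0.0
    return r.distance_m

walker = Rollout("walker", distance_m=8.0, max_spin_rad_s=0.2)
tumbler = Rollout("tumblebot", distance_m=12.0, max_spin_rad_s=15.0)
candidates = [walker, tumbler]

best_by_objective = max(candidates, key=written_objective)
best_if_fragile = max(candidates, key=lambda r: unwritten_intent(r, True))
best_if_sturdy = max(candidates, key=lambda r: unwritten_intent(r, False))

print(best_by_objective.name, best_if_fragile.name, best_if_sturdy.name)
```

The same tumblebot is the right answer or a reward hack depending on a flag the search never gets to see.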
A lot of novice AI safety ideas look like “AI 1 has this simple specifiable reward function. AI 2 oversees AI 1. AI 2 does exactly what we want, however hard that is to specify, and is powered by pure handwavium.”