This is the sort of thing I’ve been thinking about since “What’s the dream for giving natural language commands to AI?” (which bears obvious similarities to this post). The main problems I noted there apply similarly here:
1. Prediction in the supervised task might not care about the full latent space used for the unsupervised tasks, losing information.
2. Little to no protection from Goodhart’s law: things that are extremely good proxies for human values still might not be safe to optimize.
3. It doesn’t care about metaethics; it just maximizes some fixed thing. That wouldn’t be a problem if the thing were meta-ethically great to start with, but it probably incorporates plenty of human foibles in order to accurately predict us.
The killer is really that second one. If you run this supervised learning process and it gives you a ranking of outcomes by human-values score, that isn't a safe AI even if it's doing a great job on average, because the thing that gets the absolute best score is probably an exploit of the specific pattern-recognition algorithm doing the ranking. In short, we still need to solve the other-izer problem.
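To make that failure mode concrete, here's a minimal sketch (PyTorch; `value_model` and every other name here is hypothetical) of what "optimizing against the learned ranking" amounts to: gradient ascent on the input of a frozen scorer. The argmax it finds is usually an adversarial artifact of that particular network, not a genuinely high-value outcome.

```python
import torch

def exploit_scorer(value_model, input_dim, steps=1000, lr=0.1):
    """Gradient-ascend an input to maximize a frozen 'human values' score.

    value_model: hypothetical frozen network mapping an input tensor to a scalar score.
    """
    x = torch.randn(input_dim, requires_grad=True)
    opt = torch.optim.Adam([x], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = -value_model(x)   # ascend the predicted score
        loss.backward()
        opt.step()
    return x.detach()            # typically a Goodhart exploit, not a good plan
```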
Actually, your trees example does give some ideas. Could you look inside a GAN trained on normal human behavior and identify what parts of it were the “act morally” or “be smart” parts, and turn them up? Choosing actions is, after all, a generative problem, not a classification or regression problem.
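As a rough sketch of what "turning up" such a part might look like (assuming a standard latent-variable GAN and some labelled examples of the behavior; all names are hypothetical), one could fit a direction in latent space that separates "acted morally" from the rest and push latent codes along it before generating, in the spirit of latent-direction editing:

```python
import torch

def attribute_direction(latents, labels):
    """Crude attribute direction: difference of class means in latent space (labels are 0/1)."""
    pos = latents[labels == 1].mean(dim=0)
    neg = latents[labels == 0].mean(dim=0)
    d = pos - neg
    return d / d.norm()

def generate_amplified(generator, z, direction, strength=3.0):
    """Shift a latent code along the attribute direction before decoding it."""
    return generator(z + strength * direction)
```

Whether anything in the generator's latent space actually factors into an "act morally" knob, rather than the attribute being smeared across many entangled directions, is of course the open question.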