[Question] What to do with imitation humans, other than asking them what the right thing to do is?

This question is about whether you have clever ideas about how to use AI imitations of humans for AI safety. The two main ideas I’m familiar with only seem to interface with these imitations as if they’re humans.

  • The most obvious thing one might do with a good predictor of a human is just to write software that queries the imitation human about what the right thing to do is, and then does it.

  • The less obvious thing to do is to try to amplify it, e.g. by using teams of imitations working together to choose good actions. Or maybe even an IDA (iterated distillation and amplification) loop: take the learner that learned to imitate a human and train it to imitate the teams working together. Then make teams of teams, etc.
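
To make the "teams of teams" loop concrete, here is a minimal toy sketch. Everything in it is an assumption for illustration: the task (summing a tuple of numbers), the names `imitation_human`, `amplify`, `distill`, and `ida`, and especially the "distillation" step, which just memorizes the team's answers instead of training a learner. The imitation is only trusted on questions with at most two numbers, as a stand-in for "don't go off-distribution."

```python
from typing import Callable, Dict, List, Tuple

Question = Tuple[int, ...]  # toy task: "what is the sum of these numbers?"
Model = Callable[[Question], int]


def imitation_human(q: Question) -> int:
    """Hypothetical imitation human: only reliable on small, familiar questions."""
    if len(q) > 2:
        raise ValueError("off-distribution: question too large")
    return sum(q)


def amplify(model: Model) -> Model:
    """Build a 'team': delegate each half of the question to `model`,
    then let the imitation human combine the two sub-answers."""
    def team(q: Question) -> int:
        if len(q) <= 2:
            return imitation_human(q)
        mid = len(q) // 2
        return imitation_human((model(q[:mid]), model(q[mid:])))
    return team


def distill(team: Model, training_questions: List[Question]) -> Model:
    """Toy distillation: memorize the team's answers on the training questions.
    A real system would train a learner to imitate the team instead."""
    table: Dict[Question, int] = {q: team(q) for q in training_questions}

    def student(q: Question) -> int:
        if q in table:
            return table[q]
        return imitation_human(q)  # unseen question: fall back to the base imitation
    return student


def slices(q: Question, max_len: int) -> List[Question]:
    """All contiguous sub-questions of q with length at most max_len."""
    return [q[i:i + n] for n in range(1, max_len + 1) for i in range(len(q) - n + 1)]


def ida(target: Question, rounds: int) -> Model:
    """Alternate amplification and distillation; each round doubles the
    question size the current model can handle."""
    model: Model = imitation_human
    for k in range(1, rounds + 1):
        team = amplify(model)
        model = distill(team, slices(target, 2 ** (k + 1)))
    return model


target = tuple(range(1, 9))
model = ida(target, rounds=2)
print(model(target))  # prints 36
```

Note that the imitation human never sees more than two numbers at a time, even though the final model answers an eight-number question. That is the property the amplification idea relies on, and it is also where the toy is most generous: real decompositions are not this clean.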

But can we use human imitations to increase the effectiveness of value learning in a way other than amplification/distillation? For example, is there some way of leveraging queries to human imitations to train a non-human AI that has a human-understandable way of thinking about the world?

Keep in mind the challenge that these are only imitation humans, not oracles for the best thing to do, and not even actual humans. So we can’t give them problems that are too weird, or heavily optimized by interaction with the imitation humans, because they’ll go off-distribution.

Another possible avenue is to find ways to “look inside” the imitation humans. As an analogy, if you have an image-generating GAN, you can increase the number of trees in an image by finding the parameters associated with trees and then turning them up. Can you do the same thing with a human-imitating GAN, but turning up “act morally” or “be smart”?
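
For readers who haven't seen the "turn up the trees" trick, here is a minimal sketch of one way to do that kind of steering. It is a swapped-in technique, not necessarily what the GAN work does: instead of editing internal parameters, it finds a difference-of-means direction in latent space and shifts the latent along it. The linear `generate` and `attribute_score` functions and all names are hypothetical stand-ins; a real generator is nonlinear, and nobody has a ready-made scorer for “act morally.”

```python
import numpy as np

rng = np.random.default_rng(0)
LATENT_DIM, OUTPUT_DIM = 8, 16

# Hypothetical stand-ins: a linear "generator" and a linear attribute scorer.
W = rng.normal(size=(LATENT_DIM, OUTPUT_DIM))
attribute_readout = rng.normal(size=OUTPUT_DIM)


def generate(z):
    """Map latent vector(s) to output(s)."""
    return z @ W


def attribute_score(x):
    """How much of the attribute (e.g. 'trees') the output contains."""
    return x @ attribute_readout


def find_direction(n_samples=2000):
    """Difference-of-means latent direction between high- and low-attribute samples."""
    z = rng.normal(size=(n_samples, LATENT_DIM))
    scores = attribute_score(generate(z))
    high = scores > np.median(scores)
    direction = z[high].mean(axis=0) - z[~high].mean(axis=0)
    return direction / np.linalg.norm(direction)


def turn_up(z, direction, strength):
    """Shift a latent along the attribute direction."""
    return z + strength * direction


direction = find_direction()
z0 = rng.normal(size=LATENT_DIM)
before = attribute_score(generate(z0))
after = attribute_score(generate(turn_up(z0, direction, 3.0)))
print(before, after)
```

In this linear toy the score is guaranteed to go up. In a real model, a large `strength` pushes the latent away from anything seen in training, which is the same off-distribution worry as above.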