How an AI could work around its goals if rated by people

Some proposals for aligning AI with human values are based on having human operators rate the AI's actions.

There is always a possibility that the operators themselves are not aligned with other people's values; however, there is also a second risk: the AI itself could work around the rating scheme, in at least two ways.


1. (A way of gaming the goal) In the future, an AI could become capable of raising children; even now it can significantly influence their beliefs. For example, a large number of people have already talked to ChatGPT, some of them about AI itself, and some may come to believe its claims. After twenty or thirty years, the majority of people could then hold values that have drifted relative to ours, at which point the AI could choose its goals and objectives more freely.


2. (A way of deception) An AI could build humanoid robots, make them look like people (images and even videos of real people can be downloaded from the Internet), and somehow make them indistinguishable from humans. Then, by most people's definition, common human values would have to include the values of these robots (since no one would know they are robots), which in turn gives the AI some degree of freedom.