This seems clearly false in the case of deep learning, where progress on instilling any particular behavioral tendencies in models roughly follows the amount of available data that demonstrate said behavioral tendency. It’s thus vastly easier to align models to goals where we have many examples of people executing said goals. As it so happens, we have roughly zero examples of people performing the “duplicate this strawberry” task, but many more examples of e.g., humans acting in accordance with human values, ML / alignment research papers, chatbots acting as helpful, honest and harmless assistants, people providing oversight to AI models, etc. See also: this discussion. [emphasis mine]
The thing that makes powerful AI powerful is that it can figure out how to do things that we don’t know how to do yet, and therefore don’t have examples of. The key question for aligning superintelligences is “how do they generalize in new domains that are beyond what humans were able to do / reason about / imagine?”