Perhaps we can use this defense-of-theory instinct as a simplified map of what we want the AI to do. So: create a paperclip maximizer that is (handwave) somehow restricted from doing anything its creator would try to convince people it would never do. This assumes that the way we steer ourselves away from horrid ideas is simpler than the way we decide we like something.