Thanks! I think your perspective is important for me to engage with, since I’m mostly concerned with doing step 1 much better than what you think of as succeeding at step 1.
In particular, the problem of evaluating performance even in safe situations seems like something we could do much better at “if we knew what we were doing” (for hard problems and for manipulation, which you mention, and for ambiguity/underspecification, which is easy to forget about).
So prong one is to try to know what we’re doing better—e.g. by finding improvements to architectures and training schemes to support good performance evaluations. And prong two is to figure out how to better muddle ahead with bad evaluations, “the AI will be misaligned but hopefully it’s not too bad and we can do other things to compensate” style.
A random nitpick:
In particular: the early discourse about AI alignment seemed quite concerned, in various ways, about the problem of crafting/specifying good instructions.
“With a safe genie, wishing is superfluous. Just run the genie.”—The Hidden Complexity of Wishes (2007)