I’m sceptical of any approach to alignment that involves finding a perfect ungameable utility function.
Even if you could find one, and even if you could encode it accurately as the training objective, that only affects outer alignment.
What really matters for AI safety is inner alignment: whether the objective the trained model actually pursues matches the one it was trained on. And a learned objective is very unlikely to pick up all the subtle nuances of a complex utility function.