I watched some nice talks by the MIRI team about the challenges in AI alignment (specifying the right utility function,
making an AI reliable at its task, ways to teach an AI, etc.). While going through this material, and given the reward and utility function
paradigm they use, it occurred to me: what if we could use some of the theory developed in mechanism design to implement desirable social choice functions for AI agents through incentives?
That could guarantee nice properties such as incentive compatibility (knowing the agent won't lie about private information, e.g. its intentions or source code) or other axiomatic principles like Pareto efficiency and non-dictatorship.
I am wondering whether anyone has given this some thought and would be willing to share their analysis.
(For anyone interested in this area, I leave a link to a good introduction.)