The Bayesian persuasion framework requires that the set of possible world states be defined in advance—and then the question becomes, given certain utility functions for the expert and decision-maker, what information about the world state should the expert commit to revealing?
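To make the framework concrete, here is a minimal sketch of the canonical prosecutor/judge example from Kamenica and Gentzkow's Bayesian persuasion model. All numbers (the prior, the judge's conviction threshold) are illustrative assumptions; the point is that the expert commits to a signaling scheme *before* observing the state, and can profit by revealing only partial information.

```python
# Canonical Bayesian-persuasion sketch: two world states (guilty/innocent),
# an expert (prosecutor) who always prefers conviction, and a decision-maker
# (judge) who convicts iff her posterior probability of guilt clears a threshold.

prior_guilty = 0.3          # judge's prior that the defendant is guilty (assumed)
conviction_threshold = 0.5  # judge convicts iff P(guilty | signal) >= 0.5 (assumed)

# The prosecutor's optimal commitment: always send "convict" when guilty,
# and send "convict" with probability p when innocent, where p is chosen
# so the posterior after "convict" lands exactly at the threshold.
p = prior_guilty * (1 - conviction_threshold) / (
    (1 - prior_guilty) * conviction_threshold
)

posterior = prior_guilty / (prior_guilty + (1 - prior_guilty) * p)
conviction_prob = prior_guilty + (1 - prior_guilty) * p

print(f"p(signal 'convict' | innocent) = {p:.3f}")
print(f"posterior after 'convict'      = {posterior:.3f}")   # exactly at threshold
print(f"P(conviction)                  = {conviction_prob:.3f}")
```

With these numbers the prosecutor secures conviction with probability 0.6, versus only 0.3 if he committed to full disclosure: partial revelation is strictly better for the expert, which is exactly the commitment-to-a-convenient-subset behavior the comment below is worried about.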
I think that Bayesian persuasion might not be the right framework here, because we get to choose the AI’s reward function. Assume (as Bayesian persuasion does) that you’ve defined all possible world states.[1] Do you want to get the AI to reveal all the information—i.e. which particular world state we’re in—rather than a convenient subset (that it has precommitted to)? That seems straightforward: just penalize it really heavily if it refuses to tell you the world state.
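The "penalize refusal heavily" idea can be sketched in a few lines. Everything here is a toy assumption (the `REFUSED` sentinel, the penalty magnitude, the `reward` helper): the only point is that once you control the reward function, withholding the world state is never worth it.

```python
# Hedged sketch: a reward function under which full disclosure dominates refusal.
# REFUSED, PENALTY, and reward() are illustrative assumptions, not a real API.

REFUSED = None      # sentinel for "the AI declined to name the world state"
PENALTY = -1e6      # large enough to dominate any gain from withholding

def reward(reported_state, base_reward):
    """Whatever the AI otherwise earns, minus a crushing penalty for refusal."""
    if reported_state is REFUSED:
        return PENALTY
    return base_reward

# Refusing is never worth it: naming *some* state always beats the penalty.
assert reward("state_17", 3.0) > reward(REFUSED, 10.0)
```

Note what this sketch does and does not buy you: it removes the incentive to stay silent, but nothing in it checks that `reported_state` is the *true* state, which is the harder problem raised next.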
I think the much bigger challenge is getting the AI to tell you the world state truthfully—but note that this is outside the scope of Bayesian persuasion, which assumes that the expert is constrained to the truth (and is deciding which parts of the truth they should commit to revealing).
“World states” here need not mean the precise description of the world, atom by atom. If you only care about answering a particular question (“How much will Apple stock go up next week?”), then you could define the set of world states to correspond to the relevant considerations (e.g. the ordered tuple of random variables (how many iPhones Apple sold last quarter, how much time people are spending on their Macs, …)). Even so, I expect that defining the set of possible world states will be practically impossible in most cases.
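The coarse-grained notion of "world state" in the footnote can be sketched as a typed tuple of question-relevant variables. The field names and values below are illustrative assumptions drawn from the Apple example, not real data.

```python
# Sketch: coarse-graining "world state" to an ordered tuple of the random
# variables relevant to one question. Names and numbers are hypothetical.
from typing import NamedTuple

class AppleWorldState(NamedTuple):
    iphones_sold_last_quarter: int   # millions of units (assumed)
    mac_hours_per_user: float        # average weekly Mac usage (assumed)

state = AppleWorldState(iphones_sold_last_quarter=52, mac_hours_per_user=6.5)
print(state)
```

Even this toy version hints at the practical difficulty: the state space is only as good as your enumeration of "relevant considerations," and for most questions that list is open-ended.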