“we show how to get agents whose long-term plans follow strategies that humans can predict”. But maybe no single human actually understands the strategy. Or maybe the traders are correctly guessing that the model’s steps will somehow lead to whatever is defined as a “good outcome”, even if they don’t understand how, which runs into problems similar to those of the RL-reward-from-the-future setup you’re trying to avoid.
Discussed in the paper in Section 6.3, bullet point 3. Agreed that if you’re using a prediction market it’s no longer accurate to say that individual humans understand the strategy.
nit: I wouldn’t use a prediction market as an overseer because markets are often uninterpretable to humans, which would miss some of the point[1].