Interesting, thanks.

However, I don’t think this is quite right (unless I’m missing something):
Now observe that in the LCDT planning world model C constructed by marginalization, this knowledge of the goalkeeper is a known parameter of the ball-kicking optimization problem that the agent must solve. If we set the outcome probabilities right, the game-theoretic outcome will be that the optimal policy is for the agent to kick right, so it plays the opposite of the move that the goalkeeper expects. I’d argue that this is a form of deception, a deceptive scenario that LCDT is trying to prevent.
I don’t think the situation is significantly different between B and C here. In B, the agent will decide to kick left most of the time since that’s the Nash equilibrium. In C the agent will also decide to kick left most of the time: knowing the goalkeeper’s likely action still leaves the same Nash solution (based on knowing both that the keeper will probably go left, and that left is the agent’s stronger side). If the agent knew the keeper would definitely go left, then of course it’d kick right—but I don’t think that’s the situation.
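For concreteness, here is a minimal sketch of the penalty-kick game as a 2×2 zero-sum game, with invented scoring probabilities (left assumed to be the kicker’s stronger side; none of these numbers are from the original discussion). It illustrates both claims above: the kicker’s equilibrium mixture favours left, against the keeper’s equilibrium mixture the kicker is exactly indifferent, and against a keeper who will definitely dive left the best response flips to kicking right.

```python
# Hypothetical penalty-kick game: score[(kick, dive)] = probability the
# kicker scores. Numbers are made up; "L" is the kicker's stronger side.
score = {
    ("L", "L"): 0.50, ("L", "R"): 0.95,
    ("R", "L"): 0.90, ("R", "R"): 0.30,
}

def kicker_ev(kick, q_left):
    """Expected scoring probability when the keeper dives left with prob q_left."""
    return q_left * score[(kick, "L")] + (1 - q_left) * score[(kick, "R")]

# Keeper's equilibrium mixture q*: makes the kicker indifferent between L and R.
q_star = (score[("L", "R")] - score[("R", "R")]) / (
    score[("L", "R")] - score[("R", "R")] + score[("R", "L")] - score[("L", "L")]
)
# Kicker's equilibrium mixture p*: makes the keeper indifferent between dives.
p_star = (score[("R", "L")] - score[("R", "R")]) / (
    score[("L", "R")] - score[("R", "R")] + score[("R", "L")] - score[("L", "L")]
)

print(f"keeper dives left with prob {q_star:.3f}")        # ~0.619: probably left
print(f"kicker kicks left with prob {p_star:.3f}")        # ~0.571: most of the time
print(f"EV(kick L | q*) = {kicker_ev('L', q_star):.3f}")  # 0.671: equal, so knowing
print(f"EV(kick R | q*) = {kicker_ev('R', q_star):.3f}")  # q* changes nothing
print(f"vs certain left: L={kicker_ev('L', 1.0)}, R={kicker_ev('R', 1.0)}")  # kick right
```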
I’d be interested in your take on Evan’s comment on incoherence in LCDT. Specifically, do you think the issue I’m pointing at is a difference between LCDT and counterfactual planners? (Or perhaps I’m just wrong about the incoherence?) As I currently understand things, I believe that CPs are doing planning in a counterfactual-but-coherent world, whereas LCDT is planning in an (intentionally) incoherent world—but I might be wrong in either case.
However, I don’t think this is quite right (unless I’m missing something) [...] I don’t think the situation is significantly different between B and C here. In B, the agent will decide to kick left most of the time since that’s the Nash equilibrium. In C the agent will also decide to kick left most of the time: knowing the goalkeeper’s likely action still leaves the same Nash solution
To be clear: the point I was trying to make is also that I do not think that B and C are significantly different in the goalkeeper benchmark. My point was that we need to go to a random prior to produce a real difference.
But your question makes me realise that this goalkeeper benchmark world opens up a bigger can of worms than I expected. When writing it, I was not thinking about Nash equilibrium policies, which I associate mostly with iterated games; I was specifically thinking about an agent design that uses the planning world to compute a deterministic policy function. To put this in mathematical terms, I was thinking of an agent design that tries to compute the single action $a^* = \arg\max_a \mathbb{E}[U \mid a]$ in the non-iterated gameplay world C.
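A rough sketch of this deterministic-argmax design, reusing the invented numbers from the earlier snippet (this is an illustration, not code from the post): in the one-shot planning world C, the keeper’s move is reduced to a fixed prior, and the agent simply returns the argmax action.

```python
# Sketch of a deterministic argmax agent in the one-shot planning world C.
# Payoffs and the prior are hypothetical illustration values.
score = {("L", "L"): 0.50, ("L", "R"): 0.95, ("R", "L"): 0.90, ("R", "R"): 0.30}
p_keeper_left = 0.62  # fixed prior over the keeper's action, assumed given

def expected_utility(kick):
    return p_keeper_left * score[(kick, "L")] + (1 - p_keeper_left) * score[(kick, "R")]

# argmax_a E[U | a]: a single deterministic action, not a mixed policy
action = max(("L", "R"), key=expected_utility)
print(action, expected_utility("L"), expected_utility("R"))  # R 0.671 0.672
```

Note that unless the prior happens to be exactly the keeper’s equilibrium mixture, this argmax is a unique pure action; that is what makes the exploitation question below relevant.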
To produce the Nash equilibrium type behaviour you are thinking about (i.e. the agent will kick left most of the time but not all the time), you need to start out with an agent design that will use the C constructed by LCDT to compute a nondeterministic policy function, which it will then use to compute its real-world action. If I follow that line of thought, I would need additional ingredients to make the agent actually compute that Nash equilibrium policy function. I would need to have iterated gameplay in B, with mechanics that allow the goalkeeper to observe whether the agent is playing a non-Nash-equilibrium policy/strategy, so that the goalkeeper will exploit this inefficiency for sure if the agent plays the non-Nash-equilibrium strategy. The possibility of exploitation by the goalkeeper is what would push the optimal agent policy towards a Nash equilibrium. But interestingly, such mechanics, where the goalkeeper can learn that a non-Nash agent policy is being used, might be present in an iterated version of the real world model B, but they will be removed by LCDT from an iterated version of C.

(Another wrinkle: some AI algorithms for solving the optimal policy in a single-shot game in B or C would turn B or C into an iterated game automatically and then solve the iterated game. Such iteration might also update the prior, if we are not careful. But if we solve B or C analytically or with Monte Carlo simulation, this type of expansion to an iterated game will not happen.)
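A minimal simulation sketch of this exploitation mechanic, again with the invented numbers and a deliberately simple keeper that best-responds to the kicker’s observed frequencies (none of this is from the original comment): a deterministic kicker is exploited down to its worst-case scoring rate, while a kicker playing the Nash mixture is not.

```python
import random

# Illustrative iterated version of B: the keeper tracks the kicker's
# empirical kick frequency and dives to minimize the expected score.
score = {("L", "L"): 0.50, ("L", "R"): 0.95, ("R", "L"): 0.90, ("R", "R"): 0.30}

def keeper_best_response(p_left):
    # Keeper minimizes the kicker's expected scoring probability.
    ev_dive = lambda d: p_left * score[("L", d)] + (1 - p_left) * score[("R", d)]
    return min(("L", "R"), key=ev_dive)

def run(kicker_p_left, rounds=10000, seed=0):
    rng = random.Random(seed)
    kicks_left, total, goals = 1, 0, 0.0  # pseudocount gives an initial estimate
    for _ in range(rounds):
        dive = keeper_best_response(kicks_left / (total + 2))
        kick = "L" if rng.random() < kicker_p_left else "R"
        goals += score[(kick, dive)]
        kicks_left += (kick == "L")
        total += 1
    return goals / rounds

print("always left :", run(1.0))    # exploited: keeper learns to dive left -> ~0.50
print("Nash mixture:", run(0.571))  # ~0.67: the equilibrium mixture is unexploitable
```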
Hope this clarifies what I was thinking about. I think it is also true that, if the prior you use in your LCDT construction is that everybody is playing according to a Nash equilibrium, then the agent may end up playing exactly that under LCDT.
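A quick numerical check of that last claim, under the same made-up payoffs: if the prior says the keeper plays his equilibrium mixture, the two kick directions have equal expected utility in the planning world, so any mixture over them, including the Nash mixture, is among the optimal policies.

```python
# Check: against the keeper's equilibrium mixture q*, the kicker is indifferent.
score = {("L", "L"): 0.50, ("L", "R"): 0.95, ("R", "L"): 0.90, ("R", "R"): 0.30}
q_star = 0.65 / 1.05  # keeper's equilibrium prob of diving left (~0.619)
for kick in ("L", "R"):
    ev = q_star * score[(kick, "L")] + (1 - q_star) * score[(kick, "R")]
    print(kick, round(ev, 6))  # both 0.671429 -> any mixture is optimal
```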
(I plan to comment on your question about incoherence in a few days.)
See the comment here for my take.