paulfchristiano comments on Some work on connecting UDT and Reinforcement Learning

paulfchristiano 29 Jan 2016 19:36 UTC
0 points
0
AF
The difference between EDT and CDT only appears when there are non-causal correlations between the environment and the agent’s choice of policy. But in the setting you described, the only impact of the policy is on the agent’s actions, which then causally affect the environment. In this setting, EDT always makes the same recommendation as CDT, since conditioning on the action is the same as causally intervening on it.
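
To make the "conditioning is the same as intervention" claim concrete, here is a minimal sketch, not taken from the original post. It assumes a toy model with a binary hidden state `h`, a binary action `a`, and a binary outcome `o`; all names and probabilities are made up for illustration. When the action is independent of `h`, conditioning and intervening agree; when the action is correlated with `h`, they come apart.

```python
from itertools import product

def joint(p_h, p_a_given_h, p_o_given_ah):
    """Enumerate P(h, a, o) for binary hidden state h, action a, outcome o."""
    table = {}
    for h, a, o in product((0, 1), repeat=3):
        ph = p_h if h == 1 else 1 - p_h
        pa = p_a_given_h[h] if a == 1 else 1 - p_a_given_h[h]
        po = p_o_given_ah[(a, h)] if o == 1 else 1 - p_o_given_ah[(a, h)]
        table[(h, a, o)] = ph * pa * po
    return table

def conditional(table, a):
    """EDT-style quantity: P(o=1 | a), obtained by conditioning on the action."""
    num = sum(p for (h, a2, o), p in table.items() if a2 == a and o == 1)
    den = sum(p for (h, a2, o), p in table.items() if a2 == a)
    return num / den

def interventional(p_h, p_o_given_ah, a):
    """CDT-style quantity: P(o=1 | do(a)), averaging over h with its prior."""
    return (1 - p_h) * p_o_given_ah[(a, 0)] + p_h * p_o_given_ah[(a, 1)]

# Illustrative numbers only (assumptions, not from the post).
p_h = 0.5
p_o = {(0, 0): 0.2, (0, 1): 0.6, (1, 0): 0.3, (1, 1): 0.9}

# Case 1: the action is independent of h (no non-causal correlation).
independent = joint(p_h, {0: 0.5, 1: 0.5}, p_o)
# Case 2: the action is correlated with h (e.g. via a common cause).
correlated = joint(p_h, {0: 0.1, 1: 0.9}, p_o)

gap_independent = abs(conditional(independent, 1) - interventional(p_h, p_o, 1))
gap_correlated = abs(conditional(correlated, 1) - interventional(p_h, p_o, 1))
print(gap_independent, gap_correlated)
```

In the first case the gap is zero up to floating-point error, so an EDT-style and a CDT-style evaluation rank actions identically; in the second case the gap is nonzero.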

UDT also makes the same choices as EDT, because the “behavior after observing X” only affects what happens after observing X. So it doesn’t matter whether we update.

I might be missing some aspect of the model that correlates decisions and outcomes though.

One candidate is forgetfulness. If you don’t keep a record of your past states, then your decision not only affects what happens in the future but also affects your beliefs about which state you are currently in. So I guess if we have forgetting, then the model you described can capture a difference.

It seems like RL with forgetting is generally a pretty subtle subject. As far as I can tell, direct policy search is the only algorithm people use in this setting. If episodes are independent, this is the same as UDT (not CDT).
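
As a rough illustration of direct policy search under forgetting, here is a sketch using the absent-minded driver problem. That problem is a standard toy example I am substituting in, not something from the original post, and the payoffs (0 for exiting at the first intersection, 4 for exiting at the second, 1 for never exiting) are the conventional ones, assumed here. The agent cannot tell the two intersections apart, so a policy is just a single probability `p` of continuing.

```python
def expected_payoff(p):
    """Expected return when the agent continues with probability p at every
    intersection (it cannot tell the two intersections apart)."""
    exit_first = (1 - p) * 0.0        # exit at the first intersection
    exit_second = p * (1 - p) * 4.0   # continue once, then exit
    never_exit = p * p * 1.0          # continue at both intersections
    return exit_first + exit_second + never_exit

# Direct policy search: score each candidate policy as a whole by its
# expected return over an episode, then keep the best one.
candidates = [i / 1000 for i in range(1001)]
best_p = max(candidates, key=expected_payoff)
best_value = expected_payoff(best_p)
print(best_p, best_value)
```

The search evaluates whole policies from the start of an episode rather than reasoning from inside a forgotten state, which is the sense in which it lines up with choosing a policy without updating.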
- IAFF-User-111 11 Feb 2016 21:40 UTC
  LW: 2 AF: 1
  0
  AF
  “But in the setting you described, the only impact of the policy is on the agent’s actions”
  
  I don’t think so. $P_{M}(\zeta \mid \pi)$ is meant to describe the distribution over trajectories given a policy (according to the model). Unless I’m missing something, the model could contain non-causal correlations.
  - paulfchristiano 17 Feb 2016 23:43 UTC
    0 points
    0
    AF
    I see; you’re right.
    
    You mention that $P_{M}$ could reflect the true dynamics of the environment; I read that and assumed it was a causal model mapping a (state, action) pair to the next state. But if it captures a more general state of uncertainty, then this does pick up the difference between EDT/UDT and CDT.
    
    Note that if $P_{M}$ reflects the agent’s logical uncertainty about its own behavior, then we can’t generally expand the expectations out as an integral over possible trajectories.
    
    For example, if I train one model to map x to E[y(x)|x], and I train another model to map (x,b) to P(y(x)=b), then the two quantities won’t generally be related by integration.
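    
    To spell out what "related by integration" would mean here, write $f$ and $g$ for the two trained models (hypothetical names, introduced only for this sketch). Consistency would require the first model to equal the mean of the distribution given by the second:

    ```latex
    f(x) \approx E[y(x) \mid x], \qquad g(x, b) \approx P(y(x) = b),
    \qquad \text{consistency requires} \quad f(x) = \sum_{b} b \, g(x, b).
    ```

    Nothing in training $f$ and $g$ separately enforces this identity, so in general $f(x) \neq \sum_{b} b \, g(x, b)$.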
    
    When thinking about the connection between theoretical frameworks and practical algorithms, I think this would be an interesting issue to push on.