Learning Impact in RL
I present a simple Deep-RL-flavoured idea for learning an agent's impact that I'm thinking of trying out. I don't, at the moment, think it's very satisfying from a safety point of view, but I think it's at least somewhat relevant, so I'm posting it here for feedback, if you're interested.
IDEA: Instead of learning
P(s_{t+1} | s_t, a_t)
with a single network, learn it as
P(s_{t+1} | s_t, a_t) = I(s_{t+1} | s_t, a_t) ⨁ T(s_{t+1} | s_t).
The ⨁ could mean mixing the distributions, adding the preactivations, or adding the samples from T and I. I think adding the samples probably makes the most sense in most cases.
Now, I is trained to capture the agent's impact, and T should learn the "passive dynamics". Apparently things like this have been tried before (though not using DL, as far as I know), e.g. https://papers.nips.cc/paper/3002-linearly-solvable-markov-decision-problems.pdf
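As a concrete, purely illustrative sketch of the "adding the samples" reading of ⨁, the decomposition could look something like the PyTorch snippet below. The network sizes, the use of point predictions rather than full distributions, and the names DecomposedDynamics / dynamics_loss are my own placeholders, not anything specified above.

```python
import torch
import torch.nn as nn


class DecomposedDynamics(nn.Module):
    """Illustrative sketch: predict s_{t+1} as T(s_t) + I(s_t, a_t),
    i.e. the "adding the samples" reading of ⨁, with both nets emitting
    point predictions rather than full distributions (an assumption)."""

    def __init__(self, state_dim: int, action_dim: int, hidden: int = 128):
        super().__init__()
        # T: passive dynamics, conditioned on the state only.
        self.T = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, state_dim),
        )
        # I: the agent's impact, conditioned on state and action.
        self.I = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, state_dim),
        )

    def forward(self, s, a):
        passive = self.T(s)
        impact = self.I(torch.cat([s, a], dim=-1))
        # Predicted next state is the sum of the two contributions.
        return passive + impact, impact


def dynamics_loss(model, s, a, s_next):
    # Standard next-state regression on a batch of transitions (s, a, s_next).
    pred, _ = model(s, a)
    return ((pred - s_next) ** 2).mean()
```

Note that the loss only constrains the sum; the hope is that the asymmetric conditioning (T never sees the action) pushes the action-dependent effects into I.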
If we do a good job of disentangling an agent's impact from the passive dynamics, then we get a natural way to implement reduced-impact approaches.
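For instance, one hypothetical way to use the decomposition for reduced impact would be to shape the reward with the size of the learned impact term; the L2 norm and the coefficient beta below are illustrative choices of mine, not something the idea above commits to.

```python
import torch


def shaped_reward(task_reward, impact, beta: float = 0.1):
    """Hypothetical reduced-impact shaping: subtract a penalty proportional to
    the size of the learned impact term I(s_{t+1} | s_t, a_t).
    Both the L2 norm and the coefficient beta are illustrative assumptions."""
    return task_reward - beta * torch.linalg.norm(impact, dim=-1)
```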
This idea was inspired by internal discussions at MILA/RLLAB and the Advantage-function formulation of value-based RL.