Learning Impact in RL

I present a simple Deep-RL-flavoured idea for learning an agent’s impact that I’m thinking of trying out. I don’t currently think it’s very satisfying from a safety point of view, but I think it’s at least a bit relevant, so I’m posting it here for feedback, in case it’s of interest.

IDEA: Instead of learning the transition model

$$P(s_{t+1} \mid s_t, a_t)$$

with a single network, learn it as

$$P(s_{t+1} \mid s_t, a_t) = f(s_t) \oplus g(s_t, a_t).$$

The $\oplus$ could mean mixing the distributions, adding the preactivations, or adding the samples from $f$ and $g$. I think adding the samples probably makes the most sense in most cases.
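
Here is a minimal sketch (mine, not from the post) of the decomposition using the “adding the samples” combination, written in PyTorch. The names (`GaussianHead`, `DecomposedDynamics`, `f`, `g`) and the choice of diagonal-Gaussian heads are illustrative assumptions.

```python
import torch
import torch.nn as nn


class GaussianHead(nn.Module):
    """Small MLP that parameterises a diagonal Gaussian over the state space."""

    def __init__(self, in_dim: int, state_dim: int, hidden: int = 128):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        self.mean = nn.Linear(hidden, state_dim)
        self.log_std = nn.Linear(hidden, state_dim)

    def sample(self, x: torch.Tensor) -> torch.Tensor:
        h = self.body(x)
        mean = self.mean(h)
        log_std = self.log_std(h).clamp(-5.0, 2.0)
        # Reparameterised sample, so gradients flow into both networks.
        return mean + log_std.exp() * torch.randn_like(mean)


class DecomposedDynamics(nn.Module):
    """Predicts s_{t+1} as (a sample from) f(s_t) plus (a sample from) g(s_t, a_t)."""

    def __init__(self, state_dim: int, action_dim: int):
        super().__init__()
        self.f = GaussianHead(state_dim, state_dim)               # "passive dynamics"
        self.g = GaussianHead(state_dim + action_dim, state_dim)  # agent's impact

    def forward(self, s: torch.Tensor, a: torch.Tensor) -> torch.Tensor:
        # "Adding the samples": one sample from f plus one sample from g.
        return self.f.sample(s) + self.g.sample(torch.cat([s, a], dim=-1))
```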

Now, $g$ is trained to capture the agent’s impact, and $f$ should learn the “passive dynamics”. Apparently things like this have been tried before (not using DL, AFAIK, though), e.g. https://papers.nips.cc/paper/3002-linearly-solvable-markov-decision-problems.pdf
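
As a hedged illustration, both networks could be fit jointly from logged transitions $(s_t, a_t, s_{t+1})$ as below (assuming the `DecomposedDynamics` sketch above); note that nothing in this loss by itself forces $f$ to learn only the passive dynamics.

```python
import torch


def dynamics_loss(model: "DecomposedDynamics",
                  s: torch.Tensor, a: torch.Tensor, s_next: torch.Tensor) -> torch.Tensor:
    pred = model(s, a)                    # sample from f  +  sample from g
    return ((pred - s_next) ** 2).mean()  # simple MSE; a proper NLL would also work


# One optimisation step over a batch of transitions:
#   opt = torch.optim.Adam(model.parameters(), lr=1e-3)
#   loss = dynamics_loss(model, s_batch, a_batch, s_next_batch)
#   opt.zero_grad(); loss.backward(); opt.step()
```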

If we do a good job of disentangling an agent’s impact from the passive dynamics, then we can do reduced-impact in a natural way.
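
One possible way to do that (my own hedged sketch, not something the post specifies) is to penalise the magnitude of the agent term $g$’s contribution to the predicted next state. Here `model` is a `DecomposedDynamics` instance as sketched above and `impact_coef` is a hypothetical hyperparameter.

```python
import torch


def impact_penalised_reward(reward: torch.Tensor,
                            model: "DecomposedDynamics",
                            s: torch.Tensor,
                            a: torch.Tensor,
                            impact_coef: float = 0.1) -> torch.Tensor:
    with torch.no_grad():
        # Impact measure: norm of g's (sampled) contribution to s_{t+1}.
        impact = model.g.sample(torch.cat([s, a], dim=-1)).norm(dim=-1)
    return reward - impact_coef * impact
```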

This idea was inspired by internal discussions at MILA/RLLAB and the Advantage-function formulation of value-based RL.