Learning Impact in RL
I present a simple Deep-RL-flavoured idea for learning an agent's impact that I'm thinking of trying out. I don't, at the moment, think it's very satisfying from a safety point of view, but I think it's at least somewhat relevant, so I'm posting it here for feedback, if you're interested.
IDEA: Instead of learning
P(s_{t+1} | s_t, a_t)
with a single network, learn it as
P(s_{t+1} | s_t, a_t) = I(s_{t+1} | s_t, a_t) ⨁ T(s_{t+1} | s_t).
The ⨁ could mean mixing the distributions, adding the preactivations, or adding the samples from T and I. I think adding the samples probably makes the most sense in most cases.
Now, I is trained to capture the agent's impact, and T should learn the "passive dynamics". Apparently things like this have been tried before (though not using DL, as far as I know), e.g. https://papers.nips.cc/paper/3002-linearly-solvable-markov-decision-problems.pdf
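As a concrete, purely illustrative sketch of the "adding the samples" reading of ⨁, the decomposition could look something like the PyTorch snippet below. The network sizes, the use of point predictions rather than full distributions, and the names DecomposedDynamics / dynamics_loss are my own placeholders, not anything specified above.

```python
import torch
import torch.nn as nn


class DecomposedDynamics(nn.Module):
    """Illustrative sketch: predict s_{t+1} as T(s_t) + I(s_t, a_t),
    i.e. the "adding the samples" reading of ⨁, with both nets emitting
    point predictions rather than full distributions (an assumption)."""

    def __init__(self, state_dim: int, action_dim: int, hidden: int = 128):
        super().__init__()
        # T: passive dynamics, conditioned on the state only.
        self.T = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, state_dim),
        )
        # I: the agent's impact, conditioned on state and action.
        self.I = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, state_dim),
        )

    def forward(self, s, a):
        passive = self.T(s)
        impact = self.I(torch.cat([s, a], dim=-1))
        # Predicted next state is the sum of the two contributions.
        return passive + impact, impact


def dynamics_loss(model, s, a, s_next):
    # Standard next-state regression on a batch of transitions (s, a, s_next).
    pred, _ = model(s, a)
    return ((pred - s_next) ** 2).mean()
```

Note that the loss only constrains the sum; the hope is that the asymmetric conditioning (T never sees the action) pushes the action-dependent effects into I.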
If we do a good job of disentangling an agent's impact from the passive dynamics, then we get a natural way to implement reduced-impact approaches.
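For instance, one hypothetical way to use the decomposition for reduced impact would be to shape the reward with the size of the learned impact term; the L2 norm and the coefficient beta below are illustrative choices of mine, not something the idea above commits to.

```python
import torch


def shaped_reward(task_reward, impact, beta: float = 0.1):
    """Hypothetical reduced-impact shaping: subtract a penalty proportional to
    the size of the learned impact term I(s_{t+1} | s_t, a_t).
    Both the L2 norm and the coefficient beta are illustrative assumptions."""
    return task_reward - beta * torch.linalg.norm(impact, dim=-1)
```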
This idea was inspired by internal discussions at MILA/RLLAB and the Advantage-function formulation of value-based RL.