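A policy outputs a distribution over pairs of the form (0, a) or (1, a), and equations 3 and 4 define what this distribution is for the imitator. If it outputs (0, a), that means the first component is 0 and the action is a; if it outputs (1, a), that means the first component is 1 and the action is a. When I say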
The 0 on the l.h.s. means the imitator is picking the action itself instead of deferring to the demonstrator,
that’s just describing the difference between equations 3 and 4. Look at equation 4 to see that when the first component of the output is 1, the distribution over the action is equal to that of the demonstrator. So we describe that behavior as “deferring to the demonstrator”. When the first component is 0 (equation 3), the distribution over the action is something else, so we say the imitator is “picking its own action”.
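To make that reading concrete, here is a minimal sketch of a policy over (first component, action) pairs. It is not equations 3 and 4: the probabilities and the names `demonstrator`, `own_choice`, `imitator_policy`, and `action_given_flag` are all made up for illustration, and I'm assuming the mass on (1, a) is the demonstrator's probability of a scaled by the total probability of deferring.

```python
# Hypothetical numbers for illustration only; these are NOT equations 3 and 4.
demonstrator = {"left": 0.7, "right": 0.3}  # demonstrator's action distribution
own_choice = {"left": 0.2, "right": 0.1}    # placeholder mass on (0, a) pairs


def imitator_policy(demonstrator, own_choice):
    """Return a joint distribution over (flag, action) pairs.

    flag == 0: the imitator picks the action itself.
    flag == 1: the imitator defers to the demonstrator.
    """
    defer_mass = 1.0 - sum(own_choice.values())  # total probability of flag == 1
    joint = {(0, a): p for a, p in own_choice.items()}
    # Assumed form: mass on (1, a) is proportional to the demonstrator's P(a).
    joint.update({(1, a): defer_mass * p for a, p in demonstrator.items()})
    return joint


def action_given_flag(joint, flag):
    """Condition the joint distribution on the first component."""
    total = sum(p for (f, _), p in joint.items() if f == flag)
    return {a: p / total for (f, a), p in joint.items() if f == flag}


joint = imitator_policy(demonstrator, own_choice)
deferring = action_given_flag(joint, 1)   # matches the demonstrator
own_action = action_given_flag(joint, 0)  # "something else"
```

Conditioning on the first component being 1 recovers the demonstrator's distribution exactly, which is the sense of "deferring"; conditioning on 0 gives whatever the imitator's own rule says.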
The 0 on the l.h.s. means the imitator is picking the action itself instead of deferring to the demonstrator or picking one of the other actions???
The 0 means the imitator is picking the action itself, and the a means the action it picks is a rather than some other action.
I don’t think this is a problem. There will be plenty of them, but when they’re wrong they’ll get removed from the posterior.
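As a rough illustration of wrong candidates dropping out, here is a minimal sketch of a Bayesian update over a finite set of hypothetical models, followed by a pruning step. I'm assuming "them" refers to candidate models in the posterior. The model names, the likelihood values, and the relative cutoff in `prune` are all assumptions made up for this example, not the actual rule.

```python
def update_posterior(posterior, likelihoods):
    """One Bayes update: weight each model by the probability it gave the observation."""
    unnorm = {m: posterior[m] * likelihoods[m] for m in posterior}
    total = sum(unnorm.values())
    return {m: w / total for m, w in unnorm.items()}


def prune(posterior, cutoff):
    """Keep models whose weight is at least `cutoff` times the largest weight.

    The relative cutoff is a hypothetical choice for this sketch.
    """
    best = max(posterior.values())
    kept = {m: w for m, w in posterior.items() if w >= cutoff * best}
    total = sum(kept.values())
    return {m: w / total for m, w in kept.items()}


# Two hypothetical models with a uniform prior.
posterior = {"good": 0.5, "bad": 0.5}
# Probability each model assigned to the action that was actually observed.
per_step_likelihood = {"good": 0.8, "bad": 0.1}

for _ in range(5):  # five observations, each better predicted by "good"
    posterior = update_posterior(posterior, per_step_likelihood)

kept = prune(posterior, cutoff=0.01)
```

After a handful of observations the model that keeps predicting badly has negligible weight and falls below the cutoff, so having many such candidates at the start does no lasting harm.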