So let’s start with a basic example, an agent that has two actions, “don’t” and “do”. Suppose it has an output neuron that contains the logit for which action to take, and for simplicity’s sake (I’ll address this at the end of the post) let’s assume that this output neuron is controlled by a single weight w which represents its bias. So this means that the p variable described in the OP expands to: p = P(a=do) = sigmoid(w).
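To make this concrete, here's a minimal sketch of the toy setup in Python (the particular value of w is arbitrary, just for illustration):

```python
import math

def sigmoid(w):
    # logistic function: maps the logit w to a probability in (0, 1)
    return 1.0 / (1.0 + math.exp(-w))

# p = P(a = "do") as a function of the single bias weight w
w = 0.5
p = sigmoid(w)
```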

We can then compute ∇_w p(w) = p(w)(1−p(w)). And, hmm, this actually implies that ntvar(p) = p^2(1−p)^2, rather than the p(1−p) that my intuition suggested, I think? The difference is basically that p^2(1−p)^2 is flatter than p(1−p), especially in the tails, where the former goes to 0 quadratically while the latter goes to 0 linearly.
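We can sanity-check that derivative numerically; this is just a finite-difference check of dp/dw = p(1−p), nothing specific to this setup:

```python
import math

def sigmoid(w):
    return 1.0 / (1.0 + math.exp(-w))

def grad_p(w):
    # analytic derivative of the sigmoid: p(w) * (1 - p(w))
    p = sigmoid(w)
    return p * (1 - p)

# finite-difference check at a few points
eps = 1e-6
for w in [-3.0, 0.0, 1.7]:
    numeric = (sigmoid(w + eps) - sigmoid(w - eps)) / (2 * eps)
    assert abs(numeric - grad_p(w)) < 1e-6
```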

One thing I would wonder is what happens during training, if we e.g. use policy gradients and give a reward of 1 for do and a reward of −1 for don’t. The update rule for policy gradients is basically r ∇_w log p(w), and taking the expectation over both actions, this according to Wolfram Alpha expands to 2/((1+e^−w)(1+e^w)), which we can further simplify to 2p(1−p). But we would have to square it to get ntvar(r), so I guess the same point applies here as before. 🤷
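A quick check that the expected policy-gradient update under this reward scheme really comes out to 2p(1−p) (assuming r = +1 for do and r = −1 for don’t, as above):

```python
import math

def sigmoid(w):
    return 1.0 / (1.0 + math.exp(-w))

def expected_update(w):
    # E[r * d/dw log P(a)] over the two actions, with
    # r = +1 for "do" and r = -1 for "don't".
    # d/dw log p = 1 - p ;  d/dw log(1 - p) = -p
    p = sigmoid(w)
    return p * (+1) * (1 - p) + (1 - p) * (-1) * (-p)

for w in [-2.0, 0.0, 1.3]:
    p = sigmoid(w)
    assert abs(expected_update(w) - 2 * p * (1 - p)) < 1e-12
```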

Anyway, obviously this is massively simplified, because we are assuming a trivial neural network. In a nontrivial one, I think the principle would be the same, due to the chain rule, which multiplies a factor of p(1−p) onto whatever gradients exist before the final output neuron.
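To illustrate the chain-rule point, here's a toy check with a hypothetical one-hidden-unit network (the tanh unit and its parameter values are made up purely for illustration): the gradient with respect to an upstream weight picks up exactly the p(1−p) factor from the output sigmoid.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# hypothetical tiny network: logit z = v * tanh(u * x)
def p_of(u, v, x):
    return sigmoid(v * math.tanh(u * x))

# By the chain rule, dp/dv = p(1-p) * tanh(u*x): the output
# sigmoid tacks its p(1-p) factor onto the inner gradient.
u, v, x = 0.7, -1.2, 0.9
eps = 1e-6
numeric = (p_of(u, v + eps, x) - p_of(u, v - eps, x)) / (2 * eps)
p = p_of(u, v, x)
analytic = p * (1 - p) * math.tanh(u * x)
assert abs(numeric - analytic) < 1e-6
```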
