Actually, upon further thought: for something like policy gradients, in the limit where the probability p is close to 0, ntvar(a) would probably be more like O(p²)? You get one factor of p from the probability, and an additional factor of p(1−p) = O(p) from the derivative of the sigmoid/softmax, which multiplies out to O(p²).
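A quick numerical sanity check of that scaling, as a rough sketch only. It assumes the simplest case of a single logit passed through a sigmoid, and the helper names `sigmoid` and `scaling_product` are made up for illustration. It just multiplies the two factors mentioned above (p and p(1−p)) and compares the product to p²; it does not compute ntvar(a) itself.

```python
import math


def sigmoid(x):
    # Standard logistic function; p = sigmoid(logit).
    return 1.0 / (1.0 + math.exp(-x))


def scaling_product(logit):
    # Hypothetical helper: returns p and the product of the two factors
    # from the argument above, p (probability) times p(1-p) (sigmoid derivative).
    p = sigmoid(logit)
    dp_dlogit = p * (1.0 - p)
    return p, p * dp_dlogit


# More negative logits push p toward 0.
for logit in (-2.0, -4.0, -8.0):
    p, prod = scaling_product(logit)
    print(f"p={p:.3e}  p*dp/dlogit={prod:.3e}  ratio to p^2={prod / p**2:.4f}")
```

The ratio is exactly 1−p, so it approaches 1 as p goes to 0, which is what O(p²) means here.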