Briefly Extending Differential Optimization to Distributions
I’ve done some work on a definition of optimization which applies to “trajectories” in deterministic, differentiable models. What happens when we try and introduce uncertainty?
Suppose we have the following system consisting of three variables: the past $P$, the future $F$, and some agent $A$. The agent "acts" on the system to push the value of $F$ 80% of the way towards zero. We can think of this as follows: $A = 0.8P$, $F = P - A$. Under these circumstances, $\frac{\partial F}{\partial P}\big|_{A \text{ varies}} \big/ \frac{\partial F}{\partial P}\big|_{A \text{ constant}} = 0.2$, which means our optimization function gives: $\mathrm{Op}(P,F;A) = -\log(|0.2|) \approx 1.61$ nats.
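As a quick numerical sanity check (a sketch, with the derivatives taken by finite differences rather than symbolically), we can confirm the derivative ratio and the resulting value of $\mathrm{Op}$:

```python
import math

def F(P, A):
    return P - A

eps = 1e-6
P = 1.0

# Derivative of F with respect to P when A is allowed to respond (A = 0.8P)...
dF_A_varies = (F(P + eps, 0.8 * (P + eps)) - F(P, 0.8 * P)) / eps
# ...and when A is held fixed at its original value.
dF_A_const = (F(P + eps, 0.8 * P) - F(P, 0.8 * P)) / eps

ratio = dF_A_varies / dF_A_const   # 0.2
op = -math.log(abs(ratio))         # ≈ 1.61 nats
```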
What if we instead consider a normal distribution over $P$? This must be parameterized by a mean $\mu_P$ and a standard deviation $\sigma_P$. Our formulae now look like this:
$P \sim \mathcal{N}(\mu_P, \sigma_P)$
$A \sim \mathcal{N}(0.8\mu_P, 0.8\sigma_P)$
$F \sim \mathcal{N}(0.2\mu_P, 0.2\sigma_P)$
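A Monte Carlo check of these distributions (the particular values of $\mu_P$ and $\sigma_P$ below are arbitrary, chosen only for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
mu_P, sigma_P = 3.0, 2.0   # illustrative values, not from the model itself

P = rng.normal(mu_P, sigma_P, size=1_000_000)
A = 0.8 * P                # A depends deterministically on P
F = P - A

# F should come out distributed as N(0.2*mu_P, 0.2*sigma_P) = N(0.6, 0.4)
print(F.mean(), F.std())
```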
So what does it look like for A to “not depend” on P? We could just “pick” some value for A but this seems like cheating. What if we set up a new model, in which F′ depends on P′ and A′, but A′ depends on P′′ instead of P′? We can allow P′ and P′′ to have the same distributions as before:
$P' \sim \mathcal{N}(\mu_P, \sigma_P)$
$P'' \sim \mathcal{N}(\mu_P, \sigma_P)$
$A' \sim \mathcal{N}(0.8\mu_P, 0.8\sigma_P)$
Calculating $F'$ is a bit more difficult. We can think of it as adding two uncorrelated normal distributions together ($P'$ and $-A'$). For normal distributions this just means adding the means and the variances. Our distributions have means $\mu_P$ and $-0.8\mu_P$, and variances $\sigma_P^2$ and $0.64\sigma_P^2$. Therefore we get a new distribution with mean $0.2\mu_P$ and variance $1.64\sigma_P^2$, which gives a standard deviation of $\sqrt{1.64}\,\sigma_P \approx 1.28\sigma_P$.
$F' \sim \mathcal{N}(0.2\mu_P, 1.28\sigma_P)$
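The same kind of Monte Carlo check works for the decoupled model, where $A'$ is driven by the independent copy $P''$ (again with illustrative, assumed values of $\mu_P$ and $\sigma_P$):

```python
import numpy as np

rng = np.random.default_rng(1)
mu_P, sigma_P = 3.0, 2.0   # same illustrative values as before

P1 = rng.normal(mu_P, sigma_P, size=1_000_000)   # P'
P2 = rng.normal(mu_P, sigma_P, size=1_000_000)   # P'', an independent copy
A = 0.8 * P2               # A' now depends on P'', not on P'
F_prime = P1 - A

# F' should have mean 0.2*mu_P = 0.6 and std sqrt(1.64)*sigma_P ≈ 2.56
print(F_prime.mean(), F_prime.std())
```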
What’s the entropy of a normal distribution? Well, it’s difficult to say properly, since entropy is poorly-defined for continuous variables. If one takes the limiting density of discrete points, one gets $\log(N) + \frac{1}{2}\log(2\pi e \sigma^2)$, where $N$ goes to infinity. This is a problem unless we happen to be subtracting one entropy from another, in which case the $\log(N)$ terms cancel. So let’s do that.
$H(F) - H(F') = \log(N) + \frac{1}{2}\log(2\pi e \sigma_F^2) - \log(N) - \frac{1}{2}\log(2\pi e \sigma_{F'}^2)$
$H(F) - H(F') = \frac{1}{2}\log(\sigma_F^2) - \frac{1}{2}\log(\sigma_{F'}^2)$
$H(F) - H(F') = \log(\sigma_F) - \log(\sigma_{F'})$
$H(F) - H(F') = \log(0.2\sigma_P) - \log(1.28\sigma_P)$
$H(F) - H(F') = \log(0.2/1.28) \approx -1.86 \text{ nats}$
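Since the $\log(N)$ terms cancel, the difference can be computed directly from the Gaussian differential-entropy formula (the value of `sigma_P` below is arbitrary; the difference does not depend on it):

```python
import math

sigma_P = 2.0   # illustrative; any positive value gives the same difference

def gaussian_entropy(sigma):
    # Differential entropy of N(mu, sigma); the divergent log(N) term
    # from the limiting density of discrete points cancels in differences.
    return 0.5 * math.log(2 * math.pi * math.e * sigma**2)

diff = gaussian_entropy(0.2 * sigma_P) - gaussian_entropy(math.sqrt(1.64) * sigma_P)
print(diff)   # log(0.2 / 1.28...) ≈ -1.86 nats
```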
Ok, so we got the sign wrong the first time. Never mind. But there is another issue: this is higher in magnitude than our previous value. This is because we’re double-counting the variance from $P$: $F'$ picks up variance from both $P'$ and $P''$. We can correct this by changing the object of study from $H(F')$ to $H(F'|P'')$. This works exactly like you’d expect: it gives a weighted average of the value of $H(F'|P''=p'')$ over all possible values of $p''$. In this case it is trivial: for any fixed value of $p''$ we get $F' \sim \mathcal{N}(\mu_P - 0.8p'', \sigma_P)$. So let’s take a look:
$H(F'|P'') - H(F) = \frac{1}{2}\log(\sigma_{F'|P''}^2) - \frac{1}{2}\log(\sigma_F^2)$
$H(F'|P'') - H(F) = \log(\sigma_{F'|P''}) - \log(\sigma_F)$
$H(F'|P'') - H(F) = \log(\sigma_P) - \log(0.2\sigma_P)$
$H(F'|P'') - H(F) = -\log(0.2) \approx 1.61 \text{ nats}$
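The conditional version can be checked the same way: the conditional standard deviation of $F'$ given $P''$ is $\sigma_P$, while $F$ has standard deviation $0.2\sigma_P$, so the difference recovers the deterministic value of $\mathrm{Op}$ (again, `sigma_P` is an arbitrary illustrative choice):

```python
import math

sigma_P = 2.0   # any positive value works; the difference is scale-free

def gaussian_entropy(sigma):
    # Differential entropy of N(mu, sigma); log(N) terms cancel in differences.
    return 0.5 * math.log(2 * math.pi * math.e * sigma**2)

# H(F'|P'') uses sigma_P (the conditional std), H(F) uses 0.2*sigma_P.
op = gaussian_entropy(sigma_P) - gaussian_entropy(0.2 * sigma_P)
print(op)   # -log(0.2) ≈ 1.61 nats, matching the deterministic Op
```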
In any Bayes-ish net-ish model, if we can get an agent’s behaviour in the following form:
We can make the following transformation, and get $\mathrm{Op}(P,F;A) = H(F'|P'') - H(F)$.
I will think more about whether this extension is properly valid. One limitation is that we cannot have multiple sets of arrows into and out of A, since this would mess with the splitting of P.