I know you’ve acknowledged Friston at the end, but I’m just commenting for other interested readers’ benefit that this is very close to Karl Friston’s active inference framework, which posits that all agents minimise the discrepancies (or prediction errors) between their internal representations of the world and their incoming sensory information through both action and perception.
It’s worth emphasising just how closely related it is. Friston’s expected free energy of a policy is $G(\pi) = \mathbb{E}_{Q(s_\tau \mid \pi)} D_{KL}\!\left[\,Q(s_\tau \mid \pi)\,\|\,Q(s_\tau \mid o_\tau)\,\right] - \mathbb{E}_{Q(s_\tau, o_\tau \mid \pi)} \ln P(o_\tau)$, where the first term is the expected information gained by following the policy and the second is the expected ‘extrinsic value’.
The extrinsic value term $-\mathbb{E}_{Q(s_\tau, o_\tau \mid \pi)} \ln P(o_\tau)$, translated into John’s notation and setup, is precisely $\mathbb{E}[-\log P(X \mid M_2) \mid M_1(\theta)]$. Where John has optimisers choosing $\theta$ to minimise the cross-entropy of $X$ under $M_2$ with respect to $X$ under $M_1$, Friston has agents choosing $\pi$ to minimise the cross-entropy of preferences ($P$) with respect to beliefs ($Q$).
What’s more, Friston explicitly thinks of the extrinsic value term $-\mathbb{E}_{Q(s_\tau, o_\tau \mid \pi)} \ln P(o_\tau)$ as a way of writing expected utility (see the image below from one of his talks). In particular, $P$ is a way of representing real-valued preferences as a probability distribution. He often constructs $P$ by writing down a utility function and then taking a softmax (as in this rat T-maze example), which is exactly what John’s construction amounts to.
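To make the equivalence concrete, here is a minimal sketch (my own illustration, not code from either Friston or John): take a utility function over outcomes, softmax it to get the preference distribution $P$, and observe that minimising $\mathbb{E}_Q[-\ln P(o)]$ is the same as maximising expected utility, up to the additive constant $\log Z$ from the softmax normalisation.

```python
import numpy as np

utilities = np.array([3.0, 1.0, 0.0])  # u(o) over three toy outcomes (made up for illustration)
P = np.exp(utilities) / np.exp(utilities).sum()  # preferences as softmax(u)

Q = np.array([0.7, 0.2, 0.1])  # beliefs about outcomes under some policy

cross_entropy = -(Q * np.log(P)).sum()    # E_Q[-ln P(o)], the (negated) extrinsic value
expected_utility = (Q * utilities).sum()  # E_Q[u(o)]
log_Z = np.log(np.exp(utilities).sum())   # softmax normalising constant

# Since ln P(o) = u(o) - log Z, we get E_Q[-ln P(o)] = log Z - E_Q[u(o)]:
assert np.isclose(cross_entropy, log_Z - expected_utility)
```

So whichever policy (or $\theta$) minimises the cross-entropy also maximises expected utility, since $\log Z$ does not depend on the choice.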
It seems that John is completely right when he speculates that he’s rediscovered an idea well-known to Karl Friston.