There is an algorithm called “Evolution strategies”, popularized by OpenAI (although I believe it already existed in some form), that can train neural networks without backpropagation and without storing multiple sets of parameters. You can view it as a genetic algorithm with a population of one, but it really is a stochastic finite-differences gradient estimator.
On supervised learning tasks it is not competitive with backpropagation, but on reinforcement learning tasks (where you can’t analytically differentiate the reward signal, so you have to estimate the gradient one way or another) it is competitive. Some follow-up works combined it with backpropagation.
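As a concrete illustration, here is a minimal sketch of the stochastic finite-differences estimator at the core of ES, using antithetic (mirrored) sampling; the function names and hyperparameters are illustrative, not OpenAI’s implementation:

```python
import numpy as np

def es_gradient(reward_fn, theta, sigma=0.1, n_pairs=50, rng=None):
    """Estimate the gradient of E[reward_fn(theta + sigma*eps)] w.r.t. theta
    using only scalar reward evaluations (no backprop, no stored population)."""
    rng = rng or np.random.default_rng()
    grad = np.zeros_like(theta)
    for _ in range(n_pairs):
        eps = rng.standard_normal(theta.shape)
        # Mirrored perturbations reduce the variance of the estimate.
        r_plus = reward_fn(theta + sigma * eps)
        r_minus = reward_fn(theta - sigma * eps)
        grad += (r_plus - r_minus) * eps
    return grad / (2.0 * sigma * n_pairs)

# Toy usage: climb a simple quadratic "reward" surface.
reward = lambda w: -np.sum((w - 3.0) ** 2)
theta = np.zeros(5)
for _ in range(300):
    theta += 0.05 * es_gradient(reward, theta)
print(theta)  # converges toward [3, 3, 3, 3, 3]
```

Only the current parameter vector is kept in memory; each perturbation is evaluated and discarded, which is what makes the method cheap to distribute.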
I wouldn’t be surprised if the brain does something similar, since the brain never really does supervised learning; it’s either unsupervised or reinforcement learning. The brain could combine local reconstruction and auto-regression learning rules (similar to layerwise-trained autoencoders, but also trying to predict future inputs rather than just reconstructing the current ones) with finite-differences gradient estimation on reward signals propagated by the dopaminergic pathways.
Not so fast.
Keep in mind that the utility function is defined only up to an arbitrary positive affine transformation, while the softmax distribution is invariant only up to shifts: P(w) ∝ exp(βU(w)) will be a different distribution depending on the inverse temperature β (the higher β, the more peaked the distribution on its mode), while in von Neumann–Morgenstern utility theory, U(w) and Û(w) ≡ βU(w) represent the same preferences for any positive β.
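A quick numerical illustration of that asymmetry (the numbers are arbitrary):

```python
import numpy as np

def softmax(u, beta=1.0):
    z = beta * u
    z = z - z.max()            # subtracting a constant (a shift) changes nothing
    p = np.exp(z)
    return p / p.sum()

u = np.array([1.0, 0.0])
print(softmax(u, beta=1.0))    # ≈ [0.73, 0.27]
print(softmax(u, beta=10.0))   # ≈ [0.99995, 0.00005]: same vNM preferences, very different distribution
print(softmax(u + 5.0))        # ≈ [0.73, 0.27]: a pure shift of U leaves the softmax unchanged
```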
It’s not exactly the same.
Let’s assume that there are two possible world states, 0 and 1, and two available actions: action A puts the world in state 0 with 99% probability (Q_A(0) = 0.99), while action B puts the world in state 0 with 50% probability (Q_B(0) = 0.5).
Let U(0) = 10⁻³ and U(1) = 0.
Under expected utility maximization, action A is clearly optimal: E[U | A] = 0.99 · 10⁻³ ≈ 0.001 > E[U | B] = 0.5 · 10⁻³ = 0.0005.
Now define P(w) ∝ exp(U(w)).
The expected log-probability (the negative cross-entropy) −H(P, Q_A) is ≈ −2.31 nats, while −H(P, Q_B) is ≈ −0.69 nats, hence under this criterion action B comes out optimal.
You do get action A as optimal if you reverse the distributions in the negative cross-entropies (−H(Q_A, P) and −H(Q_B, P)), but this does not correspond to how inference is normally done.
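For concreteness, a short script that reproduces the numbers above (my own reconstruction of the example, not anyone’s reference code):

```python
import numpy as np

U   = np.array([1e-3, 0.0])      # utilities of world states 0 and 1
Q_A = np.array([0.99, 0.01])     # outcome distribution of action A
Q_B = np.array([0.50, 0.50])     # outcome distribution of action B

# Expected utility: action A wins.
print(Q_A @ U, Q_B @ U)                    # 0.00099 > 0.0005

# Softmax target P(w) ∝ exp(U(w)) is almost uniform here.
P = np.exp(U) / np.exp(U).sum()

# Negative cross-entropies −H(P, Q): action B wins, contradicting expected utility.
print(P @ np.log(Q_A), P @ np.log(Q_B))    # ≈ −2.31 vs ≈ −0.69

# Reversed direction −H(Q, P): action A wins again.
print(Q_A @ np.log(P), Q_B @ np.log(P))    # ≈ −0.6927 vs ≈ −0.6931
```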