Garrett Baker comments on D0TheMath’s Shortform

Garrett Baker 17 Jan 2024 19:47 UTC
4 points
0
In Magna Alta Doctrina Jacob Cannell talks about exponential gradient descent as a way of approximating solomonoff induction using ANNs

While that approach is potentially interesting by itself, it’s probably better to stay within the real algebra. The Solmonoff style partial continuous update for real-valued weights would then correspond to a multiplicative weight update rather than an additive weight update as in standard SGD.

Has this been tried/evaluated? Why actually yes—it’s called exponentiated gradient descent, as exponentiating the result of additive updates is equivalent to multiplicative updates. And intriguingly, for certain ‘sparse’ input distributions the convergence or total error of EGD/MGD is logarithmic rather than the typical inverse polynomial of AGD (additive gradient descent): O(logN) vs O(1/N) or O(1/N2), and fits naturally with ‘throw away half the theories per observation’.

The situations where EGD outperforms AGD, or vice versa, depend on the input distribution: if it’s more normal then AGD wins, if it’s more sparse log-normal then EGD wins. The morale of the story is there isn’t one single simple update rule that always maximizes convergence/performance; it all depends on the data distribution (a key insight from bayesian analysis).

The exponential/multiplicative update is correct in Solomonoff’s use case because the different sub-models are strictly competing rather than cooperating: we assume a single correct theory can explain the data, and predict through an ensemble of sub-models. But we should expect that learned cooperation is also important—and more specifically if you look more deeply down the layers of a deeply factored net at where nodes representing sub-computations are more heavily shared, it perhaps argues for cooperative components.

So we get a criterion for when one should be a hedgehog versus a fox in forecasting: One should be a fox when the distributions you need to operate in are normal, or rather when it does not have long tails, and you should be a hedgehog when your input distribution is more log-normal, or rather when there may be long-tails.

This indeed makes sense. If you don’t have many outliers, most theories should agree with each other, its hard to test & distinguish between the theories, and if one of your theories does make striking predictions far different from your other theories, its probably wrong, just because striking things don’t really happen.

In contrast, if you need to regularly deal with extreme scenarios, you need theories capable of generalizing to those extreme scenarios, which means not throwing out theories for making striking or weird predictions. Striking events end up being common, so its less an indictment.