I’m pretty happy with modeling SGD on deep nets as Solomonoff induction, but it seems like the key missing ingredient is path dependence. What’s the best literature on this? Lots of high-level alignment plans rely on path dependence (shard theory, basin of corrigibility...).
Maybe the best answer is just: SGD is ~local, with locality modulated by the learning rate. But this doesn’t integrate the lottery ticket hypothesis stuff (which feels like it pushes hard against locality).
Possibly the universality of Solomonoff induction should be less relevant in this context: the constant-bounded difference in minimal program lengths between two languages is significant enough that it could represent expected utility as a Jeffrey-Bolker preference[1], where the expected utility of an event (for a bounded utility function) is given by the ratio of two different probability measures of that event.
This way, universality of an algorithmic distribution kinda corresponds to boundedness of a utility function, but the data of a utility function isn’t necessarily something to dismiss. So path dependence (or dependence on random initialization) might vaguely correspond to different languages/interpreters for Solomonoff induction. And maybe you need pairs of languages/interpreters (where some property of the specific pair is important), not just one (whose choice doesn’t matter), to talk about agents (as an alternative to how AIXI does things).
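To make the "ratio of two measures" point concrete, here is a minimal sketch on a finite outcome space. The names `P`, `Q`, `u` and the numbers are hypothetical, chosen only for illustration; the assumption is that the second measure is built as Q({w}) = u(w) · P({w}) with u positive and bounded, so that the conditional expected utility of an event comes out as Q(event) / P(event).

```python
from fractions import Fraction

# Hypothetical finite outcome space (illustrative values, not from the post).
# P is a probability measure; u is a bounded, strictly positive utility.
P = {"a": Fraction(1, 2), "b": Fraction(1, 4), "c": Fraction(1, 4)}
u = {"a": Fraction(1), "b": Fraction(3), "c": Fraction(2)}

# Second measure: Q({w}) = u(w) * P({w}). It stays positive because u > 0.
# (It is not normalized here; normalizing only rescales utility by a constant.)
Q = {w: u[w] * P[w] for w in P}


def measure(m, event):
    """Total mass that measure m assigns to an event (a set of outcomes)."""
    return sum(m[w] for w in event)


def expected_utility(event):
    """Expected utility of an event as the ratio of the two measures."""
    return measure(Q, event) / measure(P, event)


def direct_expected_utility(event):
    """Same quantity computed the usual way: E[u | event] under P."""
    p_event = measure(P, event)
    return sum(u[w] * P[w] / p_event for w in event)
```

For any nonempty event the ratio lies between the minimum and maximum of `u`, which is the sense in which a bounded utility corresponds to a bounded ratio between the two measures.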
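One way to spell out the correspondence, as a sketch rather than a worked-out result: the symbols below (universal machines $U, V$, their universal semimeasures $M_U, M_V$, the constant $c$, and the function $u$) are my own notation, and reading the ratio as a utility is an assumption about what is meant here.

```latex
% Invariance theorem: minimal program lengths agree up to an additive constant.
\[
  \lvert K_U(x) - K_V(x) \rvert \le c_{UV} \quad \text{for all } x .
\]
% Correspondingly, universal semimeasures dominate each other multiplicatively:
\[
  \frac{1}{c} \;\le\; \frac{M_U(x)}{M_V(x)} \;\le\; c
  \quad \text{for some constant } c \ge 1 \text{ and all } x .
\]
% Jeffrey-Bolker style reading (assumption): treat the ratio as a utility,
\[
  u(x) := \frac{M_U(x)}{M_V(x)}, \qquad u(x) \in \left[ \tfrac{1}{c},\, c \right],
\]
% so universality of both measures gives exactly a bounded, positive utility.
```

On this reading the choice of a single language washes out up to a constant, but the pair $(U, V)$ carries the data of a bounded utility function.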
[1] See for example J. Broome (1990), Bolker-Jeffrey Expected Utility Theory and Axiomatic Utilitarianism. The thing I’m referring to is Bolker’s Existence Theorem and possible alterations, including the case where the utility function is bounded and so both measures can remain positive.