Natural abstractions hypothesis. Most abstractions are naturally linear and compositional in some sense (why?).
One of my main current hypotheses about natural abstractions is that natural summary statistics are approximately additive across subsystems. It’s the same idea as “extensivity” in statistical physics, i.e. how energy and entropy are both approximately additive across mesoscale subsystems. And it would occur for similar reasons: if not-too-close-together parts of the system are independent given some natural abstract latent variables, then we can break the system into a bunch of mesoscale chunks with some space between them, ignore the relatively small handful of variables in between the chunks, and find that the log probability of the state is approximately additive across the chunks. That log probability is, in turn, “approximately a sufficient statistic” in some sense, because log likelihood is a universal sufficient statistic. So we get an approximate sufficient statistic which is additive across the chunks.
… unfortunately the approximation is very loose, and more generally this whole argument dovetails with open questions about how to handle approximation for natural abstractions. So the math is not yet ready for prime time. But there is at least a qualitative argument for why we’d expect additivity across subsystems from natural abstractions.
My guess is that this rough argument is the main step in understanding why linearity seems to capture natural abstractions so well empirically.
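To make the additivity argument concrete, here is a minimal numerical sketch. Everything in it (the Gaussian chunk model, the variable names) is an illustrative assumption, not part of the argument above: chunks sampled independently given a latent z have a log-likelihood that is exactly a sum of per-chunk log-likelihoods, and the sufficient statistic for z is itself a sum of per-chunk sums.

```python
# Minimal sketch of the additivity claim: if mesoscale chunks are
# independent given an abstract latent z, the log-likelihood of the whole
# state decomposes as a sum of per-chunk log-likelihoods. The Gaussian
# chunk model here is purely illustrative.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

z = 1.7                        # abstract latent variable
n_chunks, chunk_size = 5, 100  # mesoscale chunks, gap variables already dropped

# Sample each chunk i.i.d. given z: x_ij ~ Normal(z, 1).
chunks = [rng.normal(loc=z, scale=1.0, size=chunk_size) for _ in range(n_chunks)]

# Joint log-likelihood of the full state...
full_state = np.concatenate(chunks)
log_p_joint = norm.logpdf(full_state, loc=z, scale=1.0).sum()

# ...equals the sum of per-chunk log-likelihoods (additivity across chunks).
log_p_chunks = sum(norm.logpdf(c, loc=z, scale=1.0).sum() for c in chunks)
assert np.isclose(log_p_joint, log_p_chunks)

# The statistic carrying all information about z is itself additive: for
# Gaussian chunks with known variance, the sufficient statistic is the sum
# of the observations, i.e. a sum of per-chunk sums.
suff_stat = sum(c.sum() for c in chunks)
print(suff_stat / full_state.size)  # sample mean, approximately z
```

Conditional independence makes the decomposition exact here; the interesting (and mathematically unresolved) part is how loose it gets once the chunks are only approximately independent given the latents.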
I think this is a good intuition. It comes down to the natural structure of the graph and the fact that information disappears at larger distances. For dense graphs such as lattices, regions only implicitly interact through much lower-dimensional max-ent variables, which are then additive. For other causal graph structures, such as the power-law small-world graphs that are probably sensible models for many real-world datasets, you get a similar thing: each cluster can be modelled mostly independently, apart from a few long-range interactions which can be modelled as interacting with some general ‘cluster sum’. Interestingly, this is what many approximate Bayesian inference algorithms for factor graphs look like, such as the region graph construction in generalized belief propagation (http://pachecoj.com/courses/csc665-1/papers/Yedidia_GBP_InfoTheory05.pdf).
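To make the ‘cluster sum’ picture concrete, here is a toy Python sketch; the energy function and all names in it are illustrative assumptions, not anything from the region graph paper. Clusters of spins have arbitrary within-cluster couplings but interact only through their total magnetization, so the partition function factors through a low-dimensional per-cluster summary instead of requiring enumeration over the full joint state space.

```python
# Toy instance of clusters interacting only through a "cluster sum":
# the partition function computed via per-cluster sum distributions
# matches brute-force enumeration exactly, because the only cross-cluster
# coupling is through the summed magnetization. Illustrative assumption.
import itertools
from collections import defaultdict

import numpy as np

rng = np.random.default_rng(1)
n_clusters, m = 3, 4                                      # 3 clusters of 4 spins
J = [rng.normal(size=(m, m)) for _ in range(n_clusters)]  # within-cluster couplings
g = 0.3                                                   # global cluster-sum coupling

def cluster_energy(x, Jc):
    return x @ Jc @ x  # arbitrary intra-cluster energy

def total_energy(spins):
    # spins: one {-1,+1}^m array per cluster; clusters couple only via
    # the summed magnetization, a single low-dimensional statistic.
    intra = sum(cluster_energy(x, Jc) for x, Jc in zip(spins, J))
    total_mag = sum(x.sum() for x in spins)
    return intra + g * total_mag**2

# Exact partition function by brute force over all 2^(n_clusters * m) states.
states = [np.array(s) for s in itertools.product([-1, 1], repeat=m)]
Z_brute = sum(
    np.exp(-total_energy(combo))
    for combo in itertools.product(states, repeat=n_clusters)
)

# Same quantity via the summary: reduce each cluster to a table over its
# possible sums, then combine the (few) sums instead of the (many) states.
cluster_tables = []
for Jc in J:
    table = defaultdict(float)  # cluster sum -> total Boltzmann weight
    for x in states:
        table[int(x.sum())] += np.exp(-cluster_energy(x, Jc))
    cluster_tables.append(table)

Z_summary = 0.0
for sums in itertools.product(*[t.keys() for t in cluster_tables]):
    weight = np.prod([t[s] for t, s in zip(cluster_tables, sums)])
    Z_summary += weight * np.exp(-g * sum(sums) ** 2)

assert np.isclose(Z_brute, Z_summary)
```

This is, roughly, the structure that region-graph methods exploit: replace exact enumeration over joint states with computations over low-dimensional region-level summaries.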
I definitely agree it would be really nice to have the math of this all properly worked out. I think this, as well as the reason why we see power-law spectra of features so often in natural datasets (which must have a max-ent explanation), is a super common and deep feature of the world.