epicurus comments on Convergent Abstraction Hypothesis

epicurus 15 May 2026 9:04 UTC
I have converged (ha!) to similar views recently. I think it is worth trying to make this a lot more precise, actually. Let me take a simplified version of a standard ML training setup. We have a dataset D that samples a subset of all possible inputs A, with binary labels in {0,1}, and a neural network architecture that defines a parameter space and an associated function space F of functions from A to {0,1}. Each point of this function space is a “labelling function” on A, and restricting it to D gives a labelling of the training set. In particular, there is a subset S of F consisting of the “correct functions”, i.e., functions that match the labels on the training set. In practice, our optimization algorithms reliably find a point in S, i.e., they minimize loss on the training set.

Now there is also a test set, which picks out an even smaller subset T of S: the functions that also match the labels on the test set. For training to have worked, or to say that the trained net generalizes correctly, what we really mean is that the optimization algorithm finds not just a point in S, but a point in T. So we see that “correct training” naturally depends on these two nested subsets (T \subset S). And the power of neural networks is somehow really that they consistently land in a much smaller subset than fitting the training set alone would require (each test data point cuts down the size of the space by around a factor of 2).
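To make the nested sets concrete, here is a toy sketch in Python. Everything in it is an assumption chosen for illustration: the input space A is all 3-bit strings, F is every function from A to {0,1} (not the function space of any real architecture), the ground-truth labelling `target` is a majority vote, and the train/test split is arbitrary. In this unrestricted F each labelled point halves the set exactly; for a real network's function space the factor of 2 is only a rough heuristic.

```python
from itertools import product

# Hypothetical toy input space A: all 3-bit strings (8 points).
A = list(product([0, 1], repeat=3))

def target(x):
    # Assumed ground-truth labelling: majority vote of the bits.
    return int(sum(x) >= 2)

# F: every function A -> {0,1}, stored as a tuple of labels indexed like A.
F = list(product([0, 1], repeat=len(A)))

train_idx = [0, 1, 2, 3]  # positions in A used as the training set D
test_idx = [4, 5]         # positions in A used as the test set

def consistent(f, idx):
    # True if labelling f matches the target on every listed input.
    return all(f[i] == target(A[i]) for i in idx)

S = [f for f in F if consistent(f, train_idx)]  # fits the training set
T = [f for f in S if consistent(f, test_idx)]   # also fits the test set

print(len(F), len(S), len(T))  # 256 16 4
```

Here the two test points shrink S from 16 functions to 4, so an optimizer that only cared about the training set would have a 1 in 4 chance of landing in T if it picked from S uniformly.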

Does this help us make your convergent abstraction hypothesis more precise? I think so. I think the key point is that data sets are naturally generated by learners (often humans). So if we have a trained net or a human who can assign labels to data points, we can generate the training data set D by querying that learner, and similarly for the test data set.

Then, when we train a different neural network on the outputs of the first, saying that the two converge behaviorally amounts to saying that both identify the smaller set T inside S as the “important” one.

I am not sure how legible that was, I am finding this comment box hard to express mathematical ideas in...

===

Anyway, the upshot is that I think this lets us directly compare two learners. If learner A is trained on the outputs of learner B, do they generalize in similar ways? Do they find the right subsets of the function space as the effective target space?
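A minimal sketch of that comparison, in pure Python. The names and choices are hypothetical: `teacher` plays the role of learner B (a fixed linear threshold rule on a small integer grid), and the student, playing learner A, is a plain perceptron fitted only to the teacher's labels on the training inputs. Perfect agreement on the training inputs means the student is in S; perfect agreement on the held-out inputs would mean it also landed in T.

```python
import random

random.seed(0)

# Hypothetical input space: all points of a small integer grid.
inputs = [(x, y) for x in range(-3, 4) for y in range(-3, 4)]
random.shuffle(inputs)
train, held_out = inputs[:25], inputs[25:]

def teacher(p):
    # Learner B: a fixed linear threshold rule, assumed for illustration.
    return int(2 * p[0] - p[1] + 0.5 > 0)

def predict(w, p):
    # Linear threshold prediction with weights (w[0], w[1]) and bias w[2].
    return int(w[0] * p[0] + w[1] * p[1] + w[2] > 0)

def train_student(data, max_epochs=1000):
    # Learner A: a perceptron that only ever sees the teacher's labels.
    w = [0.0, 0.0, 0.0]
    for _ in range(max_epochs):
        mistakes = 0
        for p in data:
            err = teacher(p) - predict(w, p)
            if err:
                w = [w[0] + err * p[0], w[1] + err * p[1], w[2] + err]
                mistakes += 1
        if mistakes == 0:
            break  # every training label is reproduced: the student is in S
    return w

def agreement(w, points):
    # Fraction of points on which student and teacher assign the same label.
    return sum(predict(w, p) == teacher(p) for p in points) / len(points)

student = train_student(train)
print("agreement on training inputs:", agreement(student, train))
print("agreement on held-out inputs:", agreement(student, held_out))
```

The held-out agreement is the quantity of interest: it measures how often the student picks out the same labelling as the teacher on inputs it was never trained on.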