I have a (still incomplete) draft here which specifically addresses why the configuration space metaphor works. Short version: the key property of (Bayesian) clustering is that the points in a cluster are conditionally independent given the summary data of the cluster. For instance, if I have Gaussian clusters, then the points within a given cluster are independent of one another given the mean and variance of that cluster. The prototypical “clustering problem” is to assign points to clusters in such a way that this works: the Gaussian clustering problem, for instance, is to assign points to clusters so that the points in each cluster are independent given the cluster mean and variance. Since the Gaussian distribution is maxentropic subject to mean and variance constraints (i.e. it is the unique distribution for which mean and variance are sufficient statistics), this fully characterizes Gaussian clustering.
Generalizing to the “object type discovery” problem of abstraction, we want to find sets of chunks-of-the-world which are independent given some summary statistics of the chunks. So the analogy is quite strong - in fact, Bayesian clustering isn’t just an analogy, it’s an example of the problem (albeit with some additional assumptions typically thrown in, e.g. about the specific forms of the clusters).
BTW, if you buy this view and figure out a good way to explain it, you are more-than-welcome to take whatever you want from that draft and scoop me on it.
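To make the Gaussian clustering problem above concrete, here is a minimal sketch that fits a two-cluster, one-dimensional Gaussian mixture with plain expectation-maximization. This is maximum-likelihood EM rather than a fully Bayesian treatment, and the function name, the quantile initialisation and the synthetic data are all assumptions chosen for illustration, not anything taken from the draft.

```python
import numpy as np

def fit_gaussian_mixture_1d(x, k=2, n_iter=100):
    """Fit a k-component 1D Gaussian mixture with plain EM (illustrative only)."""
    x = np.asarray(x, dtype=float)
    # Assumed initialisation: spread the means over the data quantiles.
    means = np.quantile(x, (np.arange(k) + 0.5) / k)
    variances = np.full(k, x.var())
    weights = np.full(k, 1.0 / k)
    for _ in range(n_iter):
        # E-step: given each cluster's mean and variance, every point is
        # scored independently of all the other points.
        log_p = (-0.5 * (x[:, None] - means) ** 2 / variances
                 - 0.5 * np.log(2 * np.pi * variances)
                 + np.log(weights))
        log_p -= log_p.max(axis=1, keepdims=True)  # numerical stability
        resp = np.exp(log_p)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: re-estimate each cluster's summary statistics.
        nk = resp.sum(axis=0)
        weights = nk / len(x)
        means = (resp * x[:, None]).sum(axis=0) / nk
        variances = (resp * (x[:, None] - means) ** 2).sum(axis=0) / nk
        variances = np.maximum(variances, 1e-6)  # guard against collapse
    return weights, means, variances, resp.argmax(axis=1)

# Synthetic data: two well-separated clusters (made-up parameters).
rng = np.random.default_rng(42)
data = np.concatenate([rng.normal(-5.0, 1.0, 200), rng.normal(5.0, 1.0, 200)])
weights, means, variances, labels = fit_gaussian_mixture_1d(data)
print("means:", means)
print("variances:", variances)
```

The only things the fit keeps per cluster are a weight, a mean and a variance, which are exactly the summary data that the points in a cluster are supposed to be independent given.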
Your model uses correlational notions like “conditional independence” to make sense of it. But I think one could perhaps come up with an alternative model using causal notions?
Specifically: If two variables X and Y are correlated, then that is usually due to confounding, because there are a lot more ways for things to be confounded than for one to cause the other. So it makes sense to assume that they are confounded.
You could approximate all of the confounders of a suitably chosen set of observable variables by postulating a new variable, which affects all of the observables. This confounder then turns into your feature axis (if continuous) or cluster (if discrete).
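A small simulation can illustrate the confounder picture. The sketch below assumes a single hypothetical latent variable `z` with made-up linear loadings on four observables, and uses residual correlation after regressing out `z` as a linear stand-in for conditional independence. Both are assumptions for the sake of the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Hypothetical latent confounder that affects every observable.
z = rng.normal(size=n)
loadings = np.array([1.0, -2.0, 0.5, 1.5])  # made-up effect sizes
observables = z[:, None] * loadings + rng.normal(size=(n, 4))

def max_offdiag_corr(data):
    """Largest absolute correlation between two distinct columns."""
    corr = np.corrcoef(data, rowvar=False)
    return np.abs(corr - np.eye(corr.shape[0])).max()

# Marginally, the observables are correlated with one another.
marginal = max_offdiag_corr(observables)

# Regress each observable on z (least squares, no intercept) and look at
# what is left over: the residuals are nearly uncorrelated.
slopes = (observables * z[:, None]).sum(axis=0) / (z ** 2).sum()
residuals = observables - z[:, None] * slopes
conditional = max_offdiag_corr(residuals)

print("max |corr| between observables:", marginal)
print("max |corr| after conditioning on z:", conditional)
```

Here `z` plays the role of the postulated variable: a continuous `z` acts like a feature axis, and replacing it with a discrete label would give cluster membership instead.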
So, I like this, but I’m still not sure I understand where features come from.
Say I’m an AI, and I’ve observed a bunch of sensor data that I’m representing internally as the points (6.94, 3.96), (1.44, -2.83), (5.04, 1.1), (0.07, -1.42), (-2.61, -0.21), (-2.33, 3.36), (-2.91, 2.43), (0.11, 0.76), (3.2, 1.32), (-0.43, -2.67).
The part where I look at this data and say, “Hey, these datapoints become approximately conditionally independent if I assume they were generated by a multivariate normal with mean (2, -1), and covariance matrix [[16, 0], [0, 9]][1]; let me allocate a new concept for that!” makes sense. (In the real world, I don’t know how to write a program to do this offhand, but I know how to find what textbook chapters to read to tell me how.)
But what about the part where my sensor data came to me already pre-processed into the list of 2-tuples?—how do I learn that? Is it just, like, whatever transformations of a big buffer of camera pixels let me find conditional independence patterns probably correspond to regularities in the real world? Is it “that easy”??
[1] In the real world, I got those numbers from the Python expression ', '.join(str(d) for d in [(round(normal(2, 4), 2), round(normal(-1, 3), 2)) for _ in range(10)]) (using scipy.random.normal, which in older SciPy releases was simply a re-export of numpy.random.normal).
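As a rough check, the ten quoted points can be compared against the stated parameters. The sketch below computes their sample mean and covariance, then draws a fresh batch the same way the footnote describes, using `numpy.random.default_rng` in place of the original import (an assumption, since `scipy.random` is not available in current SciPy).

```python
import numpy as np

# The ten points quoted in the comment.
points = np.array([
    (6.94, 3.96), (1.44, -2.83), (5.04, 1.1), (0.07, -1.42), (-2.61, -0.21),
    (-2.33, 3.36), (-2.91, 2.43), (0.11, 0.76), (3.2, 1.32), (-0.43, -2.67),
])

sample_mean = points.mean(axis=0)
sample_cov = np.cov(points, rowvar=False)  # unbiased (n - 1) estimator
print("sample mean:", sample_mean)
print("sample covariance:")
print(sample_cov)

# A fresh draw from the same process: independent normals with standard
# deviations 4 and 3, i.e. covariance matrix [[16, 0], [0, 9]].
rng = np.random.default_rng()
fresh = [
    (round(float(rng.normal(2, 4)), 2), round(float(rng.normal(-1, 3)), 2))
    for _ in range(10)
]
print(", ".join(str(d) for d in fresh))
```

With only ten samples the estimates are noisy: the sample mean comes out at roughly (0.85, 0.58) rather than (2, -1), which is a useful reminder of how approximate the inferred concept has to be.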
Is it just, like, whatever transformations of a big buffer of camera pixels let me find conditional independence patterns probably correspond to regularities in the real world? Is it “that easy”??
Roughly speaking, yes.
Features are then typically the summary statistics associated with some abstraction. So, we look for features which induce conditional independence patterns in the big buffer of camera pixels. Then, we look for higher-level features which induce conditional independence between those features. Etc.
It’s funny that you should mention this, because I’ve considered working on a machine learning system for image recognition using this principle. However, I don’t think this is necessarily all of it. I bet we come pre-baked with a lot of rules for what sorts of features to “look for”. To give an analogy to machine learning, there’s an algorithm called pi-GAN, which comes pre-baked with the assumption that pictures originate from 3D scenes, and which then manages to learn 3D scenes from the 2D images it is trained with. (Admittedly only when the images are particularly nice.)
Yup, this seems basically right.
This is exactly right; we can interpret the abstraction model essentially along these lines as well.
This gave me a blog story idea!
“Feature Selection”