(Thinking out loud about how my categorization thing will end up relating to your abstraction thing …)
200-word recap of my thing: I’ve been relying on our standard configuration space metaphor, talking about running some “neutral” clustering algorithm on some choice of subspace (which is “value-laden” in the sense that what features you care about predicting depends on your values). This lets me explain how to think about dolphins: they simultaneously cluster with fish in one subspace, but also cluster with other mammals in a different subspace, no contradiction there. It also lets me explain what’s wrong with a fake promotion to “Vice President of Sorting”: the “what business cards say” dimension is a very “thin” subspace; if it doesn’t cluster with anything else, then there’s no reason we care. As my measurement of what makes a cluster “good”, I’m using the squared error, which is pretty “standard”—that’s basically what, say, k-means clustering is doing—but also pretty ad hoc: I don’t have a proof of why squared error and only squared error is the right calculation to be doing given some simple desiderata, and it probably isn’t. (In contrast, we can prove that if you want a monotonic, nonnegative, additive measure of information, you end up with entropy: the only free choice is the base of the logarithm.)
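(A minimal sketch of the score I mean, assuming numpy; the function name is my choice, nothing canonical:)

    import numpy as np

    def squared_error_score(points, labels):
        # within-cluster sum of squared distances to each cluster's mean;
        # this is the quantity k-means minimizes (lower = "tighter" clusters)
        score = 0.0
        for k in np.unique(labels):
            cluster = points[labels == k]
            score += ((cluster - cluster.mean(axis=0)) ** 2).sum()
        return score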
What I’m hearing from the parent and your reply to my comment on “… Ad Hoc Mathematical Definitions?”: talking about looking for clusters in some pre-chosen subspace of features is getting the actual AI challenge backwards. There are no pre-existing features in the territory; rather, conditional-independence structure in the territory is what lets us construct features such that there are clusters. Saying that we want categories that cluster in a “thick” subspace that covers many dimensions is like saying we want to measure information with “a bunch of functions like X^2, sin(Y), e^X + 2X − 1, &c., and require that those also be uncorrelated”: it probably works, but there has to be some deeper principle that explains why most of the dimensions and ad hoc information measures agree, why we can construct a “thick” subspace.
To explain why “squiggly”, “gerrymandered” categories are bad, I said that if you needed to make a decision that depended on how big an integer is, categorizing by parity would be bad: the squared-error score quantifies the fact that 2 is more similar to 3 than to 12342. But notice that the choice of feature (the decision quality depending on magnitude, not parity) is doing all the work: 2 is more similar to 12342 than to 3 in the mod-2 quotient space!
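(Two lines of Python make the flip explicit; the numbers are just the ones from the paragraph above:)

    # in magnitude space, 2 is far closer to 3 than to 12342 ...
    print(abs(2 - 3), abs(2 - 12342))                   # 1 vs. 12340
    # ... but project onto the mod-2 quotient space and the ordering flips:
    print(abs(2 % 2 - 3 % 2), abs(2 % 2 - 12342 % 2))   # 1 vs. 0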
So maybe the exact measure of “closeness” in the space (squared error, or whatever) is a red herring, an uninteresting part of the problem?—like the choice of logarithm in the definition of entropy. We know that there isn’t any principled reason why base 2 or base e is better than any others. It’s just that we’re talking about how uncertainty relates to information, so if we use our standard representation of uncertainty as probabilities from 0 to 1 under which independent events multiply, then we have a homomorphism from multiplication (of probability) to addition (of information), which means you have to pick a base for the logarithm if you want to work with concrete numbers instead of abstract nonsense.
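(A quick sketch of that homomorphism, to make “only the base is free” concrete; the event probabilities are arbitrary:)

    import math

    def info(prob, base):
        return -math.log(prob, base)   # "surprisal" in the given base

    p, q = 0.5, 0.25                   # independent events: probabilities multiply,
    for base in (2, math.e, 10):       # so information adds, whatever base you pick;
        assert math.isclose(info(p * q, base), info(p, base) + info(q, base))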
If this is a good analogy, then we’re looking for some sort of deeper theorem about “closeness” and conditional independence “and stuff” that explains why the configuration space metaphor works—after which we’ll be able to show that the choice of metric on the “space” will be knowably arbitrary??
Yup, this seems basically right.
I have a (still incomplete) draft here which specifically addresses why the configuration space metaphor works. Short version: the key property of (Bayesian) clustering is that the points in a cluster are conditionally independent given the summary data of the cluster. For instance, if I have Gaussian clusters, then each point within a given cluster is independent given the mean and variance of that cluster. The prototypical “clustering problem” is to assign points to clusters in such a way that this works. So, for instance, the Gaussian clustering problem is to assign points to clusters in such a way that the points in each cluster are independent given the cluster mean and variance. Since the Gaussian distribution is maxentropic subject to mean and variance constraints (i.e. it is the unique distribution for which mean and variance are sufficient statistics), this fully characterizes Gaussian clustering.
Generalizing to the “object type discovery” problem of abstraction, we want to find sets of chunks-of-the-world which are independent given some summary statistics of the chunks. So the analogy is quite strong - in fact, Bayesian clustering isn’t just an analogy, it’s an example of the problem (albeit with some additional assumptions typically thrown in, e.g. about the specific forms of the clusters).
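(A minimal sketch of the Gaussian case, using scikit-learn’s GaussianMixture as a stand-in for whatever clustering routine you like; the synthetic data and all names here are mine. The point is that once you condition on a point’s cluster assignment and that cluster’s mean and covariance, the model treats the points as independent draws:)

    import numpy as np
    from sklearn.mixture import GaussianMixture

    rng = np.random.default_rng(0)
    # two "object types"; within each, points are independent draws
    # GIVEN that cluster's summary statistics (mean and covariance)
    data = np.vstack([rng.normal(loc=(0, 0), scale=1.0, size=(100, 2)),
                      rng.normal(loc=(8, 8), scale=1.0, size=(100, 2))])

    gm = GaussianMixture(n_components=2, random_state=0).fit(data)
    # conditioning on these screens off everything one point in a
    # cluster could tell you about another point in the same cluster
    print(gm.means_)        # approximately [[0, 0], [8, 8]] (in some order)
    print(gm.covariances_)  # approximately the identity for each cluster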
BTW, if you buy this view and figure out a good way to explain it, you are more-than-welcome to take whatever you want from that draft and scoop me on it.
Your model uses correlational notions like “conditional independence” to make sense of it. But I think one could perhaps come up with an alternate model using causal notions?
Specifically: If two variables X and Y are correlated, then they usually are so due to confounding, because there are a lot more ways that things can be confounded than that they can be causally related. So it makes sense to assume that they are confounded.
You could approximate all of the confounders of a suitably chosen set of observable variables by postulating a new variable, which affects all of the observables. This confounder then turns into your feature axis (if continuous) or cluster (if discrete).
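(In the continuous case this is essentially the factor-analysis setup; a toy sketch, assuming scikit-learn, with arbitrary weights of my choosing:)

    import numpy as np
    from sklearn.decomposition import FactorAnalysis

    rng = np.random.default_rng(0)
    z = rng.normal(size=(500, 1))                 # the postulated confounder
    weights = np.array([[1.0, 2.0, -1.0]])        # how z drives each observable
    observed = z @ weights + rng.normal(scale=0.3, size=(500, 3))

    fa = FactorAnalysis(n_components=1).fit(observed)
    # the recovered latent axis plays the role of the "feature axis" above
    print(fa.components_)   # approximately proportional to [1, 2, -1] (up to sign)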
This is exactly right; we can interpret the abstraction model essentially along these lines as well.

So, I like this, but I’m still not sure I understand where features come from.
Say I’m an AI, and I’ve observed a bunch of sensor data that I’m representing internally as the points (6.94, 3.96), (1.44, -2.83), (5.04, 1.1), (0.07, -1.42), (-2.61, -0.21), (-2.33, 3.36), (-2.91, 2.43), (0.11, 0.76), (3.2, 1.32), (-0.43, -2.67).
The part where I look at this data and say, “Hey, these datapoints become approximately conditionally independent if I assume they were generated by a multivariate normal with mean (2, -1), and covariance matrix [[16, 0], [0, 9]][1]; let me allocate a new concept for that!” makes sense. (In the real world, I don’t know how to write a program to do this offhand, but I know how to find what textbook chapters to read to tell me how.)
But what about the part where my sensor data came to me already pre-processed into the list of 2-tuples?—how do I learn that? Is it just, like, whatever transformations of a big buffer of camera pixels let me find conditional independence patterns probably correspond to regularities in the real world? Is it “that easy”??
In the real world, I got those numbers from the Python expression ', '.join(str(d) for d in [(round(normal(2, 4), 2), round(normal(-1, 3), 2)) for _ in range(10)]) (using scipy.random.normal).
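(For what it’s worth, the “textbook chapter” answer to the fitting step above is short: the maximum-likelihood Gaussian “concept” is just the sample mean and covariance. A sketch, with the caveat that ten points give a noisy estimate:)

    import numpy as np

    points = np.array([(6.94, 3.96), (1.44, -2.83), (5.04, 1.1), (0.07, -1.42),
                       (-2.61, -0.21), (-2.33, 3.36), (-2.91, 2.43), (0.11, 0.76),
                       (3.2, 1.32), (-0.43, -2.67)])

    mean = points.mean(axis=0)           # noisy estimate of (2, -1)
    cov = np.cov(points, rowvar=False)   # noisy estimate of [[16, 0], [0, 9]]
    print(mean, cov, sep="\n")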
Is it just, like, whatever transformations of a big buffer of camera pixels let me find conditional independence patterns probably correspond to regularities in the real world? Is it “that easy”??
Roughly speaking, yes.
Features are then typically the summary statistics associated with some abstraction. So, we look for features which induce conditional independence patterns in the big buffer of camera pixels. Then, we look for higher-level features which induce conditional independence between those features. Etc.
This gave me a blog story idea!

“Feature Selection”
It’s funny that you should mention this, because I’ve considered working on a machine learning system for image recognition using this principle. However, I don’t think this is necessarily all of it. I bet we come pre-baked with a lot of rules for what sorts of features to “look for”. To give an analogy to machine learning, there’s an algorithm called pi-GAN, which comes pre-baked with the assumption that pictures originate from 3D scenes, and which then manages to learn 3D scenes from the 2D images it is trained with. (Admittedly only when the images are particularly nice.)