This comment is mostly a placeholder for a post I owe you on how words work, and in particular the “rules of the game” for socially-constructed categories. I’ll just give a few quick high-level descriptions on how the models in that eventual post diverge from the models here. Apologies in advance for not-very-good explanations.
First, “clusters in thingspace” is a metaphor. There isn’t really a canonical “thingspace” with pre-defined features along the axes; figuring out what features to use is most of the problem from the start. So, what is the (mathematical) concept for which “clusters in thingspace” is a metaphor? I think the main answer here is conditional independence. The defining feature of a cluster (under the usual Bayesian setup) is that the points comprising the cluster are independent given the summary statistics of the cluster itself. Likewise for the concepts to which we attach words: the concept-of-tree contains/points to all the summary data about particular trees which is relevant to other trees, making each tree independent given the tree-summary-statistics. For complicated real-world objects like trees, those summary statistics are high-dimensional and no human even knows all of them, but they’re still a lot lower-dimensional than all the atoms in any particular tree.
Now, we combine that with the natural abstractions hypothesis, which in this context basically says that natural-concept-space is (approximately) discrete rather than continuous. Natural concepts are not arbitrarily close together. If the concept of “tree” points to the summary statistics of a bunch of correlated chunks of the world, inducing independence between those chunks with minimal extraneous info, then there is not another arbitrarily-close summary/concept which also induces independence with minimal extraneous info. This is the key property which lets people reasonably-confidently “talk about the same thing” without needing infinitely many examples to coordinate.
So those are the rough rules governing words.
To talk about “rules of the game” for socially-constructed categories, the next step is Parable of the Dammed—I had intended to include that in the eventual post on how words work, but ended up spinning it off. The main idea there is that people can move around Schelling points by changing the underlying territory. In the case of words, the Schelling points are the natural concepts—minimal summary data which induces independence between different chunks of the world. So, the “game” is to move around those natural concepts—i.e. make some other information necessary to induce independence between world-chunks.
In particular, an interesting way to do this is to create new chunks which partially match the old natural concept. In the cluster analogy, this would mean adding new points within the cluster but biased toward one particular side. (In the trees example: we could imagine creating new tree species, or driving old tree species to extinction, or dramatically shifting the mix of trees.) Then humans get to debate whether these new points should “count” or not—the old points are still independent under the old concept-definition, but making the new points also independent requires adding new information to the concept, and some might even advocate for ignoring some of the old points and just making the concept induce independence on the new points.
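A toy numeric sketch of this move, assuming a one-dimensional cluster whose "concept" is just its mean and variance. The point values are made up for illustration and are not taken from the discussion:

```python
from statistics import fmean, pvariance

# Hypothetical 1-D "cluster": the old points are symmetric around 0.
old_points = [-2.0, -1.0, 0.0, 1.0, 2.0]
# New points land inside the old range but are biased toward one side.
new_points = [1.0, 1.5, 1.5, 2.0]


def summary(points):
    """Summary statistics (mean, variance) standing in for the concept."""
    return fmean(points), pvariance(points)


old_summary = summary(old_points)
combined_summary = summary(old_points + new_points)
new_only_summary = summary(new_points)

print("old points only: ", old_summary)
print("old + new points:", combined_summary)
print("new points only: ", new_only_summary)
```

The three summaries roughly correspond to the three positions in the debate: keep the old concept, stretch it to cover everything, or redefine it around the new points.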
As for unnatural categories...
Once we have the idea that natural concept space is discrete, and natural concepts are Schelling points for words, then the questions around unnatural categories are:
is the unnatural category a Schelling point at all?
if so, how are people recognizing the Schelling point, if not as a natural concept?
In some cases, the unnatural category might just not be a Schelling point at all. We could imagine a variant of the gold/silver bars example where people have actually-different ideas of what the words mean, so maybe there just isn’t a Schelling point and people entering contracts will need to meet the legal requirements for a “meeting of minds” some other way—i.e. writing definitions out in excruciating detail.
Alternatively, the Schelling point can be established via some mechanism other than natural concepts—e.g. passing laws about what a word means, establishing norms, etc. (Though note that these mechanisms still need some way of dealing with the very high dimensionality of word-specification space; there still needs to be some efficient way to coordinate on a high-dimensional word-meaning, so it will probably eventually ground out in other natural concepts.)
(Thinking out loud about how my categorization thing will end up relating to your abstraction thing …)
200-word recap of my thing: I’ve been relying on our standard configuration space metaphor, talking about running some “neutral” clustering algorithm on some choice of subspace (which is “value-laden” in the sense that what features you care about predicting depends on your values). This lets me explain how to think about dolphins: they simultaneously cluster with fish in one subspace, but also cluster with other mammals in a different subspace, no contradiction there. It also lets me explain what’s wrong with a fake promotion to “Vice President of Sorting”: the “what business cards say” dimension is a very “thin” subspace; if it doesn’t cluster with anything else, then there’s no reason we care. As my measurement of what makes a cluster “good”, I’m using the squared error, which is pretty “standard” (that’s basically what, say, k-means clustering is doing) but also pretty ad hoc: I don’t have a proof of why squared error and only squared error is the right calculation to be doing given some simple desiderata, and it probably isn’t. (In contrast, we can prove that if you want a monotonic, nonnegative, additive measure of information, you end up with entropy: the only free choice is the base of the logarithm.)
What I’m hearing from the parent and your reply to my comment on “… Ad Hoc Mathematical Definitions?”: talking about looking for clusters in some pre-chosen subspace of features is getting the actual AI challenge backwards. There are no pre-existing features in the territory; rather, conditional-independence structure in the territory is what lets us construct features such that there are clusters. Saying that we want categories that cluster in a “thick” subspace that covers many dimensions is like saying we want to measure information with “a bunch of functions like X^2, sin(Y), e^X + 2X − 1, &c., and require that those also be uncorrelated”: it probably works, but there has to be some deeper principle that explains why most of the dimensions and ad hoc information measures agree, why we can construct a “thick” subspace.
To explain why “squiggly”, “gerrymandered” categories are bad, I said that if you needed to make a decision that depended on how big an integer is, categorizing by parity would be bad: the squared-error score quantifies the fact that 2 is more similar to 3 than to 12342. But notice that the choice of feature (the decision quality depending on magnitude, not parity) is doing all the work: 2 is more similar to 12342 than to 3 in the mod-2 quotient space!
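As a rough illustration of the squared-error score mentioned in the recap, here is a minimal sketch of the within-cluster sum of squares that k-means minimizes. The function name and the two example partitions are hypothetical, chosen only to contrast a compact grouping with a gerrymandered one:

```python
def within_cluster_squared_error(clusters):
    """Sum, over clusters, of squared distances from each point to its cluster mean."""
    total = 0.0
    for points in clusters:
        mean = sum(points) / len(points)
        total += sum((x - mean) ** 2 for x in points)
    return total


# The same six 1-D points, partitioned two different ways.
compact = [[1.0, 2.0, 3.0], [10.0, 11.0, 12.0]]
gerrymandered = [[1.0, 10.0, 3.0], [2.0, 11.0, 12.0]]

print("compact partition:      ", within_cluster_squared_error(compact))
print("gerrymandered partition:", within_cluster_squared_error(gerrymandered))
```

The compact partition scores far lower, which is all the "goodness" measure is tracking. Nothing in the code says why squared error is the right penalty, which is the ad hoc part.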
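The integer example can be checked directly. This is a small sketch, with `magnitude` and `parity` as illustrative feature maps, showing that the same squared-error score gives opposite answers depending on which feature is fed into it:

```python
def squared_distance(a, b, feature):
    """Squared error between two integers after mapping each through `feature`."""
    return (feature(a) - feature(b)) ** 2


def magnitude(n):
    return n


def parity(n):
    return n % 2


for name, feature in [("magnitude", magnitude), ("parity", parity)]:
    d_2_3 = squared_distance(2, 3, feature)
    d_2_12342 = squared_distance(2, 12342, feature)
    closer = "3" if d_2_3 < d_2_12342 else "12342"
    print(f"{name}: d(2, 3) = {d_2_3}, d(2, 12342) = {d_2_12342}, "
          f"so 2 is closer to {closer}")
```

Under `magnitude`, 2 sits next to 3; under `parity`, 2 and 12342 are identical and 3 is the outlier. The metric never changed, only the feature.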
So maybe the exact measure of “closeness” in the space (squared error, or whatever) is a red herring, an uninteresting part of the problem?—like the choice of logarithm in the definition of entropy. We know that there isn’t any principled reason why base 2 or base e is better than any others. It’s just that we’re talking about how uncertainty relates to information, so if we use our standard representation of uncertainty as probabilities from 0 to 1 under which independent events multiply, then we have a homomorphism from multiplication (of probability) to addition (of information), which means you have to pick a base for the logarithm if you want to work with concrete numbers instead of abstract nonsense.
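A quick numeric check of the homomorphism claim, as a sketch: `surprisal` is a hypothetical helper name, and the probabilities are arbitrary. Independent probabilities multiply, their information adds, and changing the base only rescales every value by the same constant:

```python
import math


def surprisal(p, base):
    """Information content of an event with probability p, in units set by `base`."""
    return -math.log(p, base)


p, q = 0.5, 0.125  # probabilities of two independent events

for base in (2, math.e, 10):
    joint = surprisal(p * q, base)                  # information of both events together
    summed = surprisal(p, base) + surprisal(q, base)
    print(f"base {base:.3f}: joint = {joint:.4f}, sum = {summed:.4f}")

# Switching base multiplies every surprisal by the same factor.
print("ratio for p:", surprisal(p, 2) / surprisal(p, 10))
print("ratio for q:", surprisal(q, 2) / surprisal(q, 10))
```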
If this is a good analogy, then we’re looking for some sort of deeper theorem about “closeness” and conditional independence “and stuff” that explains why the configuration space metaphor works—after which we’ll be able to show that the choice of metric on the “space” will be knowably arbitrary??
Yup, this seems basically right.
I have a (still incomplete) draft here which specifically addresses why the configuration space metaphor works. Short version: the key property of (Bayesian) clustering is that the points in a cluster are conditionally independent given the summary data of the cluster. For instance, if I have Gaussian clusters, then each point within a given cluster is independent given the mean and variance of that cluster. The prototypical “clustering problem” is to assign points to clusters in such a way that this works. So, for instance, the Gaussian clustering problem is to assign points to clusters in such a way that the points in each cluster are independent given the cluster mean and variance. Since the Gaussian distribution is maxentropic subject to mean and variance constraints (i.e. it is the unique distribution for which mean and variance are sufficient statistics), this fully characterizes Gaussian clustering.
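The conditional-independence property can be seen in a small simulation. This is a sketch under simplifying assumptions: many clusters, two points each, unit variance within a cluster, and cluster means drawn from a wide Gaussian. For Gaussians, zero correlation of the residuals is a fair stand-in for independence given the mean:

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


rng = random.Random(0)
means, first, second = [], [], []
for _ in range(5000):
    mu = rng.gauss(0.0, 10.0)         # this cluster's mean (the summary statistic)
    means.append(mu)
    first.append(rng.gauss(mu, 1.0))  # two points drawn from the same cluster
    second.append(rng.gauss(mu, 1.0))

# Without the summary, points from the same cluster are strongly correlated.
marginal_corr = pearson(first, second)
# Given the cluster mean, what is left over is uncorrelated noise.
conditional_corr = pearson(
    [x - m for x, m in zip(first, means)],
    [y - m for y, m in zip(second, means)],
)

print("correlation without the cluster mean:", round(marginal_corr, 3))
print("correlation given the cluster mean:  ", round(conditional_corr, 3))
```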
Generalizing to the “object type discovery” problem of abstraction, we want to find sets of chunks-of-the-world which are independent given some summary statistics of the chunks. So the analogy is quite strong - in fact, Bayesian clustering isn’t just an analogy, it’s an example of the problem (albeit with some additional assumptions typically thrown in, e.g. about the specific forms of the clusters).
BTW, if you buy this view and figure out a good way to explain it, you are more-than-welcome to take whatever you want from that draft and scoop me on it.
Your model uses correlational notions like “conditional independence” to make sense of it. But I think one could perhaps come up with an alternative model using causal notions?
Specifically: If two variables X and Y are correlated, then they usually are so due to confounding, because there are a lot more ways that things can be confounded than that they can be causally related. So it makes sense to assume that they are confounded.
You could approximate all of the confounders of a suitably chosen set of observable variables by postulating a new variable, which affects all of the observables. This confounder then turns into your feature axis (if continuous) or cluster (if discrete).
This is exactly right; we can interpret the abstraction model essentially along these lines as well.
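One way to picture the confounder reading is a toy generative model, sketched below with made-up parameters: a hidden binary variable `z` causes both observables, which makes them correlated overall but not within either value of `z`. The hidden value is then recoverable as a cluster label with a crude threshold:

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


rng = random.Random(1)
samples = []
for _ in range(4000):
    z = rng.choice([0, 1])      # hidden confounder: which cluster this is
    center = 5.0 * z            # z causally shifts both observables
    a = rng.gauss(center, 1.0)
    b = rng.gauss(center, 1.0)
    samples.append((z, a, b))

overall_corr = pearson([a for _, a, _ in samples], [b for _, _, b in samples])
within_corr = {
    label: pearson(
        [a for z, a, _ in samples if z == label],
        [b for z, _, b in samples if z == label],
    )
    for label in (0, 1)
}

# Guess the hidden label from the observables alone with a simple threshold.
guesses = [1 if a + b > 5.0 else 0 for _, a, b in samples]
accuracy = sum(g == z for g, (z, _, _) in zip(guesses, samples)) / len(samples)

print("overall correlation:", round(overall_corr, 3))
print("within-cluster correlations:", {k: round(v, 3) for k, v in within_corr.items()})
print("label recovery accuracy:", round(accuracy, 3))
```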
So, I like this, but I’m still not sure I understand where features come from.
Say I’m an AI, and I’ve observed a bunch of sensor data that I’m representing internally as the points (6.94, 3.96), (1.44, -2.83), (5.04, 1.1), (0.07, -1.42), (-2.61, -0.21), (-2.33, 3.36), (-2.91, 2.43), (0.11, 0.76), (3.2, 1.32), (-0.43, -2.67).
The part where I look at this data and say, “Hey, these datapoints become approximately conditionally independent if I assume they were generated by a multivariate normal with mean (2, -1), and covariance matrix [[16, 0], [0, 9]][1]; let me allocate a new concept for that!” makes sense. (In the real world, I don’t know how to write a program to do this offhand, but I know how to find what textbook chapters to read to tell me how.)
But what about the part where my sensor data came to me already pre-processed into the list of 2-tuples? How do I learn that? Is it just, like, whatever transformations of a big buffer of camera pixels let me find conditional independence patterns probably correspond to regularities in the real world? Is it “that easy”??
[1] In the real world, I got those numbers from the Python expression ', '.join(str(d) for d in [(round(normal(2, 4), 2), round(normal(-1, 3), 2)) for _ in range(10)]) (using scipy.random.normal).
Roughly speaking, yes.
Features are then typically the summary statistics associated with some abstraction. So, we look for features which induce conditional independence patterns in the big buffer of camera pixels. Then, we look for higher-level features which induce conditional independence between those features. Etc.
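For the ten points quoted above, the "fit a Gaussian" step can be sketched with the standard library alone. The helper names are hypothetical, and `draw_points` swaps in `random.gauss` for the `scipy.random.normal` call mentioned in the footnote, so it produces different numbers from the same distribution:

```python
import random

points = [
    (6.94, 3.96), (1.44, -2.83), (5.04, 1.1), (0.07, -1.42), (-2.61, -0.21),
    (-2.33, 3.36), (-2.91, 2.43), (0.11, 0.76), (3.2, 1.32), (-0.43, -2.67),
]


def sample_mean(pts):
    """Component-wise mean of a list of 2-tuples."""
    n = len(pts)
    return (sum(x for x, _ in pts) / n, sum(y for _, y in pts) / n)


def sample_covariance(pts):
    """Unbiased 2x2 sample covariance matrix of a list of 2-tuples."""
    n = len(pts)
    mx, my = sample_mean(pts)
    sxx = sum((x - mx) ** 2 for x, _ in pts) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in pts) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in pts) / (n - 1)
    return [[sxx, sxy], [sxy, syy]]


def draw_points(n, rng):
    """Stand-in generator: means (2, -1), standard deviations (4, 3)."""
    return [(round(rng.gauss(2, 4), 2), round(rng.gauss(-1, 3), 2)) for _ in range(n)]


mean = sample_mean(points)
cov = sample_covariance(points)
print("estimated mean:", mean)
print("estimated covariance:", cov)
print("fresh draw:", draw_points(3, random.Random(0)))
```

With only ten points the estimated mean comes out near (0.85, 0.58), some way from the generating mean of (2, -1). That gap is ordinary sampling noise and does not mean the fit is broken.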
This gave me a blog story idea!
“Feature Selection”
It’s funny that you should mention this, because I’ve considered working on a machine learning system for image recognition using this principle. However, I don’t think this is necessarily all of it. I bet we come pre-baked with a lot of rules for what sorts of features to “look for”. To give an analogy to machine learning, there’s an algorithm called pi-GAN, which comes pre-baked with the assumption that pictures originate from 3D scenes, and which then manages to learn 3D scenes from the 2D images it is trained with. (Admittedly only when the images are particularly nice.)