This is related to something I never quite figured out in my cognitive-function-of-categorization quest. How do we quantify how good a category is at “carving reality at the joints”?
Your first guess would be “mutual information between the category-label and the features you care about” (as suggested in the Job parable in April 2019’s “Where to Draw the Boundaries?”), but that actually turns out to be wrong, because information theory has no way to give you “partial credit” for getting close to the right answer, which we want. Learning whether a number between 1 and 10 inclusive is even or odd gives you the same amount of information (1 bit) as learning whether it’s over or under 5½, but if you need to make a decision whose goodness depends continuously on the magnitude of the number, then the high/low category system is useful and the even/odd system is not: we care about putting probability-mass “close” to the right answer, not just assigning more probability to the exact answer.
In January 2021’s “Unnatural Categories Are Optimized for Deception”, I ended up going with “minimize expected squared error (given some metric on the space of features you care about)”, which seems to work, but I didn’t have a principled justification for that choice, other than it solving my partial-credit problem and it being traditional. (Why not the absolute error? Why not exponentiate this feature and then, &c.?)
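To make the partial-credit problem concrete, here is a minimal sketch of the 1-to-10 example. It assumes the number is uniformly distributed and that, once told the category, you guess the category mean; the function names are illustrative and do not come from either post.

```python
from math import log2

def mutual_information_bits(values, label):
    """I(X; C) in bits, for X uniform over `values` and C = label(X).

    Because the label is a deterministic function of X, I(X; C) = H(C).
    """
    n = len(values)
    counts = {}
    for v in values:
        counts[label(v)] = counts.get(label(v), 0) + 1
    return -sum((c / n) * log2(c / n) for c in counts.values())

def expected_squared_error(values, label):
    """Expected squared error when you guess the mean of the announced category."""
    groups = {}
    for v in values:
        groups.setdefault(label(v), []).append(v)
    total = 0.0
    for members in groups.values():
        center = sum(members) / len(members)
        total += sum((v - center) ** 2 for v in members)
    return total / len(values)

numbers = list(range(1, 11))

def parity(n):
    return n % 2

def high_low(n):
    return n > 5.5

def no_label(n):
    return 0  # baseline: a "category system" that tells you nothing

print(mutual_information_bits(numbers, parity))    # 1.0
print(mutual_information_bits(numbers, high_low))  # 1.0
print(expected_squared_error(numbers, no_label))   # 8.25
print(expected_squared_error(numbers, parity))     # 8.0
print(expected_squared_error(numbers, high_low))   # 2.0
```

Both labels carry exactly one bit, but only high/low meaningfully reduces the squared error relative to knowing nothing.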
Another possibility might have been to do something with the Wasserstein metric, which reportedly fixes the problem of information theory not being able to award “partial credit”. (The logarithmic score is the special case of the Kullback–Leibler divergence when the first distribution assigns Probability One to the actual answer, so if there’s some sense in which Wasserstein generalizes Kullback–Leibler for partial credit, then maybe that’s what I want.)
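A small sketch of what “partial credit” could look like with the Wasserstein metric, assuming the one-dimensional Wasserstein-1 (earth mover's) distance, which equals the summed absolute difference between the two CDFs. The distributions and names below are made up for illustration.

```python
from math import inf, log2

def log_score_bits(prediction, answer):
    """Logarithmic score: log-probability assigned to the actual answer."""
    p = prediction.get(answer, 0.0)
    return log2(p) if p > 0 else -inf

def wasserstein_1d(p, q, support):
    """Wasserstein-1 distance between distributions p and q on a sorted 1-D support.

    In one dimension this is the sum over gaps of |CDF_p - CDF_q| * gap width.
    """
    total = 0.0
    cdf_p = cdf_q = 0.0
    for left, right in zip(support, support[1:]):
        cdf_p += p.get(left, 0.0)
        cdf_q += q.get(left, 0.0)
        total += abs(cdf_p - cdf_q) * (right - left)
    return total

support = list(range(1, 11))
truth = {7: 1.0}       # Probability One on the actual answer
near_miss = {6: 1.0}   # confidently wrong, but close
far_miss = {1: 1.0}    # confidently wrong, and far away

print(log_score_bits(near_miss, 7), log_score_bits(far_miss, 7))  # -inf -inf
print(wasserstein_1d(truth, near_miss, support))                  # 1.0
print(wasserstein_1d(truth, far_miss, support))                   # 6.0
```

The log score cannot tell the two misses apart, while the Wasserstein distance ranks the near miss as better.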
My intuition doesn’t seem adequate to determine which formalization (one of these, or something else) captures the true nature of category-goodness, for which the other ideas are mere proxies.
(This is a good example. I’m now going to go on a tangent mostly unrelated to the post.)
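I think you were on the right track with mutual information. The key insight here is not an insight about what metric to use, it’s an insight about the structure of the world and our information about the world.
Let’s use this example:
“Learning whether a number between 1 and 10 inclusive is even or odd gives you the same amount of information (1 bit) as learning whether it’s over or under 5½, but if you need to make a decision whose goodness depends continuously on the magnitude of the number, then the high/low category system is useful and the even/odd system is not...”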
Why do we care about how big this number is, as opposed to even/odd? Let’s make the example a bit more concrete: we have a building whose walls are built from bricks, and the number of interest is how-many-bricks-tall the walls are. (Or, if you want to be really concrete, assume the wall is made of concrete blocks rather than bricks.)
Key thing to notice: in general, a lot more things are going to depend on the rough height of the wall than on the parity of the bricks, especially things far away from the wall itself (i.e. not just the motions of air molecules right next to the wall). It’s the rough height (i.e. the most-significant-bits) which is relevant to things like whether a tall person will hit their head on the ceiling, whether a bookshelf will fit, whether the building casts a shadow on the neighbor’s yard, whether I can see the building from somewhere far away, etc. By contrast, brick-parity is much less relevant to things elsewhere. If a wall is 457 inches tall, then the “4” in the hundreds place gives us more information about more other things in the world than the “7” in the ones place.
Generalizing the idea: it’s not that we care directly about how-many-bricks-tall a wall is. That is not a terminal value. If we care more about the rough wall-height than about brick-parity, that’s because the rough wall-height is more relevant to the other things which we care about in the world. And that, in turn, is because the rough wall-height is more relevant to more things in general. Information about brick-parity just doesn’t propagate very far in the causal graph of the world; it’s quickly wiped out by noise in other variables. Rough wall-height propagates further.
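One way to see “wiped out by noise” in numbers is the sketch below. It is a made-up model, not anything from the posts: the height is uniform over 0 to 999 inches, and a far-away observer only sees the height plus uniform integer noise of up to 20 inches either way. It then computes exactly how much the noisy reading says about the hundreds digit versus the parity.

```python
from collections import defaultdict
from math import log2

def mutual_information_bits(joint):
    """I(A; B) in bits, from a joint distribution {(a, b): probability}."""
    pa, pb = defaultdict(float), defaultdict(float)
    for (a, b), p in joint.items():
        pa[a] += p
        pb[b] += p
    return sum(p * log2(p / (pa[a] * pb[b]))
               for (a, b), p in joint.items() if p > 0)

def joint_with_noisy_reading(feature, heights, noise_radius):
    """Joint distribution of feature(height) and the reading height + noise."""
    noise = range(-noise_radius, noise_radius + 1)
    p = 1.0 / (len(heights) * len(noise))
    joint = defaultdict(float)
    for h in heights:
        for n in noise:
            joint[(feature(h), h + n)] += p
    return joint

heights = range(1000)   # hypothetical: wall height in inches, uniform prior
noise_radius = 20       # hypothetical: the far-away reading is off by up to 20 inches

def hundreds_digit(h):
    return h // 100

def parity(h):
    return h % 2

mi_hundreds = mutual_information_bits(
    joint_with_noisy_reading(hundreds_digit, heights, noise_radius))
mi_parity = mutual_information_bits(
    joint_with_noisy_reading(parity, heights, noise_radius))

print(f"hundreds digit: {mi_hundreds:.4f} of {log2(10):.4f} bits survive")
print(f"parity:         {mi_parity:.4f} of 1 bit survives")
```

Under these assumptions the hundreds digit should keep most of its information, while the parity bit keeps almost none.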
No, it’s not just about the information; it’s about information, our utility function, and our epistemic capabilities. Suppose I had taken ultra-high-resolution electron microscope images of one particular brick in the wall and buried the hard drives on the moon. Most of the information about the wall that isn’t near the wall is on the hard drives. But if you are trying to reach the top, and want to know how big a ladder to get, you still don’t care about my electron microscope images.
Humans don’t track the entire causal graph. We just track the fragments that are most important to achieving our utility function, given our mental limitations. A superintelligent AI might be able to track consequences of brick parity all over the place. All we know is that we can’t track it very far. If we are too far from the wall to see the brick parity, we can’t track it.
How do you distinguish the info not being there, from you being unable to see it? A function can be perfectly deterministic, but seem random to you because you can’t compute it.
The problem with the hard-drive example is that the information is only on that one hard drive, buried somewhere on the moon. It’s not about how much information is relevant far away, it’s about how many different far-away places the information is relevant to. Information which is relevant to many different neighborhoods of far-away variables is more likely to be relevant to something humans care about (because it’s relevant to many things); information which is relevant to only a few far-away chunks of variables is less likely to touch anything humans care about.
What makes wall-height interesting is that it’s relevant to a lot of different variables in the world—or, equivalently, we can learn something about the wall-height by observing many different things from many different places. If I’m standing on the lawn next door, look down and see the building’s shadow, then I’ve gained info about the building height. If I’m looking at the block from far away, and see the building over the surrounding buildings, I’ve learned something about the height. If I’m moving a couch around inside the building, and find that I have enough space to stand the couch on its end, then I’ve learned something about the height.
To put it differently: I can learn about the height from many different vantage points.
A toy model I use to study this sort of thing: we have a sparse causal network of normally distributed variables. Pick one neighborhood of variables in this network, and calculate what it tells you about the variables in some other neighborhood elsewhere in the network. The main empirical result is that, if we fix one neighborhood X and ask what information we can gain about X by examining many different neighborhoods Y1, Y2, ..., then it turns out that most of the neighborhoods Y contain approximately-the-same information about X. (Specifically: we can apply a singular value decomposition to the covariance matrix of X with each of the Y’s, and it turns out that it’s usually low-rank and that the X-side singular vectors are approximately the same for a wide variety of Y’s.) I’ll have a post on this at some point.
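A rough sketch of how such an experiment could be set up, assuming NumPy is available. This is not the actual toy model: the linear-Gaussian form, the local wiring, the weights, the sizes and every name below are assumptions made for illustration, and the code only computes the quantities described without asserting what they come out to.

```python
import numpy as np

def sparse_linear_gaussian_covariance(n_vars, n_parents, window, rng):
    """Exact covariance of x = B x + e, with e ~ N(0, I).

    B is sparse and strictly lower triangular: each variable gets a few
    parents drawn from the `window` variables just before it.
    """
    B = np.zeros((n_vars, n_vars))
    for i in range(1, n_vars):
        candidates = np.arange(max(0, i - window), i)
        k = min(n_parents, len(candidates))
        parents = rng.choice(candidates, size=k, replace=False)
        B[i, parents] = 0.5 * rng.normal(size=k)
    A = np.linalg.inv(np.eye(n_vars) - B)  # x = A e, so Cov(x) = A A^T
    return A @ A.T

def x_side_svd(cov, x_idx, y_idx):
    """SVD of the cross-covariance block between neighborhoods X and Y."""
    cross = cov[np.ix_(x_idx, y_idx)]
    U, s, _ = np.linalg.svd(cross, full_matrices=False)
    return U, s

rng = np.random.default_rng(0)
cov = sparse_linear_gaussian_covariance(n_vars=120, n_parents=3, window=10, rng=rng)

x_idx = np.arange(0, 10)
y_neighborhoods = [np.arange(start, start + 10) for start in (40, 60, 80, 100)]
results = [x_side_svd(cov, x_idx, y_idx) for y_idx in y_neighborhoods]

# Compare each Y's leading X-side singular vector against the first Y's.
reference = results[0][0][:, 0]
overlaps = [abs(float(reference @ U[:, 0])) for U, _ in results]
for y_idx, (U, s), overlap in zip(y_neighborhoods, results, overlaps):
    print(f"Y starts at {y_idx[0]:3d}: s[1]/s[0] = {s[1] / s[0]:.3g}, "
          f"overlap with first Y = {overlap:.3f}")
```

A small ratio `s[1]/s[0]` would indicate an approximately rank-one cross-covariance, and an overlap near 1 would indicate that the X-side direction is shared across the Y neighborhoods.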
In the hard drive example, the information is only in one little chunk of the world. (Well, two little chunks: the hard drive and the original brick.) By contrast, information about the wall height is contained in a wide(r) variety of other variables in other places.
Well, at least in the toy models, I can calculate exactly what information is available, and I do expect the key assumptions of these toy models to carry over to the real world. More generally, for chaotic systems (including e.g. motions of air molecules), we know that information is quickly wiped out given any uncertainty at all in the initial conditions.
If my only evidence were “it looks random”, then yes, I’d agree that’s weak evidence. Things we don’t understand look random, not mysterious. But we do have theory backing up the idea that information is quickly wiped out in the real world, given even very small uncertainty in initial conditions.
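For a feel of how quickly small uncertainty in initial conditions grows, here is a minimal sketch that uses the logistic map at r = 4 as a stand-in chaotic system. It is a standard textbook example, not a model of air molecules, and the starting values are arbitrary.

```python
def logistic_trajectory(x0, steps, r=4.0):
    """Iterate the logistic map x -> r * x * (1 - x), returning every state."""
    xs = [x0]
    for _ in range(steps):
        xs.append(r * xs[-1] * (1.0 - xs[-1]))
    return xs

# Two initial conditions that differ by one part in ten billion.
a = logistic_trajectory(0.3, 60)
b = logistic_trajectory(0.3 + 1e-10, 60)
gaps = [abs(x - y) for x, y in zip(a, b)]

for step in (0, 10, 20, 30, 40, 50, 60):
    print(f"step {step:2d}: gap = {gaps[step]:.3e}")
```

The gap roughly doubles each step on average, so within a few dozen steps the two trajectories bear no visible relation to each other: whatever the initial state “knew” to the tenth decimal place is no longer recoverable from a later state known only to finite precision.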
“‘Partial credit’ for getting close to the right answer”: this is where probability theory comes into play, as in “What is the likelihood of the enemy reading the message if it’s intercepted in full?”
Also, the concept which helps me think about how to bring abstract concepts into reality is time, or any notion of temporal dependencies, which probability theory largely missed for a while and is now recovering: see the Nature article.