I would guess that AI systems will become more interpretable in the future, as they start using the features / concepts / abstractions that humans are using.
This sort of reasoning seems to assume that abstraction space is 1-dimensional, so AI must use human concepts on the path from subhuman to superhuman. I disagree. Like most things we don't have strong reason to think are 1D, and which take many bits of info to describe, abstractions seem high-dimensional. So on the path from subhuman to superhuman, the AI must use abstractions that are as predictively useful as human abstractions. These will not be anything like human abstractions unless the system was designed from a detailed neurological model of humans. Any AI that humans can reason about using our inbuilt empathetic reasoning is basically a mind upload, or a mind that differs from a human less than humans differ from each other. This is not what ML will create. Human understanding of AI systems will have to be by abstract mathematical reasoning, the way we understand formal maths. Empathetic reasoning about human-level AI is just asking for anthropomorphism. Our 3 options are:
1) An AI we don't understand.
2) An AI we can reason about in terms of maths.
3) A virtual human.
While I might agree with the three options at the bottom, I don’t agree with the reasoning to get there.
Abstractions are pretty heavily determined by the territory. Humans didn't look at the world and pick out "tree" as an abstract concept because of a bunch of human-specific factors. "Tree" is a recurring pattern on earth, and even aliens would notice that same cluster of things, assuming they paid attention. Even on the empathic front, you don't need a human-like mind in order to notice the common patterns of human behavior (in humans) which we call "anger" or "sadness".
+1, that's my response as well.
Some abstractions are heavily determined by the territory. The concept of trees is pretty heavily determined by the territory. Whereas the concept of betrayal is determined by the way that human minds function, which is determined by other people’s abstractions. So while it seems reasonably likely to me that an AI “naturally thinks” in terms of the same low-level abstractions as humans, it thinking in terms of human high-level abstractions seems much less likely, absent some type of safety intervention. Which is particularly important because most of the key human values are very high-level abstractions.
My guess is that if you have to deal with humans, as at least early AI systems will have to do, then abstractions like “betrayal” are heavily determined.
I agree that if you don’t have to deal with humans, then things like “betrayal” may not arise; similarly if you don’t have to deal with Earth, then “trees” are not heavily determined abstractions.
Neural nets have around human performance on Imagenet.
If abstraction were a feature of the territory, I would expect the failure cases to be similar to human failure cases. Looking at https://github.com/hendrycks/natural-adv-examples, this does not seem to be the case very strongly, but then again, some of them show a dark shiny stone being classified as a sea lion. The failures aren't totally inhuman, the way they are with adversarial examples.
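To make that comparison concrete, here is a minimal sketch (not from the original discussion; the local `imagenet-a/` path and the choice of ResNet-50 are assumptions) of running a pretrained ImageNet classifier over the natural adversarial examples from that repo and eyeballing whether its failures look like mistakes a human could plausibly make:

```python
# Sketch: run a pretrained classifier over the ImageNet-A images
# (downloaded from hendrycks/natural-adv-examples) and print its top
# guesses, to compare the failure cases against human intuition.
import torch
from torchvision import models, transforms, datasets

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

model = models.resnet50(pretrained=True).eval()
dataset = datasets.ImageFolder("imagenet-a", transform=preprocess)  # path is an assumption
loader = torch.utils.data.DataLoader(dataset, batch_size=32)

with torch.no_grad():
    for images, _ in loader:
        top5 = model(images).topk(5, dim=1).indices
        print(top5)  # compare by hand against the folder's WordNet IDs
        break
```

Note that ImageNet-A only covers a 200-class subset of ImageNet, so mapping the predicted indices back to the folder labels takes a little extra bookkeeping.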
Humans didn’t look at the world and pick out “tree” as an abstract concept because of a bunch of human-specific factors.
I am not saying that trees aren't a cluster in thingspace. What I am saying is that if there were many clusters in thingspace that were as tight and predictively useful as "Tree", but were not possible for humans to conceptualize, we wouldn't know it. There are plenty of concepts that humans didn't develop for most of human history, despite those concepts being predictively useful, until an odd genius came along or the concept was pinned down by massive experimental evidence. E.g. inclusive genetic fitness, entropy, etc.
Consider that evolution optimized us in an environment that contained trees, and in which predicting them was useful. So it would be more surprising for there to be a concept that is useful in the ancestral environment that we can't understand than for there to be a concept we can't understand in a non-ancestral domain.
This looks like a map that is heavily determined by the territory, but human maps contain rivers and not geological rock formations; there could be features of the territory that could be mapped but that humans don't map.
If you believe the post that
Eventually, sufficiently intelligent AI systems will probably find even better concepts that are alien to us,
Then you can form an equally good, nonhuman concept by taking the better alien concept and adding random noise. Of course, an AI trained on text might share our concepts just because our concepts are the most predictively useful way to model our writing. I would also like to assign some probability to AI systems that don't use anything recognizable as a concept. You might be able to say 90% of blue objects are egg-shaped, 95% of cubes are red … 80% of furred objects that glow in the dark are flexible … without ever splitting objects into bleggs and rubes. Seen from this perspective, you have a density function over thingspace, and a sum of clusters might not be the best way to describe it. AIXI never talks about trees; it just simulates every quantum. Maybe there are fast algorithms that don't even ascribe discrete concepts.
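As a toy illustration of the density-function view (everything here, from the feature names to the thresholds, is invented for the example): you can read conditional regularities like "most blue objects are egg-shaped" straight off a fitted density, without ever assigning objects to bleggs, rubes, or any other discrete cluster.

```python
# Sketch: a density over a 2D "thingspace" from which conditional
# statistics can be estimated with no cluster labels anywhere.
import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)
# two made-up continuous axes: blueness and egg-shapedness
blueness = rng.uniform(0, 1, 2000)
egg_shapedness = np.clip(0.9 * blueness + rng.normal(0, 0.2, 2000), 0, 1)
X = np.column_stack([blueness, egg_shapedness])

kde = KernelDensity(bandwidth=0.05).fit(X)

# estimate P(egg-shaped | blue) by sampling from the fitted density;
# the 0.7 thresholds are an arbitrary operationalisation
samples = kde.sample(50_000, random_state=0)
blue = samples[:, 0] > 0.7
egg = samples[:, 1] > 0.7
print("P(egg-shaped | blue) ~", (blue & egg).sum() / blue.sum())
```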
Neural nets have around human performance on Imagenet.
But those trained neural nets are very subhuman on other image understanding tasks.
Then you can form an equally good, nonhuman concept by taking the better alien concept and adding random noise.
I would expect that the alien concepts are something we haven’t figured out because we don’t have enough data or compute or logic or some other resource, and that constraint will also apply to the AI. If you take that concept and “add random noise” (which I don’t really understand), it would presumably still require the same amount of resources, and so the AI still won’t find it.
For the rest of your comment, I agree that we can't theoretically rule those scenarios out, but there's no theoretical reason to rule them in either. So far the empirical evidence seems to me to be in favor of "abstractions are determined by the territory", e.g. ImageNet neural nets seem to have human-interpretable low-level abstractions (edge detectors, curve detectors, color detectors), while having strange high-level abstractions; I claim that the strange high-level abstractions are bad, and only work on ImageNet because they were specifically designed to do so and ImageNet is sufficiently narrow that you can get to good performance with bad abstractions.
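For what it's worth, the interpretable low-level abstractions are easy to check directly; here is a quick sketch (the model choice and plotting details are just one way to do it) that pulls the first-layer convolution filters out of a pretrained network, where the familiar edge and color detectors show up:

```python
# Sketch: visualize the 64 first-layer 7x7 RGB filters of a pretrained
# ResNet-50, which typically look like oriented edge and color detectors.
import matplotlib.pyplot as plt
from torchvision import models

model = models.resnet50(pretrained=True)
filters = model.conv1.weight.detach()          # shape (64, 3, 7, 7)
filters = (filters - filters.min()) / (filters.max() - filters.min())

fig, axes = plt.subplots(8, 8, figsize=(8, 8))
for ax, f in zip(axes.flat, filters):
    ax.imshow(f.permute(1, 2, 0))              # show each filter as a tiny RGB image
    ax.axis("off")
plt.show()
```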
By adding random noise, I meant adding wiggles to the edge of the set in thingspace. For example, adding noise to "bird" might exclude "ostrich" and include "duck-billed platypus".
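A toy picture of what I mean (the 2D thingspace and the particular wiggle function are made up purely for illustration): perturbing the boundary of a concept's region keeps it roughly the same size and tightness while changing which edge cases fall inside it.

```python
# Sketch: a clean concept region in a toy 2D thingspace, and a "noisy"
# variant whose boundary wiggles with direction, so some members drop
# out and some non-members get pulled in.
import numpy as np

rng = np.random.default_rng(1)
points = rng.uniform(-2, 2, size=(10_000, 2))    # toy thingspace

def original_concept(p):
    return np.linalg.norm(p, axis=1) < 1.0        # a clean round cluster

def noisy_concept(p, wiggle=0.3):
    angle = np.arctan2(p[:, 1], p[:, 0])
    # radius threshold now depends on direction: same rough extent,
    # different boundary
    return np.linalg.norm(p, axis=1) < 1.0 + wiggle * np.sin(7 * angle)

a, b = original_concept(points), noisy_concept(points)
print("members kept:", (a & b).sum(),
      "excluded:", (a & ~b).sum(),
      "newly included:", (~a & b).sum())
```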
I agree that the high-level ImageNet concepts are bad in this sense; however, are they just bad? If they were just bad, and the limit to finding good concepts was data or some other resource, then we should expect small children and mentally impaired people to have similarly bad concepts. That would suggest a single gradient from better to worse. If, however, current neural networks used concepts substantially different from those of small children, and not just uniformly worse or uniformly better, that would show that different sets of concepts exist at the same (low) level. This would be fairly strong evidence that multiple sets of concepts also exist at the smart-human level.
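One crude way to operationalise that comparison (this is a sketch with stand-in models, not the real child-vs-network experiment): if one concept set were just a noisier version of the other, the two confusion matrices should share their off-diagonal structure; systematically different error patterns would point to genuinely different concepts.

```python
# Sketch: compare the error structure of two differently-biased
# classifiers on the same task. High correlation of off-diagonal
# confusion entries = similar mistakes (one gradient from better to
# worse); low correlation = differently structured concepts.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)

cm_a = confusion_matrix(yte, LogisticRegression(max_iter=2000).fit(Xtr, ytr).predict(Xte))
cm_b = confusion_matrix(yte, KNeighborsClassifier(3).fit(Xtr, ytr).predict(Xte))

off = ~np.eye(10, dtype=bool)                  # look only at the mistakes
print(np.corrcoef(cm_a[off], cm_b[off])[0, 1])
```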
I would also want to point out that even a small fraction of the concepts being different would be enough to make alignment much harder. Even if there were a perfect scale, if 1/3 of the concepts are subhuman, 1/3 human-level and 1/3 superhuman, it would be hard to understand the system. To get any safety, you need to get your system very close to human concepts. And you need to be confident that you have hit this target.