OOD misgeneralisation is unlikely to be a direct x-risk from superintelligence
Overall, I think the issue of causal confusion and OOD misgeneralisation is much more about capabilities than about alignment, especially if we are talking about the long-term x-risk from superintelligent AI, rather than short/mid-term AI risk.
OOD misgeneralisation is absolutely inevitable, due to Gödel’s incompleteness of the universe and the fact that all the systems that evolve on Earth generally climb up in complexity. Whenever there is a new invention, such as money, the internet, or (future) autonomous AI agents, civilisation as a whole becomes more complex, and the distributions of many variables change. (“Towards a Theory of Evolution as Multilevel Learning” is my primary source of intuition about this.) In the study of complex systems, there is a postulate that each component (subsystem) is ignorant of the behaviour of the system as a whole and doesn’t know the full effect of its actions. This applies to any component, no matter how intelligent. Humans misgeneralise all the time (examples: lead in petrol, the creation of addictive apps such as Instagram, etc.). Superintelligence will misgeneralise, too, though perhaps in ways which are very subtle or even incomprehensible to humans.
Then, it’s possible that superintelligence will misgeneralise due to causal confusion on some matter which is critical to humans’ survival/flourishing, e.g. something like qualia, human consciousness, and their moral value. And although I don’t think this is a negligible risk, precisely because superintelligence probably won’t have direct experience of or access to human consciousness, I feel this exact failure mode is somewhat minor compared to all the other reasons for which superintelligence might kill us. In any case, I don’t see what we can do about this, if the problem is indeed that superintelligence will not have first-hand experience of human consciousness.
Alignment consequences: 2) If the model is causally confused about objects related to its goals or incentives, then it might competently pursue changes in the environment that either don’t actually result in the reward function used for training being optimised (objective misgeneralisation).
Did you use the term “objective misgeneralisation” rather than “goal misgeneralisation” on purpose? “Objective” and “goal” are synonyms, but “objective misgeneralisation” is hardly used, “goal misgeneralisation” is the standard term.
Also, I think it’s worth noting that this distinction between capabilities and goal misgeneralisation is defined within the RL framework. In other frameworks, such as Active Inference, these are the same thing, because there is no ontological distinction between reward and belief.
It might be suspected that OOD generalisation can be tackled in the scaling paradigm by using diverse enough training data, for example, including data sampled from every possible test environment. Here, we present a simple argument that this is not the case, loosely adapted from Remark 1 from Krueger et al. REx:
The reason data diversity isn’t enough comes down to concept shift (change in P(Y|X)). Such changes can be induced by changes in unobserved causal factors, Z. Returning to the ice cream (Y), shorts (X), and sun (Z) example, shorts are a very reliable predictor of ice cream when it is sunny, but not otherwise. Putting numbers on this, let’s say P(Y=1|X=1,Z=1)=90%, P(Y=1|X=1,Z=0)=25%. Since the model doesn’t observe Z, there is not a single setting of P(Y=1|X=1) that will work reliably across different environments with different climates (different P(Z)). Instead,
P(Y=1|X=1) = P(Y=1|X=1,Z=1)·P(Z=1) + P(Y=1|X=1,Z=0)·P(Z=0)
depends on P(Z), which in turn depends on the climate in the locations where the data was collected. In this setting, to ensure a model trained with ERM can make good predictions in a new “target” location, you would have to ensure that that location is as sunny as the average training location, so that P(Z=1) is the same at training and test time. It is not enough to include data from the target location in the training set, even in the limit of infinite training data—including data from other locations changes the overall P(Z=1) of the training distribution. This means that without domain/environment labels (which would allow you to have different P(Y=1|X=1) for different environments, even if you can’t observe Z), ERM can never learn a non-causally confused model.
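The quoted argument can be checked numerically. In this sketch, the conditionals P(Y=1|X=1,Z=1)=0.90 and P(Y=1|X=1,Z=0)=0.25 come from the text, while the two climate mixes (an 80%-sunny training mix versus a 20%-sunny target) are my own illustrative numbers:

```python
# Shorts (X), ice cream (Y), sun (Z) example from the quoted passage.
# The conditionals given Z are fixed, but the marginal P(Y=1|X=1) that
# ERM ends up fitting depends on the climate P(Z=1) of the data mix.

def p_y_given_x(p_sunny):
    """Marginal P(Y=1|X=1) after marginalising out the unobserved Z."""
    return 0.90 * p_sunny + 0.25 * (1 - p_sunny)

# A mostly-sunny training mix and a mostly-cloudy target location give
# different answers, so no single P(Y=1|X=1) is reliable across climates.
print(p_y_given_x(0.8))  # ≈ 0.77
print(p_y_given_x(0.2))  # ≈ 0.38
```

Including more target-location data only shifts the blended P(Z=1); it never removes the dependence on it, which is the point of the quoted argument.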
Maybe I’m missing something obvious, but this argument looks wrong to me; or it assumes that the learning algorithm is not allowed to discover additional (conceptual, abstract, hidden, implicit) variables in the training data, which is false for deep neural networks (though true for random forests). A deep neural network can discover variables that are not present in the data but are probable confounders of several other variables, such as “something that is a confounder of shorts, sunscreen, and ice cream”.
Discovering such hidden confounders doesn’t give interventional capacity: Mendel discovered genetic inheritance factors, but without observing them, he couldn’t intervene on them. Only the discovery of DNA and later the invention of gene editing technology allowed intervention on genetic factors.
One can say that discovering hidden confounders merely extends what should be considered the in-distribution environment. But then, what is OOD generalisation, anyway? And can’t we prove that ERM (or any other training method whatsoever) will create models which will sometimes fail, simply because there is Gödel’s incompleteness in the universe?
While this model might not make very good predictions, it will correctly predict that getting you to put on shorts is not an effective way of getting you to want ice cream, and thus will be a more reliable guide for decision-making (about whether to wear shorts).
I don’t understand the italicised part of this sentence. Why will P(shorts, ice cream) be a reliable guide to decision-making?
(a, ii =>)
What do these symbols in parens before the claims mean?
Overall, I think the issue of causal confusion and OOD misgeneralisation is much more about capabilities than about alignment, especially if we are talking about the long-term x-risk from superintelligent AI, rather than short/mid-term AI risk.
The main argument of the post isn’t “ASI/AGI may be causally confused; what are the consequences of that?” but rather “Scaling up static pretraining may result in causally confused models, which hence probably wouldn’t be considered ASI/AGI”. In practice, if we get AGI/ASI, then almost by definition I’d think it’s not causally confused.
OOD misgeneralisation is absolutely inevitable, due to Gödel’s incompleteness of the universe and the fact that all the systems that evolve on Earth generally climb up in complexity
In a theoretical sense this may be true (I’m not really familiar with the argument), but in practice OOD misgeneralisation is probably a spectrum, and models can be more or less causally confused about how the world works. We’re arguing here that static training, even when scaled up, plausibly doesn’t lead to a model that isn’t causally confused about a lot of how the world works.
Did you use the term “objective misgeneralisation” rather than “goal misgeneralisation” on purpose? “Objective” and “goal” are synonyms, but “objective misgeneralisation” is hardly used, “goal misgeneralisation” is the standard term.
Maybe I miss something obvious, but this argument looks wrong to me, or it assumes that the learning algorithm is not allowed to discover additional (conceptual, abstract, hidden, implicit) variables in the training data, but this is false for deep neural networks
Given that the model is trained statically, while it could hypothesise about additional variables of the kinds you listed, it can never know which variables, or which values for those variables, are correct without domain labels or interventional data. Specifically, while “Discovering such hidden confounders doesn’t give interventional capacity” is true, to discover those confounders Mendel needed interventional capacity (his breeding experiments).
I don’t understand the italicised part of this sentence. Why will P(shorts, ice cream) be a reliable guide to decision-making?
We’re not saying that P(shorts, ice cream) is good for decision-making, but P(shorts, do(ice cream)) is useful insofar as the goal is to make someone wear shorts and providing ice cream is one of the possible actions (as the causal model will demonstrate that providing ice cream isn’t useful for making someone wear shorts).
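The observational/interventional gap here can be made concrete with a toy structural causal model. The structure (sun causes both shorts and ice cream; ice cream has no causal effect on shorts) follows the post’s example, but all the numbers below are my own illustrative choices:

```python
# Toy SCM: Z (sun) causes both X (shorts) and Y (ice cream); Y does not
# cause X. Conditioning on Y raises P(X=1), but intervening on Y doesn't.
P_Z1 = 0.5
P_X1_given_Z = {1: 0.8, 0: 0.1}   # shorts are likely when sunny
P_Y1_given_Z = {1: 0.9, 0: 0.25}  # ice cream is likely when sunny

def joint(x, y, z):
    """P(X=x, Y=y, Z=z) under the SCM's factorisation."""
    pz = P_Z1 if z == 1 else 1 - P_Z1
    px = P_X1_given_Z[z] if x == 1 else 1 - P_X1_given_Z[z]
    py = P_Y1_given_Z[z] if y == 1 else 1 - P_Y1_given_Z[z]
    return pz * px * py

# Observational: seeing ice cream makes shorts more likely (confounding).
p_x1_y1 = sum(joint(1, 1, z) for z in (0, 1))
p_y1 = sum(joint(x, 1, z) for x in (0, 1) for z in (0, 1))
obs = p_x1_y1 / p_y1

# Interventional: do(Y=1) cuts the Z -> Y edge, so P(X=1|do(Y=1)) = P(X=1).
do = sum(joint(1, y, z) for y in (0, 1) for z in (0, 1))

print(obs, do)  # obs > do: providing ice cream doesn't cause shorts
```

The causal model reports the lower interventional number, which is exactly why it is the more reliable guide for choosing actions, even though its observational predictions are no better.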
What do these symbols in parens before the claims mean?
They are meant to be referring to the previous parts of the argument, but I’ve just realised that this hasn’t worked as the labels aren’t correct. I’ll fix that.
Did you use the term “objective misgeneralisation” rather than “goal misgeneralisation” on purpose?

No particular reason; I’ll edit the post to use “goal misgeneralisation”. It is the standard term, but it hasn’t been so for very long (see e.g. this tweet: https://twitter.com/DavidSKrueger/status/1540303276800983041).