(Thanks to Robert for talking with me about my initial thoughts.) Here are a few potential follow-up directions:
I. (Safety) Relevant examples of Z
To build intuition on whether unobserved location tags lead to problematic misgeneralization, it would be useful to have some examples. In particular, I want to know: should we think of there being many independent, local Z_i, or a single dataset-wide Z? The former case seems much less concerning, as it seems less likely to lead to the adoption of a problematically mistaken ontology.
Here are a couple of examples I came up with: in the NL case, the URL that the text was drawn from; in the code generation case, hardware constraints, such as RAM limits. I don’t see why, a priori, either of these should cause safety problems rather than merely capabilities problems. I’d be curious to hear arguments here, as well as alternative examples that seem more safety-relevant. (Note that both of these examples seem like dataset-wide Z.)
II. Causal identifiability, and the testability of confoundedness
As Owain’s comment thread mentioned, models may be instrumentally incentivized to do causal analysis, e.g. by using human explanations of causality. However, even given an understanding of formal methods in causal inference, the model may not have the relevant data at hand. Intuitively, I’d expect that there is usually no deconfounding adjustment set observable in the data[1]. More weakly, one might hope that causal uncertainty is at least modellable from the data. As far as I know, it’s generally not possible to rule out the existence of unobserved confounders from observational data, but there might be assumptions relevant to the LM case which allow for estimation of confoundedness.
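To make the adjustment-set point concrete, here is a toy simulation (my own construction, not from the post): a hidden confounder Z induces an X–Y association, and only someone who observes Z can perform the back-door adjustment that reveals the true (null) effect. From the (X, Y) joint distribution alone, this data is indistinguishable from a genuine causal effect of X on Y.

```python
# Toy confounding demo: Z (e.g. "sunny day") causes both X ("shorts")
# and Y ("ice cream"); X has no causal effect on Y.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

z = rng.binomial(1, 0.5, n)         # unobserved confounder
x = rng.binomial(1, 0.2 + 0.6 * z)  # caused by Z only
y = rng.binomial(1, 0.2 + 0.6 * z)  # caused by Z only

# Naive observational estimate: P(Y=1 | X=1) - P(Y=1 | X=0)
naive = y[x == 1].mean() - y[x == 0].mean()

# Back-door adjustment, possible only because we (the simulators) see Z:
# average stratum-specific differences, weighted by P(Z=v)
adjusted = sum(
    (y[(x == 1) & (z == v)].mean() - y[(x == 0) & (z == v)].mean()) * (z == v).mean()
    for v in (0, 1)
)

print(f"naive estimate:    {naive:+.3f}")    # substantially positive
print(f"adjusted estimate: {adjusted:+.3f}")  # near zero: the true effect
```

The point of the sketch is that the adjustment step requires Z (or a valid adjustment set) to be observable; no reanalysis of X and Y alone recovers it.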
III. Existence of malign generalizations
The strongest and most safety-relevant implication claimed is “(3) [models] reason with human concepts. We believe the issues we present here are likely to prevent (3)”. The arguments in this post increase my uncertainty on this point, but I still think there are good a priori reasons to be skeptical of this implication. In particular, it seems like we should expect various causal confusions to emerge, and it seems likely that these will be orthogonal in some sense, such that as models scale they cancel and the model converges to causally-valid generalizations. If we assume models are doing compression, we can put this another way: causal confusions yield shallow patterns (low compression), and as models scale they do better compression. As compression increases, the number of possible strategies achieving that level of compression decreases, but the true causal structure remains in the set of strategies. Hence, we should expect causal confusion-based shallow patterns to be discarded. To cash this out in terms of a simple example: even though data regarding the sun’s effect mediating the shorts<>ice cream connection is not observed, more and more data regarding shorts, ice cream, and the sun is still being compressed. In the limit, the shorts>ice cream pathway incurs a problematic compression cost, which causes this hypothesis to be discarded.
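The compression argument can be made concrete with a toy simulation (my framing, not the post’s): under log-loss, a model that predicts ice cream from the true cause (the sun) codes the data strictly more cheaply than one using the spurious correlate (shorts), and the total code-length gap grows roughly linearly with dataset size.

```python
# Toy compression-cost comparison between a causal and a spurious predictor.
import numpy as np

rng = np.random.default_rng(1)
n = 50_000
sun = rng.binomial(1, 0.5, n)
shorts = rng.binomial(1, 0.2 + 0.6 * sun)
ice = rng.binomial(1, 0.2 + 0.6 * sun)

def bits(y, p):
    """Total code length in bits for labels y under predicted probabilities p."""
    p = np.clip(p, 1e-9, 1 - 1e-9)
    return -(y * np.log2(p) + (1 - y) * np.log2(1 - p)).sum()

def cond_probs(pred):
    """Each 'model' predicts P(ice=1 | predictor) via the empirical conditionals."""
    return np.where(pred == 1, ice[pred == 1].mean(), ice[pred == 0].mean())

gap = bits(ice, cond_probs(shorts)) - bits(ice, cond_probs(sun))
print(f"extra bits paid by the shorts-based model: {gap:.0f}")
```

The per-sample gap here is a fixed positive number of bits (roughly the difference in conditional entropies), so the total cost of the spurious hypothesis grows without bound as data accumulates, which is the sense in which it is eventually discarded.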
High uncertainty. One relevant thought experiment is to consider adjustment sets of the unobserved variable Z=IsReddit. Perhaps there exists some subset of the dataset where Z=IsReddit is observable, and the model learns a sub-model which gives calibrated estimates of how likely the remaining text is to be derived from Reddit.