The assumptions of stationarity and ergodicity are natural to make, but I wonder if they hide much of the difficulty of achieving indistinguishability. Suppose we think of each text sequence as the second part of a longer sequence whose first part consists of whatever non-text world events preceded the text (or even more text data that was dropped from the context). I’d guess a formalization of this would violate stationarity or ergodicity. My point here is a version of the general causal confusion / hallucination points made previously, e.g. here.
This is, of course, fixable by modifying the training process, but I think it is worth flagging that the stationarity and ergodicity assumptions are not arbitrary with respect to scaling. They are assumptions which likely bias the model towards shorter timelines. Adding more of my own inside view, I think this point is evidence for code and math scaling accelerating ahead of other domains. In general, any domain where the training process can be cheaply modified to let models take causal actions (which deconfound/de-hallucinate) should be expected to progress faster than other domains.
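To make the ergodicity worry concrete, here is a toy sketch rather than a formalization of the setup under discussion. It assumes a hypothetical hidden "prefix" that reduces to a single latent coin bias (0.2 or 0.8), fixed before the visible sequence starts and never observed. The names `sample_sequence` and `time_average`, and all the numbers, are illustrative assumptions of mine. The resulting process is stationary but not ergodic: the long-run average within any one sequence settles near 0.2 or 0.8, never near the across-sequence average of about 0.5.

```python
import random


def sample_sequence(length, rng):
    """Draw one visible sequence whose statistics depend on a hidden prefix."""
    # Hidden prefix (assumption): a latent world state chosen once, before the
    # visible tokens begin, and never shown to whoever observes the tokens.
    bias = rng.choice([0.2, 0.8])
    tokens = [1 if rng.random() < bias else 0 for _ in range(length)]
    return bias, tokens


def time_average(tokens):
    """Average over positions within a single sequence."""
    return sum(tokens) / len(tokens)


rng = random.Random(0)  # fixed seed so the sketch is reproducible
runs = [sample_sequence(5_000, rng) for _ in range(200)]

# Time averages: one number per sequence, computed along the sequence.
time_avgs = [time_average(tokens) for _, tokens in runs]

# Ensemble average: the mean over many independent sequences.
ensemble_avg = sum(time_avgs) / len(time_avgs)

print("first few time averages:", [round(t, 3) for t in time_avgs[:5]])
print("ensemble average:", round(ensemble_avg, 3))
```

No amount of extra length in a single sequence recovers the ensemble statistics here; only observing the hidden variable, or intervening on it, would.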