Transformer language models are doing something more general

Epistemic status: speculative.

First, a few paper titles:

The gist of the first three studies is that transformers (specifically) trained on natural language (specifically) generalize better than expected, with little or no fine-tuning, not only to unseen tasks but even to unseen and apparently unrelated modalities like offline reinforcement learning. The last study takes this a step further: it doesn't pretrain on language at all, but instead tries to reproduce, via various sampling procedures over an image classification dataset, the specific statistical properties of natural language that appear to drive this behavior.
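To make that last approach concrete, here is a minimal sketch of the kind of sampling procedure described, not the paper's actual pipeline: draw image-class labels with a Zipfian marginal and repeat ("burst") a few classes within each context, so the label stream looks statistically more like words in a document than like i.i.d. draws. The function names and parameters below are mine.

```python
import numpy as np

rng = np.random.default_rng(0)

def zipfian_class_distribution(num_classes: int, alpha: float = 1.0) -> np.ndarray:
    """Class-marginal probabilities following a Zipf-like power law,
    p(rank r) proportional to 1 / r**alpha, as in natural-language word frequencies."""
    ranks = np.arange(1, num_classes + 1)
    weights = 1.0 / ranks**alpha
    return weights / weights.sum()

def sample_bursty_context(num_classes: int, context_len: int,
                          alpha: float = 1.0, burst_size: int = 3) -> np.ndarray:
    """Sample a sequence of class labels whose marginal is Zipfian and which
    repeats ("bursts") a few classes within the context, mimicking how words
    recur within a document rather than being drawn i.i.d."""
    p = zipfian_class_distribution(num_classes, alpha)
    num_bursts = -(-context_len // burst_size)          # ceiling division
    burst_classes = rng.choice(num_classes, size=num_bursts, p=p)
    sequence = np.repeat(burst_classes, burst_size)     # each chosen class recurs
    return rng.permutation(sequence)[:context_len]

# Example: 1,000 image classes, a context of 16 (image, label) pairs.
labels = sample_bursty_context(num_classes=1000, context_len=16)
print(labels)  # a few classes appear several times; most classes are rare overall
```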

What distinguishes these results from the plethora of text-to-text transformer multitask/transfer learning results that have come out since GPT-1 is that transfer to a new modality requires priors general enough to apply to both text and that modality. This implies, first of all, that such priors exist, which has updated me toward the following hypotheses:

  • Well-trained transformers, regardless of task, occupy the same relatively small subspace of parameter space (a crude way to probe this is sketched after the list)

  • Most of the gradient-descent steps of a training run from scratch are spent just getting to this subspace; relatively few are spent learning the specific task
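As promised above, here is one crude way one might probe the first hypothesis, assuming you have several same-architecture checkpoints trained on different tasks; the checkpoint names in the usage comment are hypothetical, and raw cosine similarity of flattened parameters is only a rough proxy for "occupying the same subspace":

```python
import numpy as np

def flatten_params(state_dict: dict) -> np.ndarray:
    """Concatenate every weight tensor of a checkpoint (dict of arrays) into one vector."""
    return np.concatenate([np.asarray(v).ravel() for v in state_dict.values()])

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def subspace_probe(task_checkpoints: dict, random_init: dict) -> None:
    """Compare pairwise similarities between task-specific checkpoints with their
    similarity to a fresh random initialization. If the shared-subspace hypothesis
    holds, trained models should be much closer to each other than to the init."""
    vecs = {name: flatten_params(sd) for name, sd in task_checkpoints.items()}
    init = flatten_params(random_init)
    names = list(vecs)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            print(f"{a} vs {b}: cos = {cosine(vecs[a], vecs[b]):.3f}")
        print(f"{a} vs random init: cos = {cosine(vecs[a], init):.3f}")

# Hypothetical usage, assuming checkpoints loaded as dicts of numpy arrays:
# subspace_probe({"lm": lm_params, "offline_rl": rl_params, "vision": vit_params},
#                random_init=init_params)
```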

Taken together, these hypotheses seem to imply that within today’s gigantic and notoriously data-hungry language models is a sparser, far more efficient architecture trying to get out. I don’t have any idea what this architecture looks like. If I did, I wouldn’t post about it here. I am quite confident that it exists, because human children manage to acquire language without ingesting the equivalent of terabytes of text. I’m even reasonably confident that it’s simple, because the human genome doesn’t have enough space to code for complex mental priors (also, the evidence seems to point to the neocortex being fairly uniform), and because whatever “universal grammar” pretrained transformers are learning, it has to be fundamental enough to apply to domains as unlike language as offline reinforcement learning.
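For a rough sense of the gap being appealed to here, a back-of-envelope comparison; every figure below is an assumption chosen for order-of-magnitude illustration (roughly ten million words heard per year by a child, a few bytes per word or token, and a GPT-3-scale corpus of about 300 billion tokens):

```python
# Rough, order-of-magnitude comparison of a child's linguistic input with the
# pretraining corpora of large language models. All figures are assumptions
# for illustration, not measurements.

words_per_year_child = 10e6         # assumed ~10 million words heard per year
years = 10
bytes_per_word = 6                  # ~5 characters plus a space
child_bytes = words_per_year_child * years * bytes_per_word
print(f"child input ~ {child_bytes / 1e9:.2f} GB of text-equivalent")   # ~0.6 GB

gpt3_tokens = 300e9                 # GPT-3 reportedly trained on ~300B tokens
bytes_per_token = 4                 # rough BPE average
gpt3_bytes = gpt3_tokens * bytes_per_token
print(f"GPT-3 corpus ~ {gpt3_bytes / 1e12:.1f} TB of text")             # ~1.2 TB

print(f"ratio ~ {gpt3_bytes / child_bytes:,.0f}x")                      # thousands of times more
```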

Only the last of the four papers I linked above, from DeepMind, attempts to elucidate what is so special about language, and it focuses on just a few obvious statistical features of language token distributions. While several of the features it tested did improve in-context (i.e. few-shot) learning when present, the paper leaves the mechanism behind that improvement to further research.

The most obvious connection that I see here, among the relatively few papers I've read, is with Anthropic's work on In-context Learning and Induction Heads; it seems quite possible that induction heads are the missing mechanism linking the unique properties of language distributions with in-context learning. A direction for further research, for anyone interested, might be to look for a theoretical link between language-like (Zipfian, non-uniform) training data distributions and the formation of induction heads.
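To make the induction-head side of that link measurable, here is a minimal sketch of a prefix-matching-style diagnostic in the spirit of the Anthropic work, computed from a single head's attention matrix over a token sequence. This is my paraphrase of the metric, not Anthropic's implementation, and the toy attention matrix below is random rather than taken from a real model.

```python
import numpy as np

def induction_score(attn: np.ndarray, tokens: np.ndarray) -> float:
    """Crude induction-head signature for one attention head.

    attn   : (seq_len, seq_len) attention weights; attn[i, j] is the attention
             position i pays to position j (lower-triangular for a causal model).
    tokens : (seq_len,) token ids.

    For each position i, sum the attention paid to positions j such that
    tokens[j - 1] == tokens[i], i.e. to the token that followed an earlier
    occurrence of the current token: the pattern an induction head attends to.
    """
    seq_len = len(tokens)
    score = 0.0
    for i in range(1, seq_len):
        matches = np.array([j for j in range(1, i) if tokens[j - 1] == tokens[i]],
                           dtype=int)
        score += attn[i, matches].sum()
    return score / (seq_len - 1)

# Toy usage with a random stand-in "attention" matrix and a repeated sequence.
rng = np.random.default_rng(0)
tokens = np.array([5, 9, 2, 7, 5, 9, 2, 7])        # the pattern 5 9 2 7 repeats
attn = rng.random((8, 8))
attn = np.tril(attn) / np.tril(attn).sum(axis=1, keepdims=True)  # causal, row-normalized
print(f"induction score: {induction_score(attn, tokens):.3f}")
```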

I’ll end this here, as my writing has caught up with my thinking; I’ll probably write a follow-up if the discussion on this post inspires further ideas.