I hadn’t seen that Wattenberg-Viegas paper before, nice.
Daniel Murfet
Australian AI Safety Forum 2024
Yeah actually Alexander and I talked about that briefly this morning. I agree that the crux is “does this basic kind of thing work” and given that the answer appears to be “yes” we can confidently expect scale (in both pre-training and inference compute) to deliver significant gains.
I’d love to understand better how the RL training for CoT changes the representations learned during pre-training.
My observation from the inside is that size and bureaucracy in Universities has something to do with what you’re talking about, but more to do with a kind of “organisational overfitting” where small variations of the organisation’s experience that included negative outcomes are responded to by internal process that necessitates headcount (aligning the incentives for response with what you’re talking about).
I think self-repair might have lower free energy, in the sense that if you had two configurations of the weights, which “compute the same thing” but one of them has self-repair for a given behaviour and one doesn’t, then the one with self-repair will have lower free energy (which is just a way of saying that if you integrate the Bayesian posterior in a neighbourhood of both, the one with self-repair gives you a higher number, i.e. its preferred).
That intuition is based on some understanding of what controls the asymptotic (in the dataset size) behaviour of the free energy (which is -log(integral of posterior over region)) and the example in that post. But to be clear it’s just intuition. It should be possible to empirically check this somehow but it hasn’t been done.
Basically the argument is self-repair ⇒ robustness of behaviour to small variations in the weights ⇒ low local learning coefficient ⇒ low free energy ⇒ preferred
I think by “specifically” you might be asking for a mechanism which causes the self-repair to develop? I have no idea.
It’s a fascinating phenomenon. If I had to bet I would say it isn’t a coping mechanism but rather a particular manifestation of a deeper inductive bias of the learning process.
Timaeus is hiring!
In terms of more subtle predictions. In the Berkeley Primer in mid-2023, based on elementary manipulations of the free energy formula, I predicted we should see phase transitions / developmental stages where the loss stays relatively constant but the LLC (model complexity) decreases.
We noticed one such stage in the language models, and two in the linear regression transformers in the developmental landscape paper. We only partially understood them there, but we’ve seen more behaviour like this in the upcoming work I mentioned in my other post, and we feel more comfortable now linking it to phenomena like “pruning” in developmental neuroscience. This suggests some interesting connections with loss of plasticity (i.e. we see many components have LLC curves that go up, then come down, and one would predict after this decrease the components are more resistent to being changed by further training).
These are potentially consequential changes in model computation that are (in these examples) arguably not noticeable in the loss curve, and it’s not obvious to me how you would be confident to notice this from other metrics you would have thought to track (in each case they might correspond with something, like say magnitude of layer norm weights, but it’s unclear to me out of all the thousands of things you could measure why you would a priori associate any one such signal with a change in model computation unless you knew it was linked to the LLC curve). Things like the FIM trace or Hessian trace might also reflect the change. However in the second such stage in the linear regression transformer (LR4) this seems not to be the case.
I think that’s right, in the sense that this explains a large fraction of our difference in views.
I’m a mathematician, so I suppose in my cosmology we’ve already travelled 99% of the distance from the upper reaches of the theory stratosphere to the ground and the remaining distance doesn’t seem like such an obstacle, but it’s fair to say that the proof is in the pudding and the pudding has yet to arrive.
If SLT were to say nontrivial things about what instruction fine-tuning and RLHF are doing to models, and those things were verified in experiments, would that shift your skepticism?
I’ve been reading some of your other writing:
However, we think that absent substantial advances in science, we’re unlikely to develop approaches which substantially improve safety-in-practice beyond baseline methods (e.g., training with RLHF and applying coup probes) without the improvement being captured by black-box control evaluations. We might discuss and argue for this in more detail in a follow-up post.
Could you explain why you are skeptical that current baseline methods can be dramatically improved? It seems possible to me that the major shortcomings of instruction fine-tuning and RLHF (that they seem to make shallow changes to representations and computation) are not fundamental. Maybe it’s naive because I haven’t thought about this very hard, but from our point of view representations “mature” over development and become rather rigid; however, maybe there’s something like Yamanaka factors!
Even from the perspective of black-box control, it seems that as a practical matter one could extract more useful work if the thing in the box is more aligned, and thus it seems you would agree that fundamental advantages in these baseline methods would be welcome.
Incidentally, I don’t really understand what you mean by “captured by black-box control evaluations”. Was there a follow-up?
The case for singular learning theory (SLT) in AI alignment is just the case for Bayesian statistics in alignment, since SLT is a mathematical theory of Bayesian statistics (with some overly restrictive hypotheses in the classical theory removed).
At a high level the case for Bayesian statistics in alignment is that if you want to control engineering systems that are learned rather than designed, and if that learning means choosing parameters that have high probability with respect to some choice of dataset and model, then it makes sense to understand what the basic structure of that kind of Bayesian learning is (I’ll put aside the potential differences between SGD and Bayesian statistics, since these appear not to be a crux here). I claim that this basic structure is not yet well-understood, that it is nonetheless possible to make fundamental progress on understanding it at both a theoretical and empirical level, and that this understanding will be useful for alignment.
The learning process in Bayesian statistics (what Watanabe and we call the singular learning process) is fundamental, and applies not only to training neural networks, but also to fine-tuning and also to in-context learning. In short, if you expect deep learning models to be “more optimal” over time, and for example to engage in more sophisticated kinds of learning in context (which I do), then you should expect that understanding the learning process in Bayesian statistics should be even more highly relevant in the future than it is today.
One part of the case for Bayesian statistics in alignment is that many questions in alignment seem to boil down to questions about generalisation. If one is producing complex systems by training them to low loss (and perhaps also throwing out models that have low scores on some safety benchmark) then in general there will be many possible configurations with the same low loss and high safety scores. This degeneracy is the central point of SLT. The problem is: how can we determine which of the possible solutions actually realises our intent?
The problem is that our intent is either not entirely encoded in the data, or we cannot be sure that it is, so that questions of generalisation are arguably central in alignment. In present day systems, where alignment engineering looks like shaping the data distribution (e.g. instruction fine-tuning) then a precise form of this question is how models generalise from the (relatively) small number of demonstrations in the fine-tuning dataset.
It therefore seems desirable to have scalable empirical tools for reasoning about generalisation in large neural networks. The learning coefficient in SLT is the obvious theoretical quantity to investigate (in the precise sense that two solutions with the same loss will be differently preferred by the Bayesian posterior, with the one that is “simplest” i.e. has lower learning coefficient, being preferred). That is what we have been doing. One should view the empirical work Timaeus has undertaken as being an exercise in validating that learning coefficient estimation can be done at scale, and reflects real things about networks (so we study situations where we can independently verify things like developmental stages).
Naturally the plan is to take that tool and apply it to actual problems in alignment, but there’s a limit to how fast one can move and still get everything right. I think we’re moving quite fast. In the next few weeks we’ll be posting two papers to the arXiv:
G. Wang, J. Hoogland, S. van Wingerden, Z. Furman, D. Murfet “Differentiation and Specialization in Language Models via the Restricted Local Learning Coefficient” introduces the weight and data-restricted LLCs and shows that (a) attention heads in a 3M parameter transformer differentiate over training in ways that are tracked by the weight-restricted LLC, (b) some induction heads are partly specialized to code, and this is reflected in the data-restricted LLC on code-related tasks, (c) attention heads follow the pattern that their weight-restricted LLCs first increase then decrease, which appears similar to the critical periods studied by Achille-Rovere-Soatto.
L. Carroll, J. Hoogland, D. Murfet “Retreat from Ridge: Studying Algorithm Choice in Transformers using Essential Dynamics” studies the retreat from ridge phenomena following Raventós et al and resolves the mystery of apparent non-Bayesianism there, by showing that over training for an in-context linear regression problem there is tradeoff between in-context ridge regression (a simple but high error solution) and another solution more specific to the dataset (which is more complex but lower error). This gives an example of the “accuracy vs simplicity” tradeoff made quantitative by the free energy formula in SLT.
Your concerns about phase transitions (there being potentially too many of them, or this being a bit of an ill-posed framing for the learning process) are well-taken, and indeed these were raised as questions in our original post. The paper on restricted LLCs is basically our response to this.
I think you might buy the high level argument for the role of generalisation in alignment, and understand that SLT says things about generalisation, but wonder if that ever cashes out in something useful. Obviously I believe so, but I’d rather let the work speak for itself. In the next few days there will be a Manifund page explaining our upcoming projects, including applying the LLC estimation techniques we have now proven, to studying things like safety fine-tuning and deceptive alignment in the setting of the “sleeper agents” work.
One final comment. Let me call “inductive strength” the number of empirical conclusions you can draw from some kind of evidence. I claim the inductive strength of fundamental theory validated in experiments, is far greater than experiments not grounded in theory; the ML literature is littered with the corpses of one-off experiments + stories that go nowhere. In my mind this is not what a successful science and engineering practice of AI alignment looks like.
The value of the empirical work Timaeus has done to date largely lies in validating the fundamental claims made by SLT about the singular learning process, and seeing that it applies to systems like small language models. To judge that empirical work by the standard of other empirical work divorced from a deeper set of claims, i.e. purely by “the stuff that it finds”, is to miss the point (to be fair we could communicate this better, but I find it sounds antagonistic written down, as it may do here).
I think scaffolding is the wrong metaphor. Sequences of actions, observations and rewards are just more tokens to be modeled, and if I were running Google I would be busy instructing all work units to start packaging up such sequences of tokens to feed into the training runs for Gemini models. Many seemingly minor tasks (e.g. app recommendation in the Play store) either have, or could have, components of RL built into the pipeline, and could benefit from incorporating LLMs, either by putting the RL task in-context or through fine-tuning of very fast cheap models.
So when I say I don’t see a distinction between LLMs and “short term planning agents” I mean that we already know how to subsume RL tasks into next token prediction, and so there is in some technical sense already no distinction. It’s a question of how the underlying capabilities are packaged and deployed, and I think that within 6-12 months there will be many internal deployments of LLMs doing short sequences of tasks within Google. If that works, then it seems very natural to just scale up sequence length as generalisation improves.Arguably fine-tuning a next-token predictor on action, observation, reward sequences, or doing it in-context, is inferior to using algorithms like PPO. However, the advantage of knowledge transfer from the rest of the next-token predictor’s data distribution may more than compensate for this on some short-term tasks.
I think this will look a bit outdated in 6-12 months, when there is no longer a clear distinction between LLMs and short term planning agents, and the distinction between the latter and LTPAs looks like a scale difference comparable to GPT2 vs GPT3 rather than a difference in kind. At what point do you imagine a national government saying “here but no further?”.
I don’t recall what I said in the interview about your beliefs, but what I meant to say was something like what you just said in this post, apologies for missing the mark.
Mumble.
Indeed the integrals in the sparse case aren’t so bad https://arxiv.org/abs/2310.06301. I don’t think the analogy to the Thompson problem is correct, it’s similar but qualitatively different (there is a large literature on tight frames that is arguably more relevant).
Haha this is so intensely on-brand.
The kind of superficial linear extrapolation of trendlines can be powerful, perhaps more powerful than usually accepted in many political/social/futurist discussions. In many cases, succesful forecasters by betting on some high level trend lines often outpredict ‘experts’.
But it’s a very non-gears level model. I think one should be very careful about using this kind of reasoning when for tail-events.
e.g. this kind of reasoning could lead one to reject development of nuclear weapons.Agree. In some sense you have to invent all the technology before the stochastic process of technological development looks predictable to you, almost by definition. I’m not sure it is reasonable to ask general “forecasters” about questions that hinge on specific technological change. They’re not oracles.
Stagewise Development in Neural Networks
Do you mean the industry labs will take people with MSc and PhD qualifications in CS, math or physics etc and retrain them to be alignment researchers, or do you mean the labs will hire people with undergraduate degrees (or no degree) and train them internally to be alignment researchers?
I don’t know how OpenAI or Anthropic look internally, but I know a little about Google and DeepMind through friends, and I have to say the internal incentives and org structure don’t strike me as really a very natural environment for producing researchers from scratch.
It might be worth knowing that some countries are participating in the “network” without having formal AI safety institutes