Copypasting from a Slack thread:

I’ll list some work that I think is aspiring to build towards an answer to some of these questions, although lots of it is very toy:
On generalisation vs simple heuristics:
I think the nicest papers here are some toy model interp papers like progress measures for grokking and Grokked Transformers. These two papers present a pretty crisp distinction between levels of generality of different algorithms that perform similarly on the training set: in modular addition there are two levels of algorithm, and in the Grokked Transformers setting there are three. The story of what the models end up doing is pretty nuanced, and it comes down to specific details of the pre-training data mix, which maybe isn’t that surprising. (If you squint, you can sort of see predictions from Grokked Transformers being borne out in the nuances of when LLMs can do multi-hop reasoning, e.g. Yang et al.) But it seems pretty clear that if the training conditions are right, you can get increasingly general algorithms learned even when simpler ones would do the trick. (I sketch a toy version of the modular addition setup at the end of this section.)
I also think a useful idea (although less useful so far than the previous bullet) is that in certain situations, the circuits a model uses to memorise can sort of naturally become generalising circuits once it has memorised enough. The only concrete example of this that I know of (and it’s not been empirically validated) is the story from computation in superposition of how a head that memorises a lookup table of features to copy continuously becomes a copying head that generalises to new features once you make the lookup table big enough. (There’s a second toy sketch of this at the end of this section.)
These are all toy settings where we can be pretty crisp about what we mean by memorisation and generalisation. I think the picture that we’re beginning to see emerge is that what counts as memorisation and generalisation is very messy and in the weeds and context-specific, but that transformers can generalise in powerful ways if the pre-training mix is right. What “right” means, and what “generalise in powerful ways” means in situations we care about are still unsolved technical questions.
Meanwhile, I also think it’s useful to just look at very qualitatively surprising examples of frontier models generalising far, even if we can’t be precise about what memorisation and generalisation mean in that setting. Papers that I think are especially cool on this axis include emergent misalignment, anti-imitation, OOCR, LLMs are aware of their learned behaviors, and LLMs are aware of how they’re being steered (I think it’s an especially interesting and risk-relevant type of generalisation when the policy starts to know something that is only ‘shown’ to the training process). However, I think it’s quite hard to look at these papers and make predictions about future generalisation successes and failures, because we don’t have any basic understanding of how to talk about generalisation in these settings.
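To make the modular addition bullet above concrete, here’s a minimal sketch of the kind of grokking experiment those papers run. The hyperparameters and architecture are my own illustrative choices (an MLP over learned embeddings rather than the papers’ exact transformer setups), and whether/when the test-accuracy jump actually happens depends a lot on the train fraction and weight decay:

```python
# Sketch of a grokking-style modular addition experiment (illustrative
# hyperparameters; not the exact setup from the grokking papers).
import torch
import torch.nn as nn

P = 97  # modulus
pairs = torch.cartesian_prod(torch.arange(P), torch.arange(P))  # all (a, b)
labels = (pairs[:, 0] + pairs[:, 1]) % P
perm = torch.randperm(len(pairs))
train_idx, test_idx = perm[: len(pairs) // 2], perm[len(pairs) // 2 :]

model = nn.Sequential(
    nn.Embedding(P, 128),     # shared embedding for both operands
    nn.Flatten(start_dim=1),  # concatenate the two operand embeddings
    nn.Linear(256, 256), nn.ReLU(),
    nn.Linear(256, P),
)
opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1.0)
loss_fn = nn.CrossEntropyLoss()

def accuracy(idx):
    with torch.no_grad():
        return (model(pairs[idx]).argmax(-1) == labels[idx]).float().mean().item()

for step in range(20000):
    opt.zero_grad()
    loss_fn(model(pairs[train_idx]), labels[train_idx]).backward()
    opt.step()
    if step % 1000 == 0:
        # Memorisation shows up as train accuracy ~1.0 with test accuracy near
        # chance; the generalising solution arrives much later, if at all.
        print(step, accuracy(train_idx), accuracy(test_idx))
```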
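And here’s a toy numerical illustration of the memorisation-shading-into-generalisation story from the second bullet. This is my own cartoon, not from the computation in superposition paper and not a claim about real models: a matrix that memorises copying of specific features as a sum of outer products starts, once the lookup table is big enough, to copy features it has never seen.

```python
# W = sum_i f_i f_i^T "memorises" copying of the features f_i; as the number
# of memorised features n grows, W approaches (n/d) * I, i.e. a copying map
# that generalises to features it never memorised.
import numpy as np

rng = np.random.default_rng(0)
d = 256  # feature dimension

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

x_new = rng.standard_normal(d)
x_new /= np.linalg.norm(x_new)  # a feature that was never memorised

for n in [16, 256, 4096, 16384]:  # size of the memorised lookup table
    F = rng.standard_normal((n, d))
    F /= np.linalg.norm(F, axis=1, keepdims=True)
    W = F.T @ F  # sum of outer products f_i f_i^T
    print(n,
          round(cosine(W @ F[0], F[0]), 3),    # copy fidelity on a memorised feature
          round(cosine(W @ x_new, x_new), 3))  # copy fidelity on the unseen feature
```

For small n the matrix only copies its memorised features faithfully; as n grows, the unseen feature gets copied almost as well, with no discrete switch from “memorising” to “generalising”.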
On inductive biases and the speed prior:
I don’t have much to say about how useful the speed prior is at mitigating scheming, but I think there has been some interesting basic science on what prior neural network training actually implies in practice. Obviously SLT comes to mind, and some people have tried to claim that SLT suggests neural network training is actually more like the Solomonoff prior than the speed prior (e.g. Bushnaq), although I think that work is pretty shaky and may well not hold up.
I think something that’s missing from both the speed prior and the Solomonoff prior is a notion of learnability: the reason we have eyes and not cameras is not that eyes have lower K-complexity or lower Kt-complexity than cameras. It’s that there is a curriculum for learning eyes and there (probably) isn’t one for cameras; neural network training likewise requires a training story/learnability. All the work that I know of exploring this is in very toy settings (the low-hanging fruit prior, leap complexity and the sparse parity problem; there’s a toy sketch of the sparse parity point at the end of this section). I don’t think any of these results are strong enough to make detailed claims about p(deception) yet, and they don’t seem close.
OTOH, most learning in the future might be more like current in-context learning, and (very speculatively) it seems possible that in-context learning is more Bayesian (less path-dependent/learnability-dependent) than pre-training. See e.g. Riechers et al.
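As a toy illustration of the learnability point (my own setup, not taken from the cited papers): a k-sparse parity over n bits has a very short description, but SGD typically sits at chance accuracy for a long plateau, because there’s no curriculum of partial solutions that reduce the loss along the way.

```python
# k-sparse parity with online SGD: low description length, but hard to learn
# because nothing short of the full parity lowers the loss. Expect a long
# plateau near 50% accuracy; whether it escapes the plateau within this budget
# depends on the width, learning rate and batch size.
import torch
import torch.nn as nn
import torch.nn.functional as F

n, k = 40, 4                   # input bits, size of the parity subset
target_idx = torch.arange(k)   # WLOG the parity is over the first k bits

model = nn.Sequential(nn.Linear(n, 512), nn.ReLU(), nn.Linear(512, 1))
opt = torch.optim.SGD(model.parameters(), lr=0.05)

for step in range(30000):
    x = (torch.randint(0, 2, (1024, n)) * 2 - 1).float()    # fresh +/-1 inputs
    y = x[:, target_idx].prod(dim=1, keepdim=True)           # parity label in {-1, +1}
    loss = F.soft_margin_loss(model(x), y)                   # logistic loss on +/-1 labels
    opt.zero_grad()
    loss.backward()
    opt.step()
    if step % 2000 == 0:
        acc = (model(x).sign() == y).float().mean().item()
        print(step, round(loss.item(), 3), round(acc, 3))
```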
Some random thoughts on what goals powerful AIs will have more generally:
I think we’ve seen some decent evidence that a lot of training with RL makes models obsessed with completing tasks. The main evidence here comes from the reward hacking results, but I also think the Apollo anti-scheming paper is an important result about how strong/robust this phenomenon is. Despite a reasonably concerted effort to train the model to care about something that’s in tension with task completion (being honest), RLVR instilled such a strong preference/heuristic for task completion that the model still ends up wanting to complete tasks badly enough to deceive etc. in test environments, even though the deliberative alignment training process doesn’t reward task completion at all and only rewards honesty (in fact, by the end of anti-scheming training the model does the honest thing in every training environment!). I don’t think it was super obvious a priori that RL would embed a task-completion preference that strongly (overriding the human prior).
I think there are some other lessons we could glean here about the stickiness of goals in general. The anti-scheming results suggest to me that something about the quantity and diversity of the RLVR environments internalised the task-completion preference deeply enough that it was still present after a full round of training meant to disincentivise it. Contrast this with results showing that safety training is very shallow and can be destroyed easily (e.g. BadLlama).
Very speculatively, I’m excited about the growing field of teaching models fake facts and trying to work out when they actually believe the fake facts. It seems possible that some techniques and ideas that were developed to get models to internalise beliefs deeply (and to evaluate whether that worked) could be co-opted for getting models to internalise preferences/goals deeply (and evaluating whether that worked).
Pretty obvious point, but I think the existence of today’s models and the relatively slow progress to human-level intelligence tells us that, insofar as future AIs end up misaligned, their goals will pretty likely be similar to/indistinguishable from human values at a low level of detail, and it’s only when you zoom in that the values would be importantly different from humans’. Of course, this might be enough to kill us. This echoes the sense in which human values are importantly different from inclusive genetic fitness but not that different, and we do still have lots of kids etc. To spell out the idea: before the AI is smart enough to fully subvert training and guard its goals, we will have lots of ability to shape what goals it ends up with. At some point, if we fail to solve alignment, we will not be able to further refine its goals, but the goals it ends up guarding will be quite related to human goals, because they were formed by reward signals that did sort of touch on those goals. Again, maybe this is obvious to everyone, but it does seem, at least to me, to be in contrast with references to squiggles/paperclips, which I think are more plausible outcomes if you imagine a Brain In A Basement style takeoff.
Obviously SLT comes to mind, and some people have tried to claim that SLT suggests neural network training is actually more like the Solomonoff prior than the speed prior (e.g. Bushnaq), although I think that work is pretty shaky and may well not hold up.
That post is superseded by this one. It was just a sketch I wrote up mostly to clarify my own thinking; the newer post is the finished product.
It doesn’t exactly say that neural networks have Solomonoff-style priors. It depends on the NN architecture. E.g., if your architecture is polynomials, or MLPs that only get one forward pass, I do not expect them to have a prior anything like that of a compute-bounded Universal Turing Machine.
And NN training adds in additional complications. All the results I talk about are for Bayesian learning, not things like gradient descent. I agree that this changes the picture and that questions about the learnability of solutions become important. You no longer just care how much volume the solution takes up in the prior; you care how much volume each incremental building block of the solution takes up within the search space that is practically accessible to the update algorithm at that point in training.
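To make the “volume in the prior” framing concrete, here is a crude Monte Carlo sketch (entirely a toy illustration, with a made-up task and architecture): sample the parameters of a tiny network from a Gaussian prior and estimate what fraction of them already fit a small dataset. That’s a proxy for the kind of posterior mass a solution gets under Bayesian learning, and it says nothing about whether gradient descent could actually find its way to that solution.

```python
# Monte Carlo proxy for "how much volume do solutions take up in the prior":
# the fraction of parameter draws from a Gaussian prior that classify a tiny
# dataset perfectly. This is about Bayesian volume, not SGD learnability.
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((10, 3))   # tiny dataset: 10 points in 3 dimensions
y = X[:, 0] > 0                    # simple target: sign of the first input

def net(params, X):
    W1, b1, W2, b2 = params
    h = np.maximum(X @ W1 + b1, 0.0)             # one ReLU hidden layer of width 8
    return 1.0 / (1.0 + np.exp(-(h @ W2 + b2)))  # sigmoid output

hits, n_samples = 0, 100000
for _ in range(n_samples):
    params = (rng.standard_normal((3, 8)), rng.standard_normal(8),
              rng.standard_normal(8), rng.standard_normal())
    preds = net(params, X) > 0.5
    if np.all(preds == y):                       # "solution" = fits all 10 points
        hits += 1
print("estimated prior volume of solutions:", hits / n_samples)
```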