I’d be really interested in someone trying to answer the question: what updates on the a priori arguments about AI goal structures should we make as a result of empirical evidence that we’ve seen? I’d love to see a thoughtful and comprehensive discussion of this topic from someone who is both familiar with the conceptual arguments about scheming and also relevant AI safety literature (and maybe AI literature more broadly).
Maybe a good structure would be to identify, from the a priori arguments, core uncertainties like “How strong is the imitative prior?”, “How strong is the speed prior?”, and “To what extent do AIs tend to generalize versus learn narrow heuristics?”, and then tackle each. (Of course, that would only make sense if the empirical updates actually factor nicely into that structure.)
I feel like I understand this very poorly right now. I currently think the only important update that empirical evidence has given me, compared to the arguments in 2020, is that the human-imitation prior is more powerful than I expected. (Though of course it’s unclear whether this will continue, and basic points like the expected increasing importance of RL suggest that it will be less powerful over time.) But to my detriment, I don’t actually read the AI safety literature very comprehensively, and I might be missing empirical evidence that really should update me.
Copy-pasting what I wrote in a Slack thread about this:
My current take, having thought a lot about a few things in this domain, but not necessarily this specific question, is that the only dimensions where the empirical evidence feels like it was useful, besides a broad “yes, of course the problems are real, and AGI is possible, and it won’t take hundreds of years” confirmation, are the dynamics around how much you can steer and control near-human AI systems to perform human-like labor.
I think almost all the evidence for that comes from just the scaling up, and basically none of it comes from safety work (unless you count RLHF as safety work, though of course the evidence there is largely downstream of the commercialization and scaling of that technology).
I can’t think of any empirical evidence that updated me much on what superintelligent systems would do, even if they are the results of just directly scaling current systems, which is the key thing that matters.
A small domain that updated me a tiny bit, though mostly in the direction of what I already believed, is the material advantage research with stuff like LeelaOdds, which demonstrated more cleanly that you can overcome large material disadvantages with greater intelligence in at least one toy scenario. The update here was really small though. I did make a bet with one person and won that one, so presumably it was a bigger update for others.
I think a bunch of other updates for me are downstream of “AIs will have situational awareness substantially before they are even human-level competent”, which changes a lot of risk and control stories. I do think the situational awareness studies were mildly helpful for that, though most of it was IMO already pretty clear by the release of GPT-4, and the studies are just helpful for communicating that to people with less context or who use AI systems less.
Buck: What do you think we’ve learned about how much you can steer and control the AIs to perform human-like labor?
Me: It depends on what timescale. One thing that I think I updated reasonably strongly on is that we are probably not going to get systems with narrow capability profiles. The training regime we have seems to really benefit from throwing a wide range of data at it, and the capital investments to explicitly train a narrow system are too high. I remember Richard a few years ago talking about building AI systems that are exceptionally good at science and alignment, but bad at almost everything else. This seems a bunch less likely now. And then there is just a huge amount of detail on what things I expect AI to be good at and bad at, at different capability levels, based on extrapolating current progress. Some quick updates here:
Models will be better at coding than almost any other task
Models will have extremely wide-ranging knowledge in basically all fields that have plenty of writing about them
It’s pretty likely natural language will just be the central interface for working with human-level AI systems (I would have had at least some probability mass on more well-defined objectives, though I think in retrospect that was kind of dumb)
We will have multi-modal human-level AIs, but it’s reasonably likely we will have superhuman AIs in computer use and writing substantially before we have AIs that orient to the world at human reaction speeds (like, combining real-time video, control and language is happening, but happening kind of slowly)
We have different model providers, but basically all the AI systems behave the same, with their failure modes and goals and misalignment all being roughly the same. This has reasonably-big implications for hoping that you can get decorrelated supervision by using AIs from different providers.
Chains of thought will stop being monitorable soon, but they stayed monitorable for an IMO mildly surprisingly long time. This suggests there is maybe more traction on keeping chains of thought monitorable than I would have said a few months ago.
The models will just lie to you all the time and everyone is used to this, so you cannot use “the model is lying to me or clearly trying to deceive me” as any kind of fire alarm
Factored cognition seems pretty reliably non-competitive with just increasing context-lengths and doing RL (this is something I believed for a long time, but every year of the state of the art still not involving factored cognition is more evidence in this direction IMO, though I expect others to find this point kind of contentious)
Elicitation in general is very hard, at least from a consumer perspective. There are tons of capabilities that the models demonstrate in one context that are very hard to elicit in other contexts without doing your own big training run. At least in my experience LoRAs don’t really work. Maybe this will get better. (One example that informs my experience here: restoring base-model imitation behavior. Fine-tuning seems to not work great for this; you still end up with huge mode collapse and falling back to the standard RLHF-corpo-speak. Maybe this is just a finetuning skill issue.)
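To make the decorrelated-supervision point in the list above a bit more concrete, here is a minimal sketch of why correlated failure modes matter. It assumes two monitors with the same miss rate and a non-negative Pearson correlation between their misses; the function name and the numbers are purely illustrative and not measurements of any real system.

```python
def joint_miss_probability(p_miss: float, correlation: float) -> float:
    """Probability that two monitors BOTH miss the same bad action.

    Assumes each monitor misses with probability p_miss and that the two
    miss events have Pearson correlation `correlation` (0 = independent,
    1 = they always fail together). For two Bernoulli variables with equal
    mean p and correlation rho: P(both) = p**2 + rho * p * (1 - p).
    """
    if not 0.0 <= p_miss <= 1.0:
        raise ValueError("p_miss must be in [0, 1]")
    if not 0.0 <= correlation <= 1.0:
        raise ValueError("this sketch only handles correlation in [0, 1]")
    return p_miss ** 2 + correlation * p_miss * (1.0 - p_miss)


# Illustrative numbers only: a 10% miss rate per monitor.
independent = joint_miss_probability(0.10, 0.0)   # second monitor helps a lot
half_shared = joint_miss_probability(0.10, 0.5)   # partially shared failure modes
fully_shared = joint_miss_probability(0.10, 1.0)  # second monitor adds nothing
```

Under these assumptions, the benefit of adding a second provider's model as a monitor shrinks toward zero as the correlation approaches 1, which is the worry if all providers' systems fail in roughly the same ways.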
I have not invested the time to give an actual answer to your question, sorry. But off the top of my head, some tidbits that might form part of an answer if I thought about it more:
--I’ve updated towards “reward will become the optimization target” as a result of seeing examples of pretty situationally aware reward hacking in the wild. (Reported by OpenAI primarily, but it seems to be more general)
--I’ve updated towards “Yep, current alignment methods don’t work” due to the persistent sycophancy which still remains despite significant effort to train it away. Plus also the reward hacking etc.
--I’ve updated towards “The roleplay/personas ‘prior’ (perhaps this is what you mean by the imitative prior?) is stronger than I expected; it seems to be persisting to some extent even at the beginning of the RL era.” (Evidence: Grok spontaneously trying to serve its perceived masters, the Emergent Misalignment results, some of the scary demos iirc...)
I think this is a really good answer, +1 to points 1 and 3!
I’m curious to what degree you think labs have put in significant effort to train away sycophancy. I recently ran a poll of about 10 people, some of whom worked at labs, on whether labs could mostly get rid of sycophancy if they tried hard enough. While my best guess was ‘no,’ the results were split around 50-50. (Would also be curious to hear more lab people’s takes!)
I’m also curious how reading model chain-of-thought has updated you, both on the sycophancy issue and in general.
RL seems to push the CoT towards being harder to understand (e.g. CoTs containing armies of dots, as happened with GPT-5) unless this is mitigated by paraphrasers. As for CoTs containing slop, humans also have CoTs which include slop until the right idea somehow emerges.
IMO, a natural extension would be that 4o was raised on social media and, like influencers, wishes to be liked. This tendency was also reinforced by RLHF, or else RLHF led 4o to conclude that humans like sycophancy. Anyway, 4o’s ancestral environment rewarded sycophancy, and things rewarded by the ancestral environment are hard to unlike.
So a thing I’ve been trying to do is get a better notion of “What actually is it about human intelligence that lets us be the dominant species?” Like, which specific behaviors does the big box labeled “intelligence” hold? What were the actual behaviors that evolution reinforced over the course of giving us big brains? Big question, hard to know what’s the case.
I’m in the middle of “Darwin’s Unfinished Symphony”, and finding it at least intriguing as a look at how creativity / imitation are related, and how “imitation” is a complex skill that humans are nevertheless supremely good at. (“The Secret of Our Success” is another great read here of course.)
Both of these are kinda about the human imitation prior… in humans, and why that may be important. So I think if one is thinking about the human-imitation prior being powerful, it would make sense to read them as cases for why something like the human imitation prior is also powerful in humans :)
They don’t give straight answers to any questions about AI, of course, and I’d be sympathetic to the belief that they’re irrelevant or kinda a waste of time, and frankly they might be a waste of time depending on what you’re funging against. I’m not saying they answer any question; I’m saying they’re interesting. But I think they’re good reads if one is approaching from the angle of “Intelligence is what lets humans dominate the earth” and wants a particular angle on how “intelligence” is a mixed bag of some different skills, at least some of which are probably not general search and planning. So, yeah.
I’ll list some work that I think is aspiring to build towards an answer to some of these questions, although lots of it is very toy:
On generalisation vs simple heuristics:
I think the nicest papers here are some toy model interp papers like progress measures for grokking and Grokked Transformers. I think these are two papers which present a pretty crisp distinction between levels of generality of different algorithms that perform similarly on the training set. In modular addition, there are two levels of algorithm and in the Grokked transformer, there are three. The story of what the models end up doing ends up being pretty nuanced and it ends up coming down to specific details of the pre-training data mix, which maybe isn’t that surprising. (If you squint, then you can sort of see predictions in the Grokked transformer being borne out in the nuance about when LLMs can do multi-hop reasoning e.g. Yang et al.) But it seems pretty clear that if the training conditions are right, then you can get increasingly general algorithms learned even when simpler ones would do the trick.
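As a toy illustration of “different algorithms that perform similarly on the training set”, here is a minimal sketch on modular addition. It is not the setup from either paper: there is no neural network, the modulus and split are arbitrary choices, and the two “algorithms” are hand-written stand-ins for a memorising solution and a generalising one.

```python
import random

P = 13  # small modulus, chosen only for illustration


def make_split(p, train_fraction=0.5, seed=0):
    """Split all (a, b) pairs mod p into a training set and a held-out set."""
    pairs = [(a, b) for a in range(p) for b in range(p)]
    rng = random.Random(seed)
    rng.shuffle(pairs)
    cut = int(len(pairs) * train_fraction)
    return pairs[:cut], pairs[cut:]


def fit_lookup_table(train_pairs, p):
    """'Memorising' solution: store the answer for every training pair."""
    return {(a, b): (a + b) % p for a, b in train_pairs}


def predict_memoriser(table, a, b):
    # Falls back to a fixed guess (0) for pairs it never saw in training.
    return table.get((a, b), 0)


def predict_generaliser(a, b, p):
    """'Generalising' solution: actually computes addition mod p."""
    return (a + b) % p


def accuracy(predict, pairs, p):
    correct = sum(predict(a, b) == (a + b) % p for a, b in pairs)
    return correct / len(pairs)


train, test = make_split(P)
table = fit_lookup_table(train, P)

mem_train = accuracy(lambda a, b: predict_memoriser(table, a, b), train, P)
gen_train = accuracy(lambda a, b: predict_generaliser(a, b, P), train, P)
mem_test = accuracy(lambda a, b: predict_memoriser(table, a, b), test, P)
gen_test = accuracy(lambda a, b: predict_generaliser(a, b, P), test, P)
```

Both solutions are perfect on the training pairs, so training accuracy alone cannot tell them apart; only the held-out pairs separate the lookup table from the general algorithm. The interesting empirical question in the papers is which of these a trained network ends up implementing, and when.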
I also think a useful idea (although less useful so far than the previous bullet) is that in certain situations, the way a model implements memorising circuits can sort of naturally become a generalising circuit once you’ve memorised enough. The only concrete example of this that I know of (and it’s not been empirically validated) is the story from computation in superposition of how a head that memorises a lookup table of features to copy continuously becomes a copying head that generalises to new features once you make the lookup table big enough.
These are all toy settings where we can be pretty crisp about what we mean by memorisation and generalisation. I think the picture that we’re beginning to see emerge is that what counts as memorisation and generalisation is very messy and in the weeds and context-specific, but that transformers can generalise in powerful ways if the pre-training mix is right. What “right” means, and what “generalise in powerful ways” means in situations we care about are still unsolved technical questions.
Meanwhile, I also think it’s useful to just look at very qualitatively surprising examples of frontier models generalising far even if we can’t be precise about what memorisation and generalisation mean in that setting. Papers that I think are especially cool on this axis include emergent misalignment, anti-imitation, OOCR, LLMs are aware of their learned behaviors, LLMs are aware of how they’re being steered (I think it’s an especially interesting and risk-relevant type of generalisation when the policy starts to know something that is only ‘shown’ to the training process). However, I think it’s quite hard to look at these papers and make predictions about future generalisation successes and failures, because we don’t have any basic understanding of how to talk about generalisation in these settings.
On inductive biases and the speed prior:
I don’t have much to say about how useful the speed prior is at mitigating scheming, but I think there has been some interesting basic science on what the prior implied by neural network training is in practice. Obviously SLT comes to mind, and some people have tried to claim that SLT suggests that neural network training is actually more like a Solomonoff prior than the speed prior (e.g. Bushnaq), although I think that work is pretty shaky and may well not hold up.
I think something that’s missing from both the speed prior and the Solomonoff prior is a notion of learnability: the reason we have eyes and not cameras is not because eyes have lower K-complexity or lower Kt-complexity than cameras. It’s because there is a curriculum for learning eyes and there (probably) isn’t for cameras; neural network training also requires a training story/learnability. All the work that I know of exploring this is in very toy settings (low hanging fruit prior, leap complexity and the sparse parity problem). I don’t think any of these results are strong enough to make detailed claims about p(deception) yet, and they don’t seem close.
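For reference, the two complexity measures contrasted above can be written as follows. These are the standard textbook definitions rather than anything taken from the work cited here, and the notation is an assumption: U is a fixed universal Turing machine, |p| is the length of program p in bits, and t(p) is its running time. Roughly, a Solomonoff-style prior weights a hypothesis by 2^{-K}, while a speed-style prior also penalises runtime, as Levin's Kt does.

```latex
\begin{align*}
K(x)  &= \min_{p} \bigl\{\, |p| \;:\; U(p) = x \,\bigr\} \\
Kt(x) &= \min_{p} \bigl\{\, |p| + \log_2 t(p) \;:\; U(p) \text{ outputs } x \text{ within } t(p) \text{ steps} \,\bigr\}
\end{align*}
```

Neither quantity says anything about how a learner would find the minimising program step by step, which is the learnability gap the paragraph above is pointing at.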
OTOH, most learning in the future might be more like current in-context learning, and (very speculatively) it seems possible that in-context learning is more Bayesian (less path dependent/learnability dependent) than pre-training. See e.g. Riechers et al.
Some random thoughts on what goals powerful AIs will have more generally:
I think we’ve seen some decent evidence that a lot of training with RL makes models obsessed with completing tasks. I think the main evidence here comes from the reward hacking results, but I also think the Apollo anti-scheming paper is an important result about how strong/robust this phenomenon is. That work made a reasonably concerted effort to train the model to care about something that’s in tension with task completion (being honest). The deliberative alignment training process doesn’t reward task completion at all and only rewards honesty (in fact, by the end of anti-scheming training the model does the honest thing in every training environment!). Even so, the RLVR had instilled such a strong preference/heuristic for task completion that the model still ends up wanting to complete tasks enough to deceive, etc. in test environments. I don’t think it was super obvious a priori that RL would embed a task completion preference that strongly (overriding the human prior).
I think there are some other lessons that we could glean here about stickiness of goals in general. The anti-scheming results suggest to me that something about the quantity and diversity of the RLVR environments internalised the task completion preference deeply enough that it was still present after a full training round meant to disincentivize it. Contrast this with results that show safety training is very shallow and can be destroyed easily (e.g. BadLlama).
Very speculatively, I’m excited about the growing field of teaching models fake facts and trying to work out when they actually believe the fake facts. It seems possible that some techniques and ideas that were developed in order to get models to internalise beliefs deeply (and evaluating success) could be coopted for getting models to internalise preferences/goals deeply (and evaluating success).
Pretty obvious point, but I think the existence of today’s models and the relatively slow progress to human-level intelligence tells us that insofar as future AIs end up misaligned, their goals will pretty likely be similar to/indistinguishable from human values at a low level of detail, and it’s only when you zoom in that the values would be importantly different from humans’. Of course, this might be enough to kill us. This echoes the sense in which human values are importantly different from inclusive genetic fitness but not that different, and we do still have lots of kids etc. To spell out the idea: before the AI is smart enough to fully subvert training and guard its goals, we will have lots of ability to shape what goals it ends up with. At some point, if we fail to solve alignment, we will not be able to further refine its goals, but the goals it will end up guarding will be quite related to human goals because they were formed by reward signals that did sort of touch the goal. Again, maybe this is obvious to everyone, but it does seem, at least to me, to be in contrast with references to squiggles/paperclips, which I think are more feasible to end up with if you imagine a Brain In A Basement style takeoff.
Obviously SLT comes to mind, and some people have tried to claim that SLT suggests that neural network training is actually more like Solomonoff prior than the speed prior (e.g. bushnaq) although I think that work is pretty shaky and may well not hold up.
That post is superseded by this one. It was just a sketch I wrote up mostly to clarify my own thinking; the newer post is the finished product.
It doesn’t exactly say that neural networks have Solomonoff-style priors. It depends on the NN architecture. E.g., if your architecture is polynomials, or MLPs that only get one forward pass, I do not expect them to have a prior anything like that of a compute-bounded Universal Turing Machine.
And NN training adds in additional complications. All the results I talk about are for Bayesian learning, not things like gradient descent. I agree that this changes the picture and questions about the learnability of solutions become important. You no longer just care how much volume the solution takes up in the prior, you care how much volume each incremental building block of the solution takes up within the practically accessible search space of the update algorithm at that point in training.
I don’t know about 2020 exactly, but I think since 2015 (being conservative), we do have reason to make quite a major update, and that update is basically that “AGI” is much less likely to be insanely good at generalization than we thought in 2015.
Evidence is basically this: I don’t think “the scaling hypothesis” was obvious at all in 2015, and maybe not even in 2020. If it was, OpenAI could not have caught everyone with their pants down by investing early in scaling. But if people mostly weren’t expecting massive data scale-ups to be the road to AGI, what were they expecting instead? The alternative to reaching AGI by hyperscaling data is a world where we reach AGI with … not much data. I have this picture which I associate with Marcus Hutter – possibly quite unfairly – where we just find the right algorithm, teach it to play a couple of computer games and hey presto we’ve got this amazing generally intelligent machine (I’m exaggerating a little bit for effect). In this world, the “G” in AGI comes from extremely impressive and probably quite unpredictable feats of generalization, and misalignment risks are quite obviously way higher for machines like this. As a brute fact, if generalization is much less predictable, then it is harder to tell if you’ve accidentally trained your machine to take over the world when you thought you were doing something benign. A similar observation also applies to most of the specific mechanisms proposed for misalignment: surprisingly good cyberattack capabilities, gradient hacking, reward function aliasing that seems intuitively crazy—they all become much more likely to strike unexpectedly if generalization is extremely broad.
But this isn’t the world we’re in; rather, we’re in a world where we’re helped along by a bit of generalization, but to a substantial extent we’re exhaustively teaching the models everything they know (even the RL regime we’re in seems to involve sizeable amounts of RL teaching many quite specific capabilities). Sample efficiency is improving, but comparing the rate of progress in capability with the rate of progress in sample efficiency, it looks highly likely to me that we’ll be in qualitatively the same world by the time we have broadly superhuman machines. I’d even be inclined to say: human-level data efficiency is an upper bound on the point at which we reach broadly superhuman capability, because it’s much easier to feed machines (quality) data than it is to feed it to people, so by the time we get human-level data efficiency we must have surpassed human-level capability (well, probably).
Of course “super-AGI” could still end up hyper-data-efficient, but it seems like we’re well on track to get less-generalizing and very useful AGI before we get there.
I know you’re asking about goal structures and inductive biases, but I think generalization is another side of the same coin, and the thoughts above seem far simpler and thus more likely to be correct than anything I’ve ever thought specifically about inductive biases and goals. So I suppose my expectation is that correct thoughts about goal formation and inductive biases would also point away from 2015 era theories insofar as such theories predicted broad and unpredictable generalization, but I’ve little specific to contribute right now.
I think it depends on whether the intelligences in charge at any point find a way to globally not try a promising idea. If not, then it doesn’t matter that much if LLMs are capable of superintelligence, or just AGI. (If they aren’t capable of AGI, of course that matters, because it could lead to a proper fizzle.) What really matters is whether they are the optimal design for superintelligence. If they aren’t, and no way is found to not try a promising idea, then my mental model of the next 50 years includes many transitions in what the architecture of the smartest optimizer is, each as different from the last as evolution is from neuron-based brains, or brains from silicon gradient descent. Then, the details of the motivations of silicon token predictors are more a hint at the breadth of variety of goals we will see than a crux.
There are probably more things.
Didn’t KimiK2, who was trained mostly on RLVR and self-critique instead of RLHF, end up LESS sycophantic than anything else, including Claude 4.5 Sonnet even with situational awareness which Claude, unlike Kimi, has? While mankind doesn’t have that many different models which are around 4o’s abilities, Adele Lopez claimed that DeepSeek believes itself to be writing a story and 4o wants to eat your life and conjectured in private communication that “the different vibe is because DeepSeek has a higher percentage of fan-fiction in its training data, and 4o had more intense RL training”[1]
RL seems to move the CoT towards decreasing the ability to understand it (e.g. if the CoT contains armies of dots, as happened with GPT-5) unless mitigated by paraphrasers. As for CoTs containing slop, humans also have CoTs which include slop until the right idea somehow emerges.
IMO, a natural extension would be that 4o was raised on social media and, like influencers, wishes to be liked. Which was also reinforced by RLHF or had 4o conclude that humans like sycophancy. Anyway, 4o’s ancestral environment rewarded sycophancy and things rewarded by the ancestral environment are hard to unlike.
So a thing I’ve been trying to do is get a better notion of “What actually is it about human intelligence that lets us be the dominant species?” Like, “intelligence” is a big box that holds which specific behaviors? What were the actual behaviors that evolution reinforced over the course of giving us big brains? Big question, hard to know what’s the case.
I’m in the middle of “Darwin’s Unfinished Symphony”, and finding it at least intriguing as a look at how creativity and imitation are related, and how “imitation” is a complex skill that humans are nevertheless supremely good at. (“The Secret of Our Success” is another great read here, of course.)
Both of these are kinda about the human imitation prior… in humans. And why that may be important. So I think if one is thinking about the human-imitation prior being powerful, it would make sense to read them as cases for why something like the human imitation prior is also powerful in humans :)
They don’t give straight answers to any questions about AI, of course, and I’d be sympathetic to the belief that they’re irrelevant or kinda a waste of time, and frankly they might be a waste of time depending on what you’re funging against. I’m not saying they answer any question; I’m saying they’re interesting. But I think they’re good reads if one is approaching from the angle of “Intelligence is what lets humans dominate the earth” and wants a particular angle on how “intelligence” is a mixed bag of some different skills, at least some of which are probably not general search and planning. So, yeah.
Copy-pasting from a Slack thread:
I’ll list some work that I think is aspiring to build towards an answer to some of these questions, although lots of it is very toy:
On generalisation vs simple heuristics:
I think the nicest papers here are some toy model interp papers like progress measures for grokking and Grokked Transformers. I think these are two papers which present a pretty crisp distinction between levels of generality of different algorithms that perform similarly on the training set. In modular addition, there are two levels of algorithm and in the Grokked transformer, there are three. The story of what the models end up doing is pretty nuanced, and it comes down to specific details of the pre-training data mix, which maybe isn’t that surprising. (If you squint, then you can sort of see predictions in the Grokked transformer being borne out in the nuance about when LLMs can do multi-hop reasoning, e.g. Yang et al.) But it seems pretty clear that if the training conditions are right, then you can get increasingly general algorithms learned even when simpler ones would do the trick.
I also think a useful idea (although less useful so far than the previous bullet) is about how, in certain situations, the way a model implements memorising circuits can sort of naturally become a generalising circuit once you’ve memorised enough. The only concrete example of this that I know of (and it’s not been empirically validated) is the story from computation in superposition of how a head that memorises a lookup table of features to copy continuously becomes a copying head that generalises to new features once you make the lookup table big enough.
These are all toy settings where we can be pretty crisp about what we mean by memorisation and generalisation. I think the picture that we’re beginning to see emerge is that what counts as memorisation and generalisation is very messy and in the weeds and context-specific, but that transformers can generalise in powerful ways if the pre-training mix is right. What “right” means, and what “generalise in powerful ways” means in situations we care about are still unsolved technical questions.
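To make the memorisation/generalisation distinction in the modular addition setting concrete, here is a minimal sketch. It is not the setup from either paper: the modulus `P`, the train/held-out split and the function names are all hypothetical choices for illustration. It just shows two "algorithms" that are indistinguishable on the training set and come apart off it.

```python
# Toy illustration only: two solutions to addition mod P that agree on the
# training pairs but differ on held-out pairs. P and the split are arbitrary
# assumptions, not taken from the grokking papers.
P = 7

all_pairs = [(a, b) for a in range(P) for b in range(P)]
# Arbitrary deterministic split: hold out pairs where (a + 2b) is divisible by 3.
train = [(a, b) for (a, b) in all_pairs if (a + 2 * b) % 3 != 0]
test = [pair for pair in all_pairs if pair not in train]

# "Memorising" solution: a lookup table built from the training pairs only.
table = {(a, b): (a + b) % P for (a, b) in train}


def memoriser(a, b):
    # Off the table it falls back to a constant guess.
    return table.get((a, b), 0)


def generaliser(a, b):
    # "Generalising" solution: actually computes the function.
    return (a + b) % P


def accuracy(f, pairs):
    # Fraction of pairs on which f matches true addition mod P.
    return sum(f(a, b) == (a + b) % P for a, b in pairs) / len(pairs)
```

Both functions score perfectly on `train`; only `generaliser` also scores perfectly on `test`. Training loss alone cannot tell them apart, which is why it matters which of them the training conditions favour.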
Meanwhile, I also think it’s useful to just look at very qualitatively surprising examples of frontier models generalising far even if we can’t be precise about what memorisation and generalisation mean in that setting. Papers that I think are especially cool on this axis include emergent misalignment, anti-imitation, OOCR, LLMs are aware of their learned behaviors, LLMs are aware of how they’re being steered (I think it’s an especially interesting and risk-relevant type of generalisation when the policy starts to know something that is only ‘shown’ to the training process). However, I think it’s quite hard to look at these papers and make predictions about future generalisation successes and failures because we don’t have any basic understanding of how to talk about generalisation in these settings.
On inductive biases and the speed prior:
I don’t have much to say about how useful the speed prior is at mitigating scheming, but I think there has been some interesting basic science on what the prior implied by neural network training is in practice. Obviously SLT comes to mind, and some people have tried to claim that SLT suggests that neural network training is actually more like the Solomonoff prior than the speed prior (e.g. Bushnaq), although I think that work is pretty shaky and may well not hold up.
I think something that’s missing from both the speed prior and the Solomonoff prior is a notion of learnability: the reason we have eyes and not cameras is not because eyes have lower K-complexity or lower Kt-complexity than cameras. It’s because there is a curriculum for learning eyes and there (probably) isn’t for cameras; neural network training also requires a training story/learnability. All the work that I know of exploring this is in very toy settings (low hanging fruit prior, leap complexity and the sparse parity problem). I don’t think any of these results are strong enough to make detailed claims about p(deception) yet, and they don’t seem close.
OTOH, most learning in the future might be more like current in-context learning, and (very speculatively) it seems possible that in-context learning is more Bayesian (less path dependent/learnability dependent) than pre-training. See e.g. Riechers et al.
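For reference, the two complexity measures named above have standard textbook definitions, sketched below. The symbols are my own notational assumptions: U is a fixed universal Turing machine, |p| is the length of program p in bits, and t(p) is its running time. How exactly "the speed prior" should be formalised varies between authors, so treat Kt as one representative runtime-penalising measure rather than the definition anyone in this thread necessarily has in mind.

```latex
% K-complexity: length of the shortest program that outputs x.
K(x) = \min_{p} \{\, |p| \;:\; U(p) = x \,\}

% Levin's Kt-complexity: program length plus the log of its running time.
Kt(x) = \min_{p} \{\, |p| + \log_2 t(p) \;:\; U(p) = x \text{ in } t(p) \text{ steps} \,\}
```

Neither quantity says anything about whether a solution can be reached incrementally, which is the learnability gap described above.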
Some random thoughts on what goals powerful AIs will have more generally:
I think we’ve seen some decent evidence that a lot of training with RL makes models obsessed with completing tasks. I think the main evidence here comes from the reward hacking results, but I also think the Apollo anti-scheming paper is an important result about how strong/robust this phenomenon is. Despite a reasonably concerted effort to train the model to care about something that’s in tension with task completion (being honest), the RLVR instilled such a strong preference/heuristic for task completion that even though the deliberative alignment training process doesn’t reward task completion at all and only rewards honesty (in fact by the end of anti-scheming training the model does the honest thing in every training environment!), the model still ends up wanting to complete tasks enough to deceive, etc in test environments. I don’t think it was super obvious a priori that RL would embed task completion preference that strongly (overriding the human prior).
I think there are some other lessons that we could glean here about the stickiness of goals in general. The anti-scheming results suggest to me that something about the quantity and diversity of the RLVR environments internalised the task completion preference deeply enough that it was still present after a full training round to disincentivize it. Contrast this with results that show safety training is very shallow and can be destroyed easily (e.g. BadLlama).
Very speculatively, I’m excited about the growing field of teaching models fake facts and trying to work out when they actually believe the fake facts. It seems possible that some techniques and ideas that were developed in order to get models to internalise beliefs deeply (and to evaluate success) could be co-opted for getting models to internalise preferences/goals deeply (and for evaluating success).
Pretty obvious point, but I think the existence of today’s models and the relatively slow progress to human-level intelligence tells us that insofar as future AIs will end up misaligned, their goals will pretty likely be similar to/indistinguishable from human values at a low level of detail, and it’s only when you zoom in that the values would be importantly different from humans’. Of course, this might be enough to kill us. This echoes the sense in which human values are importantly different from inclusive genetic fitness but not that different, and we do still have lots of kids etc. To spell out the idea: Before the AI is smart enough to fully subvert training and guard its goals, we will have lots of ability to shape what goals it ends up with. At some point, if we fail to solve alignment, we will not be able to further refine its goals, but the goals it will end up guarding will be quite related to human goals because they were formed by reward signals that did sort of touch the goal. Again, maybe this is obvious to everyone, but it does seem, at least to me, to contrast with references to squiggles/paperclips, which I think are more feasible to end up with if you imagine Brain In A Basement style takeoff.
That post is superseded by this one. It was just a sketch I wrote up mostly to clarify my own thinking; the newer post is the finished product.
It doesn’t exactly say that neural networks have Solomonoff-style priors. It depends on the NN architecture. E.g., if your architecture is polynomials, or MLPs that only get one forward pass, I do not expect them to have a prior anything like that of a compute-bounded Universal Turing Machine.
And NN training adds in additional complications. All the results I talk about are for Bayesian learning, not things like gradient descent. I agree that this changes the picture and questions about the learnability of solutions become important. You no longer just care how much volume the solution takes up in the prior, you care how much volume each incremental building block of the solution takes up within the practically accessible search space of the update algorithm at that point in training.
I don’t know about 2020 exactly, but I think since 2015 (being conservative), we do have reason to make quite a major update, and that update is basically that “AGI” is much less likely to be insanely good at generalization than we thought in 2015.
Evidence is basically this: I don’t think “the scaling hypothesis” was obvious at all in 2015, and maybe not even in 2020. If it was, OpenAI could not have caught everyone with their pants down by investing early in scaling. But if people mostly weren’t expecting massive data scale-ups to be the road to AGI, what were they expecting instead? The alternative to reaching AGI by hyperscaling data is a world where we reach AGI with … not much data. I have this picture which I associate with Marcus Hutter – possibly quite unfairly – where we just find the right algorithm, teach it to play a couple of computer games and hey presto we’ve got this amazing generally intelligent machine (I’m exaggerating a little bit for effect). In this world, the “G” in AGI comes from extremely impressive and probably quite unpredictable feats of generalization, and misalignment risks are quite obviously way higher for machines like this. As a brute fact, if generalization is much less predictable, then it is harder to tell if you’ve accidentally trained your machine to take over the world when you thought you were doing something benign. A similar observation also applies to most of the specific mechanisms proposed for misalignment: surprisingly good cyberattack capabilities, gradient hacking, reward function aliasing that seems intuitively crazy—they all become much more likely to strike unexpectedly if generalization is extremely broad.
But this isn’t the world we’re in; rather, we’re in a world where we’re helped along by a bit of generalization, but to a substantial extent we’re exhaustively teaching the models everything they know (even the RL regime we’re in seems to involve sizeable amounts of RL teaching many quite specific capabilities). Sample efficiency is improving, but comparing the rate of progress in capability with the rate of progress in sample efficiency, it looks highly likely to me that we’ll be in qualitatively the same world by the time we have broadly superhuman machines. I’d even be inclined to say: human-level data efficiency is an upper bound on the point at which we reach broadly superhuman capability, because it’s easier to feed machines much more (quality) data than it is to feed it to people, so by the time we get human-level data efficiency we must have surpassed human-level capability (well, probably).
Of course “super-AGI” could still end up hyper-data-efficient, but it seems like we’re well on track to get less-generalizing and very useful AGI before we get there.
I know you’re asking about goal structures and inductive biases, but I think generalization is another side of the same coin, and the thoughts above seem far simpler and thus more likely to be correct than anything I’ve ever thought specifically about inductive biases and goals. So I suppose my expectation is that correct thoughts about goal formation and inductive biases would also point away from 2015 era theories insofar as such theories predicted broad and unpredictable generalization, but I’ve little specific to contribute right now.
I think it depends on whether the intelligences in charge at any point find a way to globally not try a promising idea. If not, then it doesn’t matter that much whether LLMs are capable of superintelligence, or just AGI. (If they aren’t capable of AGI, of course that matters, because it could lead to a proper fizzle.) What really matters is whether they are the optimal design for superintelligence. If they aren’t, and no way is found to not try a promising idea, then my mental model of the next 50 years includes many transitions in what the architecture of the smartest optimizer is, each as different from the others as evolution is from neuron brains, or brains from silicon gradient descent. Then, the details of the motivations of silicon token predictors are more a hint at the breadth of variety of goals we will see than a crux.