Where was the argument for consequentialism (including intense optimisation) and imitation being the only two ways to do impressive feats?
Also, will you change your mind if the current paradigm is still non-psychopathic even when RL training dominates?
(Big picture, I think the main place I might get off the train is in expecting future AI development to use a mix of rewards, including some from other equally capable AIs judging that behaviors aren’t deceptive/unintended/misaligned. And this mirroring the role that “other humans think we’re nice” played in evolution)
Where was the argument for consequentialism (including intense optimisation) and imitation being the only two ways to do impressive feats?
Well, you can also do impressive feats via conventional programming / GOFAI too, but I don’t think you get ASI that way. What else? I dunno, but I think if there was another big-picture approach that plausibly gets to ASI, lots of people would be working on it, and I would have heard of it. Lmk if you think I’m forgetting something.
Also, will you change your mind if the current paradigm is still non-psychopathic even when RL training dominates?
Normally if someone says “RL training dominates”, they mean “the amount of compute applied to RLVR is much greater than the amount of compute applied to pretraining”. That’s very different from “RLVR is so important that the impacts of pretraining are diluted away to irrelevancy”. (E.g. discussions of information efficiency by Toby Ord and by Dwarkesh.) But the latter is what would be relevant here.
Hmm, here are three example scenarios:
If a company did away with pretraining entirely, i.e. just used RLVR from random initialization, and got a non-psychopathic result—I would DEFINITELY feel confused.
If a company did enough RLVR that the chains of thought became completely unrecognizable in any language (e.g. a chain of thought like “…5BnSjYEkIokhPiTePWBlIO1FIwQUOg7PvJ…”) (per the Karpathy quote: “You know you did RL right when the models stop thinking in English”), and got a non-psychopathic result—I would PROBABLY feel confused.
If we stopped seeing papers like Karan & Du 2025, or Venhoff et al. 2025, or Yue et al. 2025, and instead saw results that were astronomically improbable (e.g. 10^-100) with the base model, even with arbitrarily fancy sampling techniques—I would MAYBE feel confused.
“Feeling confused” is weaker than “changing my mind”, because maybe I would puzzle over it and find some way to make sense of it. But also maybe not. Probably I could make a stronger statement / prediction if I spent a bunch of time thinking about it, but hopefully this gives some sense of what I have in mind.
other equally capable AIs judging that behaviors aren’t deceptive/unintended/misaligned
And this mirroring the role that “other humans think we’re nice” played in evolution
You’re kinda pointing to a challenge to my view. My view is: a hypothetical smart consequentialist human with a ruthless drive to have lots of grandkids will have more grandkids than a human with the normal suite of innate drives, like falling in love and so on. Proof: strategy-stealing. Whatever the latter human does, if it’s an objectively good way to have lots of grandkids, then the former human can notice that it’s a good strategy and do the same thing.
And then the challenge to that view is: …But we did actually evolve all these innate drives that make us intrinsically desire love and curiosity etc. Doesn’t that prove my strategy-stealing argument wrong?
I think there’s a good answer to that challenge, and it’s some combination of: (1) evolution has no way to build “a ruthless drive to have lots of grandkids” into our brains (details), and (2) even if it did, humans are not sufficiently smart and strategic in regards to long-term planning to be very effective ruthless consequentialists. (“We are the least intelligent species capable of building an industrial civilization.”)
Anyway, I claim that neither (1) nor (2) would be applicable in your AI training scenario (as I understand it), so the strategy-stealing argument would go through, and thus the RL selection pressure should at best be neutral between ruthless scheming strategies vs intrinsically honest ones, and much more likely favor the former.
(I am however assuming no interpretability / mind-reading.)
Thanks! Re other ways to do the feat:
Bootstrapping from very smart human intelligence by doing loads of serial thinking, writing great non-sociopathic outputs, and doing SFT on them
Sophisticated scaffolds—basically GOFAI, but you can insert LLMs all over the place
Searching over combinations of the above, plus RL and imitation, for methods that don’t produce sociopaths… Evolution and within-lifetime learning don’t produce sociopaths despite no such search happening!!
Re RL: thanks for those examples. I’d have defined it via something like number of data points or amount of training signal—agree compute is the wrong measure.
> other equally capable AIs judging that behaviors aren’t deceptive/unintended/misaligned
I’m pessimistic mainly for reasons discussed in “‘Behaviorist’ RL reward functions lead to scheming” §3.1.
Hmm but the cognition “try to break the rules without getting caught” might in practice lead to worse reward than just trying to follow the rules, if the chances to secretly hack are sufficiently rare and the punishment for getting caught is high enough.
Eg I think that if I tried to “pretend I cared about my friends/partner but exploit them when I can get away with it”, it would MASSIVELY fail me in the long run. Ppl are good at reading ppl. We’re transparent to each other. There is big cognitive overhead to maintaining two narratives. I’d end up with fewer committed/deep long-term friendships.
Relatedly, if you’re in a cult, the best way to signal deep commitment is to become a true believer.
And it will, by comparison, be way easier for the overseers in the case of AI. Interp. Seeing all behaviour. Running counterfactual experiments.
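The expected-reward trade-off in the comments above can be made concrete with a toy calculation. All numbers below are hypothetical, purely to illustrate the structure of the argument (rare opportunities, high punishment), not empirical estimates:

```python
# Toy expected-reward comparison: "hack when you can get away with it"
# versus "just follow the rules". All numbers are hypothetical.

def expected_reward_scheming(p_opportunity, p_caught, hack_gain, penalty, base):
    """Expected reward per episode for an agent that hacks whenever it sees a chance."""
    # With prob p_opportunity an exploitable situation arises and the agent hacks;
    # each hack is caught with prob p_caught, incurring the penalty.
    hack_ev = hack_gain - p_caught * penalty
    return base + p_opportunity * hack_ev

def expected_reward_honest(base):
    """Expected reward per episode for an agent that always follows the rules."""
    return base

# Rare opportunities + harsh punishment: honesty wins on average.
ev_scheme = expected_reward_scheming(
    p_opportunity=0.01, p_caught=0.5, hack_gain=10.0, penalty=100.0, base=1.0)
ev_honest = expected_reward_honest(base=1.0)
print(ev_scheme, ev_honest)       # 0.6 vs 1.0
print(ev_scheme < ev_honest)      # True: scheming loses in expectation here
```

Of course, the comparison flips if `p_caught` or `penalty` is small enough, which is exactly the crux of the disagreement above.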
I think there’s a good answer to that challenge, and it’s some combination of: (1) evolution has no way to build “a ruthless drive to have lots of grandkids” into our brains (details), and (2) even if it did, humans are not sufficiently smart and strategic in regards to long-term planning to be very effective ruthless consequentialists. (“We are the least intelligent species capable of building an industrial civilization.”)
Ok interesting, we have pretty different intuitions here!
On 1, the equilibrium that evolution chose is way less sociopathic than it could have been. Some ppl are sociopaths! So it can be done. (Maybe there’s a lot of work done by the “aim for grandkids” part, sorry I didn’t read the link. But ppl do have some desires for healthy kids and grandkids.)
On 2, I agree that if one agent’s cognition rises while the “are you a sociopath” checks stay constant, we’re in trouble. But we should imagine both rising with AI capabilities. (Note that humans evolved in a similar situation, with both sides rising, and the equilibrium was non-sociopathy.) Also, the world itself becomes more complex, making scheming plans harder to pull off.
I’m flat-out sceptical that the sociopath strategy would dominate in equilibrium. Eg: male sociopaths would trick many women into partnering with them, then leave each one. Women evolve to check hard for true commitment. Men evolve credible signals of non-sociopathy. I just don’t believe such signals would be impossible to find.
Zooming out, I recall you thinking that humans aren’t sociopaths bc they have some special specific reward thing that we can’t replicate, related to wanting others to approve. Whereas I see no reason to think it’s some specific thing. We just had some selection pressure to seem like good non-sociopathic allies. That selection pressure worked. If we apply similar selection to AI, it will probably also work—the implementation details won’t need to match some specific human learning circuit.
(partly copying from another comment:) If you compare a human in 30000 BC to a human today, our brains are full of new information that wasn’t in the training data of 30000 BC. I want to talk about: what would it look like to be in a world where you can put millions of LLMs in a sealed box containing a VR environment, for (the equivalent of) thousands of years, and then we open up the box and find that the LLMs have made an analogous kind of scientific and technological progress?
I don’t think any of those three options can get there (and nor can imitative learning). (I’m disputing capabilities here, not alignment.)
The first one (bootstrapping) has the issue that if the serial thinking is not 100% perfect, then it will sometimes make mistakes, and then you’re SFT’ing on those mistakes, making the model more confident in them, and then the next round of serial thinking will incorporate and build on those mistakes. Repeat a billion times in a sealed box, and I think it would spiral into nonsense—it would get dumber, not smarter.
…I assume that people are already trying to do this, so I guess we’ll find out one way or the other how far it gets. ¯\_(ツ)_/¯ If I’m wrong and it does get to ASI (e.g. the “sealed box” standard above), perhaps that would be good news compared to what I’m expecting … although I suppose it might spiral into misalignment too, not sure.
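The compounding-error worry above can be sketched with a toy model. Everything here is hypothetical (the initial accuracy, and an assumed “absorb” rate at which SFT entrenches mistakes that slip past filtering); it only illustrates the dynamics, not any real training run:

```python
# Toy model of iterated "generate -> filter -> SFT -> repeat" bootstrapping,
# where some fraction of mistakes in the self-generated data gets reinforced.
# All parameters are hypothetical and purely illustrative.

def bootstrap_accuracy(p, rounds, absorb=0.5):
    """Per-step reasoning accuracy after `rounds` of self-training.

    Each round, a fraction (1 - p) of steps in the self-generated data are
    mistakes; SFT is assumed to entrench a share `absorb` of them, so
    accuracy ratchets downward instead of upward."""
    for _ in range(rounds):
        p = p * (1.0 - absorb * (1.0 - p))  # mistakes reinforced each round
    return p

print(bootstrap_accuracy(0.99, rounds=1))   # ~0.985: slight degradation
print(bootstrap_accuracy(0.99, rounds=50))  # near zero: spirals into nonsense
```

Under these assumptions there is no stable equilibrium short of collapse; the optimistic scenario corresponds to claiming the filter is good enough to drive the effective `absorb` to zero (or negative, i.e. the filter enriches for correct reasoning).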
The second one (scaffolds) has an issue that (IIUC) you’re piling up entire new fields of knowledge into the context window, without that knowledge being present in the weights. LLMs would be very bad at that. We can see them struggle with novel complexity in the context window, even in everyday situations. And this would be much worse. For example, imagine training an LLM before linear algebra existed, and then trying to have it understand linear algebra (matrices, bases, rank, nullity, spans, determinants, trace, eigenvectors, dual space, unitarity, etc.) purely by putting all that stuff in the context window. And then ask the LLM tricky questions that rely on those concepts. I really think it wouldn’t work, and that it will keep not working into the future.
The third one (“searching over…”) I don’t understand. Is there a typo? It sounds like “try to solve the technical alignment problem”, which of course I endorse. I don’t think the problem is fundamentally unsolvable; almost no one thinks that.
Eg I think that if I tried to “pretend I cared about my friends/partner but exploit them when I can get away with it”, it would MASSIVELY fail me in the long run. … There is big cognitive overhead to maintaining two narratives. I’d end up with fewer committed/deep long-term friendships.
I claim that “maintaining two narratives” is super easy. We do it all the time when we talk about the fictional world of a TV show, and then in the next breath we talk about the actors and script. I think “maintaining two narratives” is hard in social settings because most of us are not sociopaths! I.e., yes, lying can be draining, but I claim it’s emotionally draining rather than cognitively draining.
Anyway, we’re ultimately talking about ASI here, which can develop whole new fields of knowledge etc. Surely it will be able to ask itself “what would the humans be looking for in this scenario?”, and then do that, whenever humans might be watching. I hear that even today’s LLMs do that (“eval awareness”).
“I’d end up with fewer committed/deep long-term friendships” is kinda circular. Relationships are not really “deep” if e.g. you’re indifferent to the other person and just sucking up to them. But that’s only a problem if you wanted a “deep” relationship in the first place.
Ppl are good at reading ppl. We’re transparent to each other. … And it will, by comparison, be way easier for the overseers in the case of AI. Interp. Seeing all behaviour. Running counterfactual experiments.
Yes, interpretability is potentially an important caveat, but I don’t think it adds up to much reason for optimism. According to my worldview: if we found a way to use interpretability to test for scheming, then we could use it, and we would definitely find scheming, because duh, that’s the natural consequence of how we will train ASI. And now what? If we delete the model and re-run the training from scratch, we’ll just get the same result. Or, if we use this interpretability signal for fine-tuning, we have the usual problem that we’re training the AI to hide its thoughts.
I’m skeptical of “counterfactual experiments” because a smart AI will be able to tell whether it’s in the real world; see Distinguishing test from training (Nate Soares 2022), and (again) “eval awareness” in LLMs.
…We just had some selection pressure to seem like good non-sociopathic allies. That selection pressure worked.…
This section (or at least this excerpt) seems to be analogizing evolution to LLM training. Whereas I think a better framework is to say that evolution designed a within-lifetime learning algorithm in the brain, and here we’re having a conversation about how that learning algorithm works. I claim that this learning algorithm is a yet-to-be-invented variant of model-based actor-critic RL, and that it has a weird reward function that (in a certain environment) leads to caring about our friends, and to pride, and to trying to fit in, etc., among many other things.
There was obviously selection pressure for that reward function in the case of humans, and we can keep arguing about why. But would there be one for AIs? I claim that this question is moot, because normal practice in RL does not involve choosing a reward function via an outer-loop blind search analogous to evolution. The reward function is almost always part of the learning algorithm, not a thing that is itself selected by learning. (More discussion here.)
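To make that distinction concrete, here is a minimal sketch of standard RL practice: the reward function is fixed, hand-written code inside the training loop, and only the policy is learned. Names and the toy task are generic placeholders, not any particular library’s API:

```python
# Sketch: in normal RL practice the reward function is a fixed piece of the
# learning algorithm; there is no evolution-style outer loop searching over
# reward functions. Toy "output the parity of the state" task.

def reward_fn(state, action):
    """Written by the experimenter and held fixed for the whole run."""
    return 1.0 if action == state % 2 else 0.0

class Policy:
    """Minimal learnable policy: a lookup table of guessed actions (0 or 1)."""
    def __init__(self):
        self.table = {}
    def act(self, state):
        return self.table.get(state, 0)
    def update(self, state, action, reward):
        if reward == 0.0:                  # failed: flip the stored guess
            self.table[state] = 1 - action

def train(policy, states):
    for s in states:
        a = policy.act(s)
        r = reward_fn(s, a)     # the reward function is *part of* the algorithm...
        policy.update(s, a, r)  # ...and only the policy's contents change
    return policy

trained = train(Policy(), [0, 1, 2, 3] * 2)
print(all(trained.act(s) == s % 2 for s in range(4)))  # True

# What does NOT normally happen (the evolution analogy): an outer loop that
# blindly mutates reward_fn itself and selects variants by downstream "fitness".
```

The commented-out scenario at the end is the one the evolution analogy would require; the point above is that nothing like it appears in ordinary RL pipelines.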
Separately, I can talk about why I think kindness, norm-following, etc. are human innate social drives, as opposed to strategies developed by a more generic within-lifetime learning algorithm, if you’re skeptical of that claim. (Are you?)