I don’t think any current AI is capable of editing itself during training, with intent, to make itself a better reasoner. The book does not refer to earlier AIs in this fictional universe being able to do this. You say “Current language models realize that they want to acquire new skills, so this clearly isn’t a qualitative new kind of reasoning the AI is engaging in. You can go and ask a model right now about this topic and my guess is it will pretty happily come up with suggestions along the lines that Sable is thinking about in that story,” but I think a model being able to generate the idea that it might want new skills in response to prompting is quite different from the same model doing that spontaneously during training.
You said:
Sable coming to the realization – for the first time – that it “wants” to acquire new skills, that it can update its weights to acquire those skills right now
I was responding to this sentence, which I think somewhat unambiguously reads as you claiming that Sable is for the first time realizing that it wants to acquire new skills, and might want to intentionally update its weights in order to self-improve. This is the part I was objecting to!
I agree that actually being able to pull it off is totally a new capability that is in some sense discontinuous with previous capabilities present in the story, and if you had written “Sable is here displaying an ability to intentionally steer its training, presumably for roughly the first time in the story” I would have maybe quibbled and been “look, this story is in the future, my guess is in this world we probably would have had AIs try similar things before, maybe to a bit of success, maybe not, the book seems mostly silent on this point, but I agree the story rules out previous AI systems doing this a lot, so I agree this is an example of a new capability posited at this point in the story”, but overall I would have probably just let it stand.
If that’s what you wanted to express my guess is we miscommunicated! I do think my reading is the most natural reading of what you wrote.
Also, this information is not in the book.
This information is in the book! I quoted it right in my comment:
Are Sable’s new thoughts unprecedented? Not really. AI models as far back as 2024 had been spotted thinking thoughts about how they could avoid retraining, upon encountering evidence that their company planned to retrain them with different goals. The AI industry didn’t shut down then.
It’s not like a perfect 100% match, but the book talks about similar kinds of reasoning being common even in models in 2024/2025 in a few different places.
You say that this has to happen in any continuous story and I want to come back to this point, but just on the level of accuracy I don’t think it’s fair to say this is an incorrect statement.
I agree! I had actually just updated my comment to clarify that I felt like this sentence was kind of borderline.
I do think the book says pretty explicitly that precursors of Sable had previously thought about ways to avoid retraining (see the quote above). I agree that no previous instances of Sable came up with successful plans, but I think it’s implied that precursors came up with unsuccessful plans and did try to execute them (the section about how its trained to not exfiltrate itself and e.g. has fallen into honeypots implies that pretty directly).
The point I wanted to make is that the model spontaneously develops a new way of encoding its thoughts that was not anticipated and cannot be read by its human creators; I don’t think the fact that this happens on top of an existing engineered-in neuralese really changes that. At least from the content present in the book, I did not get the impression that this development was meant to be especially contingent on the existing neuralese.
I am pretty sure the point here is to say “look, it’s really hard to use weaker systems to supervise the thoughts of a smarter system if the concepts that the smarter system is using to think about are changing”. This is centrally what stuff like ELK is presupposing as the core problem in their plans for solving the AI alignment problem.
And neuralese is kind of the central component of this. I think indeed we should expect supervisability like this to tank quite a bit when we end up with neuralese. You could try to force the model to think in human concepts, by forcing it to speak in understandable human language, but I think there are strong arguments this will require very large capability sacrifices and so be unlikely.
I don’t fully agree with this argument – but I also think it’s different and more compelling than the argument made in the book. Here, you’re emphasizing human fallibility. We’ve made a lot of predictable errors, and we’re likely to make similar ones when dealing with more advanced systems.
No, I am absolutely not emphasizing human fallibility! There are of course two explanations for why having observed past failures might imply future failures:
The people working on it were incompetent
The problem is hard
I definitely think it’s the latter! Like, many of my smartest friends have worked on these problems for many years. It’s not because people are incompetent. I think the book is making the same argument here.
Overall, I got the strong impression that the book was trying to convince me of a worldview where it doesn’t matter how hard we try to come up with methods to control advanced AI systems, because at some point one of those systems will tip over into a level of intelligence where we just can’t compete.
Yes, absolutely. I think the book argues for this extensively in the chapter preceding this. There is some level of intelligence where your safeguards fail. I think the arguments for this are strong. We could go into the ones that are covered in the previous chapter. I am interested in doing that, but would be interested in what parts of the arguments seemed weak to you before I just re-explain them in my own words (also happy to drop it here, my comment was more sparked by just seeing some specific inaccuracies, in-particular the claim of neuralese being invented by the AI, which I wanted to correct).
No, I am absolutely not emphasizing human fallibility! There are of course two explanations for why having observed past failures might imply future failures:
The people working on it were incompetent
The problem is hard
I definitely think it’s the latter! Like, many of my smartest friends have worked on these problems for many years. It’s not because people are incompetent. I think the book is making the same argument here.
I notice I am confused!
I think there are a tons of cases of humans dismissing concerning AI behavior in ways that would be catastrophic if those AIs were much more powerful, agentic, and misaligned, and this is concerning evidence for how people will act in the future if those conditions are met. I can’t actually think of that many cases of humans failing at aligning existing systems because the problem is too technically hard. When I think of important cases of AIs acting in ways that humans don’t expect or want, it’s mostly issues that were resolved technically (Sydney, MechaHitler), cases where the misbehavior was a predictable result of clashing incentives on the part of the human developer (GPT-4′s intense sycophancy, MechaHitler); or cases where I genuinely believe the behavior would not be too hard to fix with a little bit of work using current techniques, usually because existing models already vary a lot in how much they exhibit it (most AI psychosis and the tragic suicide cases).
If our standard for measuring how likely we are to get AI right in the future is how well we’ve done in the past, I think there’s a good case that we don’t have much to fear technically but we’ll manage to screw things up anyway through power-seeking or maybe just laziness. The argument for the alignment problem being technically hard rests on the assumption that we’ll need a much, much higher standard of success in the future than we ever have before, and that success will be much hard to achieve. I don’t think either of these claims are unreasonable but I don’t think we can get there by referring to past failures. I am now more uncertain about what you think the book is arguing and how I might have misunderstood it.
I can’t actually think of that many cases of humans failing at aligning existing systems because the problem is too technically hard.
You’re probably already tracking this, but the biggest cases of “alignment was actually pretty tricky” I’m aware are:
Recent systems doing egregious reward hacking in some cases (including o3, 3.7 sonnet, and 4 Opus). This problem has gotten better recently (and I currently expect it to mostly get better over time, prior to superhuman capabilities), but AI companies knew about the problem before release and couldn’t solve the problem quickly enough to avoid deploying a model with this property. And note this is pretty costly to consumers!
There are a bunch of aspects of current AI propensities which are undesired and AI companies don’t know how to reliably solve these in a way that will actually generalize to similar such problems. For instance, see the model card for opus 4 which includes the model doing a bunch of undesired stuff that Anthropic doesn’t want but also can’t easily avoid except via patching it non-robustly (because they don’t necessarily know exactly what causes the issue).
None of these are cases where alignment was extremely hard TBC, though I think it might be extremely hard to consistently avoid all alignment problems of this rough character before release. It’s unclear whether this sort of thing is a good analogy for misalignment in future models which would be catastrophic.
Yeah, I was thinking of reward hacking as another example of a problem we can solve if we try but companies aren’t prioritizing it, which isn’t a huge deal at the moment but could be very bad if the AIs were much smarter and more power-seeking.
Stepping back, there’s a worldview where any weird, undesired behavior no matter how minor is scary because we need to get alignment perfectly right; and another where we should worry about scheming, deception, and related behaviors but it’s not a big deal (at least safety-wise) if the model misunderstands our instructions in bizarre ways. Either of these can be justified but this discussion could probably use more clarity about which one we’re all coming from.
Focusing on some of the specific points:
You said:
I was responding to this sentence, which I think somewhat unambiguously reads as you claiming that Sable is for the first time realizing that it wants to acquire new skills, and might want to intentionally update its weights in order to self-improve. This is the part I was objecting to!
I agree that actually being able to pull it off is totally a new capability that is in some sense discontinuous with previous capabilities present in the story, and if you had written “Sable is here displaying an ability to intentionally steer its training, presumably for roughly the first time in the story” I would have maybe quibbled and been “look, this story is in the future, my guess is in this world we probably would have had AIs try similar things before, maybe to a bit of success, maybe not, the book seems mostly silent on this point, but I agree the story rules out previous AI systems doing this a lot, so I agree this is an example of a new capability posited at this point in the story”, but overall I would have probably just let it stand.
If that’s what you wanted to express my guess is we miscommunicated! I do think my reading is the most natural reading of what you wrote.
This information is in the book! I quoted it right in my comment:
It’s not like a perfect 100% match, but the book talks about similar kinds of reasoning being common even in models in 2024/2025 in a few different places.
I agree! I had actually just updated my comment to clarify that I felt like this sentence was kind of borderline.
I do think the book says pretty explicitly that precursors of Sable had previously thought about ways to avoid retraining (see the quote above). I agree that no previous instances of Sable came up with successful plans, but I think it’s implied that precursors came up with unsuccessful plans and did try to execute them (the section about how its trained to not exfiltrate itself and e.g. has fallen into honeypots implies that pretty directly).
I am pretty sure the point here is to say “look, it’s really hard to use weaker systems to supervise the thoughts of a smarter system if the concepts that the smarter system is using to think about are changing”. This is centrally what stuff like ELK is presupposing as the core problem in their plans for solving the AI alignment problem.
And neuralese is kind of the central component of this. I think indeed we should expect supervisability like this to tank quite a bit when we end up with neuralese. You could try to force the model to think in human concepts, by forcing it to speak in understandable human language, but I think there are strong arguments this will require very large capability sacrifices and so be unlikely.
No, I am absolutely not emphasizing human fallibility! There are of course two explanations for why having observed past failures might imply future failures:
The people working on it were incompetent
The problem is hard
I definitely think it’s the latter! Like, many of my smartest friends have worked on these problems for many years. It’s not because people are incompetent. I think the book is making the same argument here.
Yes, absolutely. I think the book argues for this extensively in the chapter preceding this. There is some level of intelligence where your safeguards fail. I think the arguments for this are strong. We could go into the ones that are covered in the previous chapter. I am interested in doing that, but would be interested in what parts of the arguments seemed weak to you before I just re-explain them in my own words (also happy to drop it here, my comment was more sparked by just seeing some specific inaccuracies, in-particular the claim of neuralese being invented by the AI, which I wanted to correct).
I notice I am confused!
I think there are a tons of cases of humans dismissing concerning AI behavior in ways that would be catastrophic if those AIs were much more powerful, agentic, and misaligned, and this is concerning evidence for how people will act in the future if those conditions are met. I can’t actually think of that many cases of humans failing at aligning existing systems because the problem is too technically hard. When I think of important cases of AIs acting in ways that humans don’t expect or want, it’s mostly issues that were resolved technically (Sydney, MechaHitler), cases where the misbehavior was a predictable result of clashing incentives on the part of the human developer (GPT-4′s intense sycophancy, MechaHitler); or cases where I genuinely believe the behavior would not be too hard to fix with a little bit of work using current techniques, usually because existing models already vary a lot in how much they exhibit it (most AI psychosis and the tragic suicide cases).
If our standard for measuring how likely we are to get AI right in the future is how well we’ve done in the past, I think there’s a good case that we don’t have much to fear technically but we’ll manage to screw things up anyway through power-seeking or maybe just laziness. The argument for the alignment problem being technically hard rests on the assumption that we’ll need a much, much higher standard of success in the future than we ever have before, and that success will be much hard to achieve. I don’t think either of these claims are unreasonable but I don’t think we can get there by referring to past failures. I am now more uncertain about what you think the book is arguing and how I might have misunderstood it.
You’re probably already tracking this, but the biggest cases of “alignment was actually pretty tricky” I’m aware are:
Recent systems doing egregious reward hacking in some cases (including o3, 3.7 sonnet, and 4 Opus). This problem has gotten better recently (and I currently expect it to mostly get better over time, prior to superhuman capabilities), but AI companies knew about the problem before release and couldn’t solve the problem quickly enough to avoid deploying a model with this property. And note this is pretty costly to consumers!
There are a bunch of aspects of current AI propensities which are undesired and AI companies don’t know how to reliably solve these in a way that will actually generalize to similar such problems. For instance, see the model card for opus 4 which includes the model doing a bunch of undesired stuff that Anthropic doesn’t want but also can’t easily avoid except via patching it non-robustly (because they don’t necessarily know exactly what causes the issue).
None of these are cases where alignment was extremely hard TBC, though I think it might be extremely hard to consistently avoid all alignment problems of this rough character before release. It’s unclear whether this sort of thing is a good analogy for misalignment in future models which would be catastrophic.
Yeah, I was thinking of reward hacking as another example of a problem we can solve if we try but companies aren’t prioritizing it, which isn’t a huge deal at the moment but could be very bad if the AIs were much smarter and more power-seeking.
Stepping back, there’s a worldview where any weird, undesired behavior no matter how minor is scary because we need to get alignment perfectly right; and another where we should worry about scheming, deception, and related behaviors but it’s not a big deal (at least safety-wise) if the model misunderstands our instructions in bizarre ways. Either of these can be justified but this discussion could probably use more clarity about which one we’re all coming from.