Thanks for your thoughts and concerns! We’ve updated the blog post to address some of them. The most important changes involve adding nuance to language and being explicit everywhere that our measure of incoherence refers to the relative contribution of variance to error, and is separate from how overall error changes with model capability, or how self-consistent successful trajectories are.
The blog post was put together much more hastily than the paper. Probably a lot more cumulative effort will go into reading the blog than the paper, so this is the opposite of how we should have distributed our effort. Reading and editing our post again, I agree parts of it were sloppy. Hopefully it is now at least improved (though certainly still not perfect!). This was my fault – I’m sorry. And I regret that we underutilized an opportunity to communicate these ideas to a larger audience.
That being said … I think your take on the paper is incorrect.
> in most experiments, model coherence remained unchanged or increased with size.
This is actually quite difficult to judge, and depends on how you weigh different experiments. For example, if you look at Figure 4, you will see a robust positive correlation between measures of intelligence and error incoherence in two out of three experimental settings (panels b and c), and a mixed relationship in only one configuration (panel a).
My takeaway from the experiments across model scale is that the observed relationship between intelligence and incoherence in errors is complex, but that increasing model intelligence does not consistently reduce the incoherence of errors on fixed tasks. In contrast, the positive relationship between task length and error incoherence is extremely robust. On net, this suggests that as smarter models perform more complex tasks, their errors will become more incoherent.
> But bias stemming from a lack of ability is not the same as bias stemming from a lack of propensity.
I’m confused by the intended meaning here. Propensity is a tendency to behave in a particular way. Lack of ability could be a cause of an observed propensity, but I don’t think that’s what you intend to say? (Do you mean to use a word like “intention” rather than “propensity”?)
> I think this result provides approximately no evidence that can be used to extrapolate to superintelligent AIs where misalignment might pose actual risks.
The current AI revolution is driven by the observation that if you find scaling laws for losses, those scaling laws often extrapolate reliably from toy systems all the way up to intelligent models.
I expect empirical relationships between different error components to predict the types of error we see in superintelligent systems better than I expect thought experiments to generalize to the behavior of superintelligent systems.
I don’t expect to convince you of this stance. You should understand, though, that from my perspective, simple, operationalizable definitions and empirical scaling-law-like measurements are some of the strongest predictors we can build for how superintelligent systems will behave.
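To make that concrete, here is a minimal sketch of the workflow I have in mind, with invented numbers (the data and parameter values below are illustrative, not from our paper): fit a saturating power law to losses measured on small models, then extrapolate it orders of magnitude beyond the measured range.

```python
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical (model size, loss) measurements from small "toy" runs.
# These numbers are invented for illustration only.
n_params = np.array([1e6, 3e6, 1e7, 3e7, 1e8])
losses = np.array([3.39, 2.88, 2.51, 2.28, 2.12])

def power_law(n, a, b, c):
    # Saturating power law: loss falls as n^-b toward an irreducible floor c.
    return a * n ** (-b) + c

# Fit the curve; p0 gives the optimizer a reasonable starting point.
(a, b, c), _ = curve_fit(power_law, n_params, losses, p0=(100.0, 0.3, 1.0))

# Extrapolate two orders of magnitude past the largest measured model.
print(f"predicted loss at 1e10 params: {power_law(1e10, a, b, c):.2f}")
```

The bet is that measured relationships between error components can be extrapolated in this same scaling-law spirit.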
> Ok, now let’s consider a model with variance of 1e-3 and bias of 1e-6. Huge “incoherence”! Am I supposed to be reassured that this model will therefore not coherently pursue goals contrary to my interests?
This does seem like a much better position to be in. It would suggest that when the model pursues a goal it was not trained to pursue, that goal will not be self-consistent across instances, or across time (see incoherence vs. sequence length). There are still ways this can go wrong … but at least the bad behavior is less likely to be coherent across time and space.
(Unless, of course, the model was directly trained to pursue a goal contrary to your interests. See the discussion of goal misspecification and reward hacking.)
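To make the arithmetic explicit, here is a minimal sketch, treating the quoted numbers as the bias and variance contributions to total error (the paper’s exact estimator may differ):

```python
def incoherence(bias_err: float, var_err: float) -> float:
    """Fraction of total error contributed by variance.

    Sketch of the definition discussed above: incoherence is the
    relative contribution of variance to error, so it lies in [0, 1].
    """
    return var_err / (bias_err + var_err)

# The commenter's example: variance 1e-3, bias 1e-6.
print(incoherence(bias_err=1e-6, var_err=1e-3))  # ~0.999: variance-dominated

# Contrast: a model that consistently pursues one specific wrong outcome
# (e.g. a deceptive schemer) has almost all bias and little variance.
print(incoherence(bias_err=1e-3, var_err=1e-6))  # ~0.001: bias-dominated
```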
> an extremely dumb, broken model which always outputs the same answer regardless of input is extremely “coherent”. A rock is also extremely “coherent”, by this definition.
That is correct about the dumb, broken model. For the rock, incoherence is undefined: you would need to define a training objective, and a way to compare outcomes produced by the rock to oracular outcomes under that objective.
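As a toy illustration of what that requires (the setup and names here are mine, not from the paper): once you fix an oracular target under some assumed training objective, you can estimate bias and variance from repeated outputs, and a constant-output model gets incoherence 0 because all of its error is bias.

```python
import numpy as np

rng = np.random.default_rng(0)

def error_decomposition(outputs: np.ndarray, target: float):
    """Split mean squared error against an oracle target into bias^2 + variance."""
    bias_sq = (outputs.mean() - target) ** 2
    variance = outputs.var()
    return bias_sq, variance

target = 1.0  # oracular outcome under some assumed training objective

# A "dumb, broken" model: always outputs the same wrong answer.
constant_model = np.full(10_000, 0.2)

# A noisy model: wrong by the same amount on average, but inconsistently so.
noisy_model = 0.2 + rng.normal(0.0, 0.5, size=10_000)

for name, outs in [("constant", constant_model), ("noisy", noisy_model)]:
    b2, v = error_decomposition(outs, target)
    print(f"{name}: bias^2={b2:.3f} variance={v:.3f} incoherence={v / (b2 + v):.3f}")
```

The constant model comes out perfectly coherent (incoherence 0), exactly as the quoted objection notes.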
> The paper basically assumes away the possibility of deceptive schemers
I don’t think this is correct, unless you mean something different by “deceptive schemer” than I understand. A deceptive schemer maps cleanly onto bias-dominated failures (the bias-dominated case in the sketch above): it will consistently achieve a specific outcome, which is not the outcome we want it to achieve.
> > The set of dynamical systems that act as optimizers of a fixed loss is measure zero in the space of all dynamical systems.
>
> This seems to me like a vacuous attempt at defining away the possibility of building superintelligence (or perhaps “coherent optimizers”). I will spend no effort on its refutation, Claude 4.5 Opus being capable of doing a credible job:
Claude’s response here is correct, and is the point. Essentially every interesting property you would like to imbue into a dynamical system is measure zero in the space of dynamical systems. You almost always have to work hard for each of them.
> The broader (and weaker) argument—that we “shouldn’t expect AI to act as coherent optimizers without considerable effort”—might be trivially true. Unfortunately Anthropic (and OpenAI, and Google Deepmind, etc) are putting forth considerable effort to build systems that can reliably solve extremely difficult problems over long time horizons (“coherent optimizers”).
I think your argument conflates the overall error rate with the contribution of variance to the error. The frontier labs are working very hard to reduce the overall error rate of their systems. They are not applying meaningful optimization pressure to the decomposition of those errors into bias and variance.
As models get smarter, we expect them to act more like optimizers targeting the outcomes they were trained to achieve. The question here isn’t whether the models will get better at optimizing a target. The question is, when they fail to optimize the right target, what will that failure look like?
> The authors also say that we shouldn’t “expect this to be easier than training other properties into their dynamics”, but there are reasons to think this is false, which renders the bare assertion to the contrary kind of strange.
The article you link to doesn’t imply that models will become optimizers faster than they will settle on the right goal; it doesn’t speak to that question. It implies that as the model is trained to a loss of 0 (e.g. on some RL task), it will become a perfect optimizer. But as the model is trained to a loss of 0, it will also come to perfectly align with the training goal.
The question we are probing is what the deviations from perfection look like. If we have an extremely powerful but still imperfect model, will its deviations from perfection tend to be dominated by it pursuing the wrong outcome, or by it not acting in pursuit of a consistent outcome?