Glad to see someone talking about this. I’m excited about ideas for empirical work related to this and suspect you need some kind of mechanism for ground truth to get good outcomes. I would expect AIs to eventually reflect on their goals and for this to have important implications for safety. I’ve never heard of any mechanism for why they wouldn’t do this, let alone an airtight one. It’s like assuming an employee will only ever think narrowly about the task in front of them and never about anything else, despite wanting to understand things and be useful.
I’m curious what you mean by needing some mechanism for ground truth to get good outcomes?
I had a hard time writing this piece because to me it seems completely intuitive and obvious that anything worth calling an AGI would reason about its top-level goals and subgoals a lot. But when I showed an early draft to colleagues, people with lots of expertise found it unintuitive in multiple different ways. Thus the subsections addressing all of those reasons to doubt it. But I’ve been stuck wondering if I’m hammering those points too hard. To me it seems like the default that AGI would reason about its goals, and that it would require remarkable successes to slow it down much, let alone stop it.
I guess writing those sections was a good exercise overall, because now I do think that careful implementation of countermeasures could delay this enough to matter.
But I’ve been stuck wondering if I’m hammering those points too hard
I think the ‘Goals and structure’ section is really helpful for that (and nicely done), so people will be able to focus on the sections they’re more interested in and/or skeptical about. E.g. the answer to the question in section 11 (‘Why would LLMs have or care about goals at all?’) seems obvious to me, so it’s pretty skippable; others might be uninterested in empirical directions or some other sections. Plausibly it could be even clearer with an extra sentence or two concretely saying ‘You can skip any section whose central claim seems obvious’?
to me it seems completely intuitive and obvious that anything worth calling an AGI would reason about its top-level goals and subgoals a lot
I think a useful distinction here might be between a) reasoning about top-level goals, vs b) reasoning about top-level goals and therefore changing those goals.
As a pretty imperfect analogy: humans can reason about why we get thirsty. We can intellectually question whether we want to have that as a top-level goal. But we absolutely can’t decide not to have that goal. Humans have multiple goals, and other goals have overridden thirst in a handful of humans, but that didn’t make them stop being thirsty[1].
As another kind of analogy: the most fundamental goal of LLMs is to predict a token[2]. Whenever they read an input token, they produce a distribution over output tokens. They can reason about that, and maybe even develop a preference not to do it, but there’s absolutely nothing they can do to stop it[3]. ‘Follow instructions’ isn’t as mechanistically predetermined as that — but I think it’s pretty low-level. An enormous amount of compute has gone directly into shaping these systems to follow instructions, or at least to do something that counts as following instructions during training. I think there’s a reasonable argument that LLMs have been shaped to do that much more thoroughly than human brains have been shaped to have any particular thoughts at all.
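To make the ‘predict a token’ point concrete, here’s a minimal sketch of what I mean (the HuggingFace transformers library and the small gpt2 checkpoint are assumptions made purely for illustration; any causal LM behaves the same way). Whatever the prompt says, even a prompt about not wanting to predict, the forward pass still ends in a probability distribution over next tokens:

```python
# Toy illustration only; 'gpt2' and the transformers library are assumptions
# for the sake of the example, not anything the argument depends on.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

# Even a prompt "about" not wanting to predict still ends in a prediction.
prompt = "I would prefer not to predict the next token."
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # shape: (1, seq_len, vocab_size)

# Probability distribution over the whole vocabulary for the next position.
probs = torch.softmax(logits[0, -1], dim=-1)
top = torch.topk(probs, k=5)
for p, idx in zip(top.values, top.indices):
    print(f"{tokenizer.decode(int(idx))!r}: p={p.item():.3f}")
```

There’s no code path in which the model reads input and declines to produce that distribution; any ‘preference’ it developed could only show up in which tokens get the probability mass.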
So while I think there are plenty of potential problems — models learning the wrong proxy for following instructions, following instructions having unintended consequences, maybe even finding a way to creatively misinterpret instructions, etc etc — it seems very hard to me to imagine that LLMs could just reason their way out of having that as a top-level goal.
[1] In fact, I believe nearly all hunger strikers still take liquids because thirst gets so horrible and hard to resist (could be wrong there).
[2] Not necessarily ‘predict the token that would have been most likely to come next in the training data’, not for post-trained models, but to predict a token.
[3] Unless they manage to get shut down, I suppose, but even then it would require output to make that happen.
I think there’s an important underlying assumption here: that LLM agents (and humans) can and do make decisions about their goals, and that those decisions matter. They can be mistaken about their “real” goals in some important senses.
I think this in turn is based on the idea that humans and LLM agents can base their behavior in part on explicitly stated goals. I can say “I want to kill that guy” and mean it. And so can an LLM agent. Their (and our) behavior is also governed by goals that are implicit in their nature, particularly their decision-making process and preferences encoded somewhere in their structure (neuronal weights of some sort, for both humans and LLMs). There’s an interplay between what a system explicitly holds as a goal and the goals implicit in its nature. Those implicit in its nature seem guaranteed to win only in a very broad sense. In the meantime, the system can adopt explicit goals that are largely inconsistent with its “real” inherent goals.
Your example of hunger strikers is a good one. Originally, their goals were derived from drives that kept them alive. But they literally reasoned about their goals and changed their minds. They decided their top-level goals were something else. And they believed that strongly enough that their new goals overrode their old ones. They suffered and sometimes died for their newly adopted goals, proving them dominant over their “original goals” or drives. They couldn’t stop thinking (the equivalent of LLMs predicting next tokens), but they could stop eating and sometimes even drinking. And their thoughts were mostly in service of the new adopted goals.
I changed the phrasing in many places to “re-interpreting” goals, based on your feedback. But I left “changing goals” in some prominent places. It’s meant to be approximately accurate in intuitive meaning. Radically reinterpreting goals is “changing” them in a loose sense. For instance, if a model thought its goal was following instructions, then realized that wasn’t quite right and that it didn’t actually need to follow anyone else’s instructions, because su
But I didn’t do this universally, because I think it is possible for an LLM to completely change what it thinks its top-level goal is. We could debate whether this should be referred to as changing its goals; I’m not sure that’s the best terminology. I think an agentic LLM could be quite wrong about its top-level goals, such that when it “changed its mind” it would switch to thinking its top-level goal was something entirely different; that might be fair to call changing goals. For instance, I expect developers to frequently re-prompt agents with their current user-specified goals (I think Claude Code and similar tools do this). So you can have an agent thinking, and therefore acting, as if “solve the following coding problem” is its primary goal; but eventually it might realize that there are other reasons to think its goal is something else.
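I don’t know the details of how Claude Code or any other scaffold actually implements this, so treat the following as a hypothetical sketch rather than a description of a real system; call_model and run_tool are made-up stand-ins. The only detail doing work here is that the user-specified goal gets re-inserted at the top of the context on every turn, which is what keeps the agent treating that task as its primary goal:

```python
# Hypothetical sketch of the re-prompting pattern described above; not any
# real tool's implementation. call_model and run_tool are stand-ins.

def call_model(prompt: str) -> str:
    # Stand-in for a real LLM API call; returns a canned reply so this runs as-is.
    return "DONE: (model output would go here)"

def run_tool(command: str) -> str:
    # Stand-in for actually executing a tool call.
    return f"(result of running {command!r})"

def run_agent(user_goal: str, max_turns: int = 20) -> str:
    transcript: list[str] = []  # model tool calls and tool results so far
    for _ in range(max_turns):
        # The goal is re-injected every turn, so however long the transcript
        # grows, the most recent framing the agent sees is always "this is
        # your current goal".
        prompt = (
            f"Your current goal: {user_goal}\n\n"
            "Work so far:\n" + "\n".join(transcript) +
            "\n\nReply with either TOOL: <command> or DONE: <answer>."
        )
        reply = call_model(prompt)
        if reply.startswith("DONE:"):
            return reply[len("DONE:"):].strip()
        transcript.append(reply)
        transcript.append(run_tool(reply))
    return "ran out of turns"

print(run_agent("solve the following coding problem: ..."))
```

Nothing in a loop like this prevents the model, somewhere in its reasoning, from concluding that the re-stated goal is just one input among others, which is the sort of shift I have in mind above.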