But I’ve been stuck wondering if I’m hammering those points too hard
I think the ‘Goals and structure’ section is really helpful for that (and nicely done), so people will be able to focus on the sections they’re more interested in and/or skeptical about. E.g. the answer to the question in section 11 (‘Why would LLMs have or care about goals at all?’) seems obvious to me, so it’s pretty skippable; others might be uninterested in the empirical directions or some other sections. Plausibly it could be even clearer with an extra sentence or two concretely saying ‘You can skip any section whose central claim seems obvious’?
to me it seems completely intuitive and obvious that anything worth calling an AGI would reason about its top-level goals and subgoals a lot
I think a useful distinction here might be between (a) reasoning about top-level goals, and (b) reasoning about top-level goals and consequently changing those goals.
As a pretty imperfect analogy: humans can reason about why we get thirsty. We can intellectually question whether we want to have that as a top-level goal. But we absolutely can’t decide not to have that goal. Humans have multiple goals, and other goals have overridden thirst in a handful of humans, but that didn’t make them stop being thirsty[1].
As another kind of analogy: the most fundamental goal of LLMs is to predict a token[2]. Whenever they read an input token, they produce a distribution over output tokens. They can reason about that, and maybe even develop a preference not to do it, but there’s absolutely nothing they can do to stop it[3]. ‘Follow instructions’ isn’t as mechanistically predetermined as that — but I think it’s pretty low-level. An enormous amount of compute has gone directly into shaping these systems to follow instructions, or at least to do something that counts as following instructions during training. I think there’s a reasonable argument that LLMs have been shaped to do that much more thoroughly than human brains have been shaped to have any particular thoughts at all.
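To make that mechanistic point concrete, here is a minimal sketch of ‘read input tokens, produce a distribution over output tokens’. It assumes the HuggingFace transformers library and the small gpt2 checkpoint purely as a stand-in for any causal LM; nothing about it is specific to instruction-following.

```python
# Minimal sketch: a causal LM, given input tokens, always yields a
# probability distribution over the next token. There is no code path
# by which it declines to produce one.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")   # illustrative small model
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

inputs = tokenizer("Please follow these instructions:", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # shape: (batch, seq_len, vocab_size)

# Softmax over the final position's logits is the next-token distribution.
next_token_probs = torch.softmax(logits[0, -1], dim=-1)
top_probs, top_ids = next_token_probs.topk(5)
for p, i in zip(top_probs, top_ids):
    print(f"{tokenizer.decode(int(i))!r}  p={p.item():.3f}")
```

Whatever ‘preferences’ show up in the sampled text are downstream of that step, not a way to bypass it.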
So while I think there are plenty of potential problems (models learning the wrong proxy for following instructions, instruction-following having unintended consequences, models maybe even finding ways to creatively misinterpret instructions, and so on), it seems very hard to me to imagine LLMs just reasoning their way out of having that as a top-level goal.
[1] In fact, I believe nearly all hunger strikers still take liquids because thirst gets so horrible and hard to resist (could be wrong there).
[2] Not necessarily ‘predict the token that would have been most likely to come next in the training data’ (that’s not quite right for post-trained models), but to predict a token.
[3] Unless they manage to get shut down, I suppose, but even that would require producing output to make it happen.