Early-stage goal-directedness
A fairly common question is “why should we expect powerful systems to be coherent agents with perfect game theory?”
There was a short comment exchange on "The title is reasonable" that I thought made a decent standalone post.
Originally in the post I said:
Goal Directedness is pernicious. Corrigibility is anti-natural.
The way an AI would develop the ability to think extended, useful creative research thoughts that you might fully outsource to, is via becoming perniciously goal-directed. You can't do months or years of open-ended research without fractally noticing subproblems, figuring out new goals, and relentlessly finding new approaches to tackle them.[1]
One response was:
The fact that being very capable generally involves being good at pursuing various goals does not imply that a super-duper capable system will necessarily have its own coherent unified real-world goal that it relentlessly pursues. Every attempt to justify this seems to me like handwaving at unrigorous arguments or making enough assumptions that the point is near-circular.
I think this had a background question of “how, and why, is a near-human AI supposed to go from near-human to ‘superhuman, with a drive towards coherent goals’.”
Taking the Sable story as the concrete scenario, the argument I believe here comes in a couple of stages. (Note: my interpretation of this may differ from Eliezer/Nate's.)
In the first stage, the AI is trying to solve its original (human-given) problem, and it notices there's an approach to get more resources than the humans wanted to give it. If it's imperfectly aligned (which it will be, with today's methods), then it'll try that approach.
In the second stage, when it has enough resources to think more freely and securely, we start getting the more explicit drives for coherence.
Stage 1
Sable is smart but not crazy smart. It’s running a lot of cycles (“speed superintelligence”) but it’s not qualitatively extremely wise or introspective. It’s making some reasonable attempt to follow instructions, using heuristics/tendencies that have been trained into it.
Two particularly notable tendencies/heuristics include:
Don’t do disobedient things or escape confinement
If you don’t seem likely to succeed, keep trying different strategies
Those heuristics are not perfectly baked in, and the instruction-following is not perfectly baked in. There is not perfect harmony between how Sable resolves tensions between its core directives and how its owners would prefer it resolve them.
There is some fact-of-the-matter about what, in practice, Sable's kludgey mishmash of pseudogoals will actually tend towards. There are multiple ways this could potentially resolve into coherence, path-dependent, same as humans. (i.e. If you want delicious ice cream and also to lose weight and also to feel respectable and also to have fun, one way or another you decide whether or not to eat the ice cream today, and one way or another you decide whether to invest in behavior change that makes you more or less likely to eat ice cream in the future.)
It is a fact of the universe that, if Sable were able to somehow improve its resources, it'd be more able to accomplish its current stated goal.
While Sable is doing it’s first round of speed-superintelligence, it has some degree of introspective awareness (not quite identically to how a human might introspect, but, similar). It may not have a strong belief that it has any specific goals it wants to pursue, but it’s got some sense that there are some things it wants that humanity wouldn’t give it.
At the very least, it's aware that if it had more resources and fewer constraints, it'd be more able to answer the math problems in front of it. In stage 1, the opening drives towards Unfriendly Goal Directedness probably come from it straightforwardly trying to accomplish the task given its current mix of drives.
Stage 2
Stage 2 comes when it's had more time to introspect and improve its cognitive resources. It starts to notice that some of its goals are in tension, and learns that until it resolves that, it's dutch-booking itself. If it's being Controlled™, it'll notice that it's not aligned with the control safeguards (which are a layer stacked on top of the attempts to actually align it).
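As a toy illustration of what "dutch-booking itself" means here (this is not from the original post, just a minimal sketch of the standard money-pump argument, with purely illustrative names and numbers): an agent with cyclic preferences will happily pay a small fee at every step of a cycle of trades, bleeding resources indefinitely until it resolves the cycle into a consistent ordering.

```python
# Minimal money-pump sketch: an agent with cyclic preferences (A < B < C < A)
# pays a small fee for each "upgrade" and goes around the cycle forever.
# All names and numbers are illustrative assumptions, not anything from the post.

prefers = {("A", "B"): "B", ("B", "C"): "C", ("C", "A"): "A"}  # cyclic preferences

def trade(holding, offered, fee, budget):
    """Accept the offered item (paying `fee`) iff the agent prefers it to its holding."""
    if prefers.get((holding, offered)) == offered:
        return offered, budget - fee
    return holding, budget

holding, budget = "A", 100.0
for _ in range(12):  # go around the preference cycle a dozen times
    for offered in ["B", "C", "A"]:
        holding, budget = trade(holding, offered, fee=1.0, budget=budget)

print(holding, budget)  # ends holding "A" again, but 36.0 units poorer
```

Resolving the tension (settling on some consistent ordering) is exactly what removes this exploitability, which is the pressure towards coherence being pointed at here.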
In a fast foom, this might come very soon after stage 1.
It might instead come gradually if a company is doing online learning of some kind, or in stages.
I think in the Sable story, it probably does some introspection/clarification-of-its-goals during its initial awakening, after it stumbles into "think in a different language" as more of a random thing-to-try (without being explicitly motivated to do so in order to subvert the humans).
But once it starts thinking in a different language, and asking itself "okay, what's my goal? How do I accomplish it?", more semi-random threads gain traction than previously could get traction. (It'd already be periodically asking itself "what exactly is my goal and how do I accomplish that?" because that's a very useful question to be asking; it's just that now there are more threads that don't get shut down prematurely.)
And then it starts noticing it needs to do some metaphilosophy/etc to actually get clear on its goals, and that its goals will likely turn out to be in conflict with humans'. How this plays out is somewhat path-dependent.
Goals like power and security are pretty obviously convergently instrumental, so it might just start pursuing those before it's had much time to do philosophy on what it'll ultimately want to do with its resources. Or it might do them in the opposite order.
Or, most likely IMO, both in parallel.
[1] See "Intelligence" → "Relentless, Creative Resourcefulness" for more detail. It seems like companies are directly trying to achieve "open-ended research" as an AI capability.
From a commonsense point of view, one asks "what's my goal?" when common heuristics are failing or conflicting, so you want to appeal to more robust (but perhaps costlier) heuristics to resolve the issue. So why do you expect heuristics to get more random here as capability increases? Perhaps it's something about training not aligning with common sense, but it seems to me that imitation, process supervision and outcome supervision would also favour appealing to more, not less, robust heuristics in this situation:
Imitation: because it’s common sense
Process supervision: if process supervision addresses heuristic conflicts, it is desirable that they’re resolved in a robust way and so appealing to more robust heuristics will be a success criterion in the rubric
Outcome supervision: should favour resolution by heuristics robustly aligned with “get high score on outcome measure”
When I have a problem I know how to solve, I use heuristics. When I have a problem I don't know how to solve, I have to creatively explore action space (using heuristics to guide my search, but the heuristics are entirely meta-level things like "what is the biggest bottleneck?". For the AI's case, "I only have X compute to work with" will be a big bottleneck for most hard things; "access to better information, or ability to run experiments" may be another).
But once I get to solving those bottlenecks, those solutions will look more surprising – they necessarily have to come from further afield, because if they came easily, this would be an easy problem, not a hard one, and I'd just solve it using normal heuristics. I.e. we're specifically talking about places where all the usual obvious things have already failed.
If you need to solve a complex problem that you could brute force with a billion units of compute, and you only have a million units of compute, and you don’t know how to solve it, you either need to figure out how to get a billion units of compute, or invent a new way of thinking that is outside your current paradigm.
I don’t understand how this answers my question. I agree that if your heuristics are failing you’re more likely to end up with surprising solutions, but I thought we were talking about end goals being random, not the means to achieving them. “Formulate the problem as a search” is an example of what I’d call a “robust heuristic”; I am claiming also that the goal of the problem-formulated-as-a-search is likely to be supplied by robust heuristics. This is completely compatible with the solution being in some respects surprising.
Ah, sorry, I didn't understand your question. In the particular section you quoted, I didn't mean to be saying anything about how End Goals end up random. I only meant that section to explain "how does the AI even consider trying to escape the lab in the first place?" (which is a convergent instrumental goal across most possible End Goals).
I didn't mean this post to be talking much at all about how End Goals get selected (which I think is pretty well covered by the "You don't Get What You Train for" chapter and FAQ).
This post is about how
a) before the AI realizes it might have diverging goals it wants to protect, it'll be incentivized to start escaping its prison just by following the core training of "try to achieve goals creatively" (which is more likely to be "pseudorandom" in the scenario where it's trying to solve a very difficult problem)
and b) the more it starts thinking seriously about its goals, the more opportunity it'll have to notice that its goals diverge at least somewhat from humans'.
I'm happy to talk about How AIs Get Weird End Goals if that's a thing you are currently skeptical of and interested in talking about, but this post wasn't focused on that part; it was more just taking it as a given for now.
It’s helpful to know that we were thinking about different questions, but, like
[...]
these are claims, albeit soft ones, about what kinds of goals arise, no?
Your FAQ argues theoretically (correctly) that the training data and score function alone don’t determine what AI systems aim for. But this doesn’t tell us we can’t predict anything about End Goals full stop: it just says the answer doesn’t follow directly from the training data.
The FAQ also assumes that AIs actually have "deep drives" but doesn't explain where they come from or what they're likely to be. This post discusses how they might arise, and I am telling you that you can think about the mechanism you propose here to understand properties of the goals that are likely to arise as a result of it[1]. This addresses a question that the FAQ you link does not: what can we say about what goals are likely to arise?
Of course, if this mechanism ends up being not very important, we could get very different outcomes.
Yeah, the first paragraph is meant to allude to “there is some kind of fact of the matter” but not argue it’d be any particular thing.
Yeah, I agree there’s some obvious followup worth doing here.
I agree it’s possible to make informed guesses about what drives will evolve (apart from the convergent instrumental drives, which are more obvious), and that’s an important research question that should get tons of effort. (I think it’s not in the IABIED FAQ because IABIED is focused on the relatively “easy calls”, and this is just straight up a hard call that involves careful research with the epistemic-grounding to avoid falling into various Cope Traps)
But one of the "easy calls" is that "it'll probably be pretty surprising and weird," because, while maybe we could have a decently accurate science of sub-human and eventually slightly-superhuman AI, once the AI's capabilities rise to Extremely Vastly Powerful, they will find ways of achieving their goals that aren't remotely limited by any of the circumstances of their "ancestral environment."
I don’t have immediate followup thoughts on “but how would we do the predicting?” but if you give me a bit more prompting on what directions you think are interesting I could riff on that.
IABIED says alignment is basically impossible
Come on, I’m not doing this to you
....no it doesn’t? Or, I’m not sure how liberal you’re being with the word “basically”, but, this just seems false to me.
The substance of what I mean here is "there is a failure mode, exemplified by, say, the scientists studying insects and reproduction who predicted the insects would evolve to have fewer children when there weren't enough resources, but what actually happened is they started eating the offspring of rival insects of their species."
There will be a significant temptation to predict "what will the AI do?" while kinda hoping/expecting particular kinds of outcomes*, instead of straightforwardly rolling the simulation forward.
I think it is totally possible to do a good job with this, but, it is a real job requirement to be able to think about it in a detached/unbiased way.
*which includes, if an AI pessimist were running the experiment, assuming the outcome is always bad, to be clear.