Raemon comments on Early stage goal-directednesss

Raemon 22 Oct 2025 2:32 UTC
3 points
0
Ah, sorry I didn’t understand your question. In the particular section you quoted, I didn’t mean to be saying anything about how End Goals end up random. I only meant that to explain “how does the AI even consider trying to escape the lab in the first place?” (which is a convergent instrumental goal between most possible End Goals)
I didn’t mean this post to really be talking much at all about how the selection of End Goals (which I think is pretty well covered by the “You don’t Get What You Train for” chapter and FAQ).
This post is about how
a) before the AI realizes it might have diverging goals it wants to protect, it’ll be incentivized to start escaping it’s prison just by falling the core training of “try to achieve goals creatively” (which is more likely to be “pseudorandom” in the scenario where it’s trying to solve a very difficult problem)
and b) the more it starts thinking seriously about it’s goals, the more opportunity it’ll have to notice that it’s goals diverge at least somewhat from humans.
I’m happy to talk about How AIs Get Weird End Goals if that’s a thing you are currently skeptical of an interested in talking about, but, this post wasn’t focused on that part, more just taking it as a given for now.
- David Johnston 22 Oct 2025 3:34 UTC
  3 points
  0
  Parent
  It’s helpful to know that we were thinking about different questions, but, like
  
  There is some fact-of-the-matter about what, in practice, Sable’s kludgey mishmash of pseudogoals will actually tend towards. There are multiple ways this could potentially resolve into coherence, path dependent, same as humans.
  
  [...]
  
  It may not have a strong belief that it has any specific goals it wants to pursue, but it’s got some sense that there are some things it wants that humanity wouldn’t give it.
  
  these are claims, albeit soft ones, about what kinds of goals arise, no?
  
  Your FAQ argues theoretically (correctly) that the training data and score function alone don’t determine what AI systems aim for. But this doesn’t tell us we can’t predict anything about End Goals full stop: it just says the answer doesn’t follow directly from the training data.
  
  The FAQ also assumes that AIs actually have “deep drives” but doesn’t explain where they come from or what they’re likely to be. This post discusses how they might arise and I am telling you that you can think about the mechanism you propose here to understand properties of the goals are likely to arise as a result of it^[1]. This addresses a question that the FAQ you link does not: what can we say about what goals are likely to arise?
  ↩︎
  Of course, if this mechanism ends up being not very important, we could get very different outcomes.
  What links here?
  - Raemon's comment on Early stage goal-directednesss by Raemon (22 Oct 2025 5:04 UTC; 2 points)
  - Raemon 22 Oct 2025 5:04 UTC
    2 points
    0
    Parent
    Yeah, the first paragraph is meant to allude to “there is some kind of fact of the matter” but not argue it’d be any particular thing.
    This post discusses how they might arise and I am telling you that you can think about the mechanism you propose here to understand properties of the goals are likely to arise as a result of it^[1]. This addresses a question that the FAQ you link does not: what can we say about what goals are likely to arise?
    Yeah, I agree there’s some obvious followup worth doing here.
    I agree it’s possible to make informed guesses about what drives will evolve (apart from the convergent instrumental drives, which are more obvious), and that’s an important research question that should get tons of effort. (I think it’s not in the IABIED FAQ because IABIED is focused on the relatively “easy calls”, and this is just straight up a hard call that involves careful research with the epistemic-grounding to avoid falling into various Cope Traps)
    But, one of the “easy calls” is that “it’ll probably be pretty surprising and weird.” Because, while maybe we could have a decently accurate science of sub-human and eventually slightly-superhuman AI, once the AI’s capabilities rise to Extremely Vastly Powerful, they will find ways of achieving their goals that aren’t remotely limited by any of the circumstances of their ‘ancestral environment.’”
    I don’t have immediate followup thoughts on “but how would we do the predicting?” but if you give me a bit more prompting on what directions you think are interesting I could riff on that.
    - David Johnston 22 Oct 2025 6:48 UTC
      1 point
      −4
      Parent
      
      I think it’s not in the IABIED FAQ because IABIED is focused on the relatively “easy calls”
      
      IABIED says alignment is basically impossible
      
      Cope Traps
      
      Come on, I’m not doing this to you
      - Raemon 22 Oct 2025 7:09 UTC
        4 points
        0
        Parent
        IABIED says alignment is basically impossible
        ....no it doesn’t? Or, I’m not sure how liberal you’re being with the word “basically”, but, this just seems false to me.
        Cope Traps
        Come on, I’m not doing this to you
        The substance of what I mean here is “there is failure mode, exemplified by, say, the scientists studying insects and reproduction who predicted the insects would evolve to have fewer children when there wasn’t enough resources, but what actually happened is they started eating the offspring of rival insects of their species.”
        There will be a significant temptation to predict “what will the AI do?” kinda hoping/expecting particular kinds of outcomes*, instead of straightforward rolling the simulation forward.
        I think it is totally possible to do a good job with this, but, it is a real job requirement to be able to think about it in a detached/unbiased way.
        *which includes, if an AI pessimist were running the experiment, assuming the outcome is always bad, to be clear.