Yeah, the first paragraph is meant to allude to “there is some kind of fact of the matter” but not argue it’d be any particular thing.
This post discusses how they might arise, and I am telling you that you can think about the mechanism you propose here to understand properties of the goals that are likely to arise as a result of it[1]. This addresses a question that the FAQ you link does not: what can we say about what goals are likely to arise?
Yeah, I agree there’s some obvious followup worth doing here.
I agree it’s possible to make informed guesses about what drives will evolve (apart from the convergent instrumental drives, which are more obvious), and that’s an important research question that should get tons of effort. (I think it’s not in the IABIED FAQ because IABIED is focused on the relatively “easy calls”, and this is just a straight-up hard call that involves careful research, with the epistemic grounding to avoid falling into various Cope Traps.)
But, one of the “easy calls” is that “it’ll probably be pretty surprising and weird.” Because, while maybe we could have a decently accurate science of sub-human and eventually slightly-superhuman AI, once the AI’s capabilities rise to Extremely Vastly Powerful levels, they will find ways of achieving their goals that aren’t remotely limited by any of the circumstances of their ‘ancestral environment.’
I don’t have immediate followup thoughts on “but how would we do the predicting?” but if you give me a bit more prompting on what directions you think are interesting I could riff on that.
IABIED says alignment is basically impossible
....no it doesn’t? Or, I’m not sure how liberal you’re being with the word “basically”, but, this just seems false to me.
Cope Traps
Come on, I’m not doing this to you
The substance of what I mean here is “there is a failure mode, exemplified by, say, the scientists studying insects and reproduction who predicted the insects would evolve to have fewer children when there weren’t enough resources, but what actually happened is they started eating the offspring of rival insects of their species.”
There will be a significant temptation to predict “what will the AI do?” while kinda hoping/expecting particular kinds of outcomes*, instead of straightforwardly rolling the simulation forward.
I think it is totally possible to do a good job with this, but, it is a real job requirement to be able to think about it in a detached/unbiased way.
*which includes, if an AI pessimist were running the experiment, assuming the outcome is always bad, to be clear.