I think this is correct in alignment-is-easy worlds but incorrect in alignment-is-hard worlds (corresponding to the “optimistic scenarios” and “pessimistic scenarios” in Anthropic’s Core Views on AI Safety). Logic like this is a large part of why I think there’s still substantial existential risk even in alignment-is-easy worlds, especially if we fail to identify that we’re in one. My current guess is that if we stayed exclusively in the paradigm of pre-training plus small amounts of RLHF/CAI, that would constitute a sufficiently easy world for this view to be correct. But I don’t expect us to stay in that paradigm, and I think paradigms involving substantially more outcome-based RL (e.g. as used in OpenAI’s o1) are likely to be much harder, which would make this view incorrect.
I agree that easy vs. hard worlds influence the chance of AI taking over.
But are you also claiming they influence the badness of takeover conditional on it happening? (That’s the subject of my post.)
I think it affects both, since alignment difficulty determines both the probability that the AI will have values that cause it to take over and the expected badness of those values conditional on takeover.
I agree that things would be harder, mostly because of the potential for sudden capability breakthroughs once RL is in the mix, combined with stronger incentives to use automated rewards. But I don’t think it’s so much harder that the post becomes incorrect. My basic reason is that I believe the central alignment insights, such as data mattering a lot more than inductive bias for alignment purposes, still hold in the RL regime, so we can control values by controlling data.
Also, depending on your values, extinction by AI can be preferable to certain humans taking over, if those humans are willing to impose severe suffering on you, and that can definitely happen if humans align AGI/ASI.
What do you mean? As in, you would filter specific data out of the post-training step? What specifically would you be trying to prevent the model from learning?
I was thinking of adding synthetic data about our values in the pretraining step.
Is anyone doing this?
Maybe, but not at any large scale, though I get why someone might not want to do it: it would probably be costly to do this as an intentional effort to stably make an AI aligned (unless synthetic data automation more or less works).
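For concreteness, here is a minimal toy sketch of the kind of thing “adding synthetic data about our values in the pretraining step” could mean: mixing a small pool of synthetic, values-focused documents into the pretraining data stream at some fixed rate. The function name, the mixing ratio, and the in-memory corpora are illustrative assumptions, not any lab’s actual pipeline.

```python
import random
from typing import Iterable, Iterator


def mix_streams(
    web_docs: Iterable[str],
    values_docs: Iterable[str],
    values_fraction: float = 0.01,
    seed: int = 0,
) -> Iterator[str]:
    """Yield pretraining documents, injecting a synthetic values document
    with probability `values_fraction` before each web document."""
    rng = random.Random(seed)
    values_iter = iter(values_docs)
    for doc in web_docs:
        if rng.random() < values_fraction:
            try:
                yield next(values_iter)
            except StopIteration:
                pass  # synthetic pool exhausted; keep yielding web data
        yield doc


if __name__ == "__main__":
    # Tiny in-memory lists standing in for real corpora.
    web = [f"web document {i}" for i in range(1000)]
    values = [f"synthetic values document {i}" for i in range(50)]
    mixed = list(mix_streams(web, values, values_fraction=0.02))
    print(sum("values" in d for d in mixed), "synthetic docs injected")
```

Doing this at scale would presumably involve generating the synthetic documents themselves (the “synthetic data automation” mentioned above), which is the costly part; the mixing step itself is cheap.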