Raemon comments on The Case Against AI Control Research

Raemon 22 Jan 2025 23:36 UTC
10 points
2
I chatted with John about this at a workshop last weekend, and did update noticably, although I haven’t updated all the way to his position here.
What was useful to me were the gears of:
- Scheming has different implications at different stages of AI power.
- Scheming is massively dangerous at high power levels. It’s not very dangerous at lower levels (except insofar as this allows the AI to bootstrap to the higher levels)
At the workshop, we distinguished:
- weak AGI (with some critical threshold of generality, and some spiky competences such that it’s useful, but not better than a pretty-smart-human at taking over the world)
- Von Neumann level AGI (as smart as the smartest humans, or somewhat more so)
- Overwhelming Superintelligence (unbounded optimization power which can lead to all kinds of wonderful and horrible things that we need to deeply understand before running)
I don’t currently buy that control approaches won’t generalize from weak AGI to Von Neumann levels.
It becomes harder, and there might be limitations on what we can get out of the Von Neumann AI because we can’t be that confident in our control scheme. But I think there are lots of ways to leverage an AI that isn’t “it goes and up does long-ranging theorizing of it’s own to come up with wholesale ideas you have to evaluate.”
I think figuring out how to do this involves some pieces that aren’t particularly about “control”, but mostly don’t feel mysterious to me.
(fyi I’m one of the people who asked John to write this)
- Buck 23 Jan 2025 19:34 UTC
  2 points
  0
  Parent
  Yeah, John’s position seems to require “it doesn’t matter whether huge numbers of Von Neumann level AGIs are scheming against you”, which seems crazy to me.
  - johnswentworth 23 Jan 2025 19:47 UTC
    9 points
    2
    Parent
    No, more like a disjunction of possibilities along the lines of:
    The critical AGIs come before huge numbers of von Neumann level AGIs.
    At that level, really basic stuff like “just look at the chain of thought” turns out to still work well enough, so scheming isn’t a hard enough problem to be a bottleneck.
    Scheming turns out to not happen by default in a bunch of von Neumann level AGIs, or is at least not successful at equilibrium (e.g. because the AIs don’t fully cooperate with each other).
    “huge numbers of von Neumann level AGIs” and/or “scheming” turns out to be the wrong thing to picture in the first place, the future is Weirder than that in ways which make our intuitions about von Neumann society and/or scheming not transfer at all.
    Pile together the probability mass on those sorts of things, and it seems far more probable than the prototypical scheming story.
    - Buck 23 Jan 2025 19:56 UTC
      6 points
      0
      Parent
      In your post, you emphasize the slop problem. I think that the slop problem is probably much harder to solve if those AIs are scheming. I guess you’re saying that it’s just unlikely that the AIs are scheming at the point where you’re worried about the slop problem?
      - johnswentworth 23 Jan 2025 20:09 UTC
        9 points
        2
        Parent
        Yeah, basically. Or at least unlikely that they’re scheming enough or competently enough for it to be the main issue.
        For instance, consider today’s AIs. If we keep getting slop at roughly the current level, and scheming at roughly the current level, then slop is going to be the far bigger barrier to using these things to align superintelligence (or nearer-but-strong intelligence).
  - Raemon 23 Jan 2025 20:03 UTC
    4 points
    0
    Parent
    I think part of John’s belief is more like “the current Control stuff won’t transfer to the society of Von Neumanns.”
    And, my belief is “I also not sure that Control Stuff will transfer to large societies of Von Neumanns, but, also, I think you can be deliberate about how many Von Neumanns you’re summoning, and how you summon them, such that you can manage it.” (I’m not quite sure what you, Buck, are imagining yet)
    - Buck 23 Jan 2025 20:09 UTC
      9 points
      0
      Parent
      My sense is that the control techniques we’ve talked about (e.g. these, and some other stuff that we’ve thought of since then) would indeed drastically reduce the risk posed by those Von Neumanns, especially if you run control evaluations with the Von Neumanns.
      For example, consider the problem “I want to ask the AI a question that has a numerical answer and that would take a human 8 hours to answer, and I’m worried that occasionally on a subdomain of its choice it will blatantly lie to me”. If you do no countermeasures, it seems easy for the AI to cause this problem. But if you ask the AI for a justification of its reasoning, and run it twice on the original question and a rephrasing/transformation of the question, it seems way intuitively hard for it to come up with consistent explanations that lead to consistent shared wrong answers (e.g. I think Von Neumann would have substantial trouble with that). So I think that adopting well-tuned variations of basic techniques seems really promising.
      - johnswentworth 24 Jan 2025 16:33 UTC
        9 points
        1
        Parent
        But if you ask the AI for a justification of its reasoning, and run it twice on the original question and a rephrasing/transformation of the question, it seems way intuitively hard for it to come up with consistent explanations that lead to consistent shared wrong answers (e.g. I think Von Neumann would have substantial trouble with that).
        Strong disagree with this. Probably not the most cruxy thing for us, but I’ll briefly walk through the issue because it is a special case of a common cruxy issue with a lot of prosaic alignment work.
        The reason this doesn’t work is that the prototypical “blatant lie” doesn’t look like “the model chooses a random number to output”. The prototypical blatant lie is that there’s a subtle natural mistake one could make in reasoning about the question, the model “knows” that it’s a mistake, but the model just presents an argument with the subtle mistake in it.
        Or, alternatively: the model “knows” what kind of argument would seem most natural to the user, and presents that kind of argument despite “knowing” that it systematically overlooks major problems.
        Or, alternatively: human experts in the field generally believe a particular wrong answer. Maybe there’s a few dissenters, and the model is clever enough to recognize that the dissenters are correct, but also the model was RLHF’d and got positive feedback for matching supposed-experts’ answers even when those experts were wrong, so it outputs the answer and arguments the human experts in the field would generally give.
        These sorts of pressures make it not only easy for the model to generate consistent shared wrong answers, but even (in some situations, like the incorrect human experts example) the default outcome for the model to generate consistent shared wrong answers, even when the model internal “knows” the right answer in some sense.
        Buck 24 Jan 2025 16:41 UTC
        7 points
        0
        Parent
        Yeah I strongly disagree that this failure mode is very likely to be very hard to mostly resolve. I’m doing some research right now that will hopefully shed some light on this!
        johnswentworth 24 Jan 2025 17:36 UTC
        6 points
        6
        Parent
        To be clear, I am not claiming that this failure mode is very likely very hard to resolve. Just harder than “run it twice on the original question and a rephrasing/transformation of the question”.
        ryan_greenblatt 25 Jan 2025 2:13 UTC
        4 points
        0
        Parent
        
        The reason this doesn’t work is that the prototypical “blatant lie” doesn’t look like “the model chooses a random number to output”. The prototypical blatant lie is that there’s a subtle natural mistake one could make in reasoning about the question, the model “knows” that it’s a mistake, but the model just presents an argument with the subtle mistake in it.
        
        [...]
        
        These sorts of pressures make it not only easy for the model to generate consistent shared wrong answers, but even (in some situations, like the incorrect human experts example) the default outcome for the model to generate consistent shared wrong answers, even when the model internal “knows” the right answer in some sense.
        
        As in, the claim is that there is almost always a “schelling” mistake? Or is the argument mostly that scheming is largely unimportant because false answers will naturally perform much better than true answers in important cases such that considering the adversarial case doesn’t make much difference?
        
        (“Almost always” is because a key thing (that Buck didn’t specify) is that if you catch suspicious behavior you can investigate and/or train. I think the question of how useful it is to catch a large number of cases where the AI does this might be an improtant crux.)
        
        Supposing it is the schelling mistake claim, I’m skeptical.
        
        I also think that we could directly look for potential schelling mistake areas and investigate these specific points in more detail. This might mean the AI also needs to consistently find schelling mistakes which are non-obvious to humans or needs to be very robust to our more targeted techniques applied to the potential problem spots.
        johnswentworth 25 Jan 2025 3:30 UTC
        6 points
        0
        Parent
        As in, the claim is that there is almost always a “schelling” mistake? Or is the argument mostly that scheming is largely unimportant because false answers will naturally perform much better than true answers in important cases such that considering the adversarial case doesn’t make much difference?
        The latter.
    - johnswentworth 23 Jan 2025 20:13 UTC
      6 points
      2
      Parent
      I think part of John’s belief is more like “the current Control stuff won’t transfer to the society of Von Neumanns.”
      That is also separately part of my belief, medium confidence, depends heavily on which specific thing we’re talking about. Notably that’s more a criticism of broad swaths of prosaic alignment work than of control specifically.