Yeah. I agree/concede that you can explain why you can’t convince people that their own work is useless. But if you’re positing that the flinchers flinch away from valid arguments about each category of useless work, that seems surprising.
The flinches aren’t structureless particulars. Rather, they involve warping various perceptions. Those warped perceptions generalize a lot, causing other flaws to be hidden.
As a toy example, you could imagine someone attached to the idea of AI boxing. At first they say it’s impossible to break out / trick you / know about the world / whatever. Then you convince them otherwise—that the AI can do RSI internally, and superhumanly solve computer hacking / protein folding / persuasion / etc. But they are attached to AI boxing. So they warp their perception, clamping “can an AI be very superhumanly capable” to “no”. That clamping causes them to also not see the flaws in the plan “we’ll deploy our AIs in a staged manner, see how they behave, and then recall them if they behave poorly”, because they don’t think RSI is feasible, they don’t think extreme persuasion is feasible, etc.
A more real example is, say, people thinking of “structures for decision making”, e.g. constitutions. You explain that these structures are not reflectively stable. And now this person can’t understand reflective stability in general, so they don’t understand why steering vectors won’t work, or why lesioning won’t work, etc.
Another real but perhaps more controversial example: {detecting deception, retargeting the search, CoT monitoring, lesioning bad thoughts, basically anything using RL} all fail because creativity starts with illegible concomitants to legible reasoning.
(This post seems to be somewhat illegible, but if anyone wants to see more real examples of aspects of mind that people fail to remember, see https://tsvibt.blogspot.com/2023/03/the-fraught-voyage-of-aligned-novelty.html)
My impression, from conversations with many people, is that the claim which gets clamped to True is not “this research direction will/can solve alignment” but instead “my research is high value”. So when I’ve explained to someone why their current direction is utterly insufficient, they usually won’t deny some class of problems. They’ll instead tell me that the research still seems valuable even though it isn’t addressing a bottleneck, or that their research is maybe a useful part of a bigger solution which involves many other parts, or that their research is maybe a useful step toward something better.
(Though admittedly I usually try to “meet people where they’re at”, by presenting failure modes which won’t parse as weird to them. If you’re just directly explaining e.g. the dangers of internal RSI, I can see where people might instead just assume away internal RSI or some such.)
… and then if I were really putting in effort, I’d need to explain that e.g. being a useful part of a bigger solution (which they don’t know the details of) is itself a rather difficult design constraint which they have not at all done the work to satisfy. But usually I wrap up the discussion well before that point; I generally expect that at most one big takeaway from a discussion can stick, and if they already have one then I don’t want to overdo it.
the claim which gets clamped to True is not “this research direction will/can solve alignment” but instead “my research is high value”.
This agrees with something like half of my experience.
that their research is maybe a useful part of a bigger solution which involves many other parts, or that their research is maybe a useful step toward something better.
Right, I think of this response as arguing that streetlighting is a good way to do large-scale pre-paradigm science projects in general. And I have to somewhat agree with that.
Then I argue that AGI alignment is somewhat exceptional: 1. a cruel deadline, and 2. it requires understanding as-yet-unconceived aspects of Mind. Point 2 goes through things like the alienness of creativity, RSI, reflective instability, the fact that we don’t understand how values sit in a mind, etc., and that’s the part that gets warped away.
I do genuinely think that the 2024 field of AI alignment would eventually solve the real problems via collective iterative streetlighting. (I even think it would eventually solve them in a hypothetical world where all our computers disappeared, if it kept trying.) I just think it’ll take a really long time.
being a useful part of a bigger solution (which they don’t know the details of) is itself a rather difficult design constraint which they have not at all done the work to satisfy
Right, exactly. (I wrote about this in my opaque gibberish, er, philosophically precise style here: https://tsvibt.blogspot.com/2023/09/a-hermeneutic-net-for-agency.html#1-summary)