Extremely underrated post, I’m sorry I only skimmed it when it came out.
I found 3a,b,c to be strong and well written, a good representation of my view.
In contrast, I found 3d to be a weak argument that I didn’t identify with. In particular, I don’t think internal conflicts are a good way to explain the source of goal misgeneralization. To me it’s better described as just overfitting or misgeneralization.[1] Edge cases in goals are clearly going to be explored by a stepping-back process if initial attempts fail, and particularly if attempted pathways continue to fail. Thinking of the AI as needing to resolve conflicting values, on the other hand, seems to me to be anthropomorphizing in a way that doesn’t transfer to most mind designs.
You also used the word coherent in a way that I didn’t understand.
Human intelligence seems easily useful enough to be a major research accelerator if it can be produced cheaply by AI
I want to flag this as an assumption that isn’t obvious. If this were true for the problems we care about, we could solve them by employing a lot of humans.
humans provides a pretty strong intuitive counterexample
It’s a good observation that humans seem better at stepping back within low-level tasks than at high-level life purposes. For example, I got stuck on a default path of finishing a neuroscience degree, even though, if I had reflected properly, I would have realised a couple of years earlier that it was useless for achieving my goals. I got got by sunk costs and normality.
However, I think this counterexample isn’t as strong as you think it is. Firstly because it’s incredibly common for people to break out of a default path. And secondly because stepping back is usually preceded by some kind of failure to achieve the goal using a particular approach. Such failures occur often at small scales. They occur infrequently in most people’s high-level life plans, because such plans are fairly easy and don’t often raise flags that indicate potential failure. We want particularly difficult work out of an AI. This implies frequent total failure, and hence frequent high-level stepping back. If it’s doing alignment research, this is particularly true.
[1] Like for reasons given in section 4 of the misalignment and catastrophe doc.
such plans are fairly easy and don’t often raise flags that indicate potential failure
Hmm. This is a good point, and I agree that it significantly weakens the analogy.
I was originally going to counter-argue and claim something like “sure, total failure forces you to step back far, but it doesn’t mean you have to step back literally all the way”. Then I tried to back that up with an example, such as “when I was doing alignment research, I encountered total failure that forced me to abandon large chunks of my planning stack, but this never caused me to ‘spill upward’ to questioning whether or not I should be doing alignment research at all”. But uh, then I realized that isn’t actually true :/
We want particularly difficult work out of an AI.
On consideration, yup this obviously matters. The thing that causes you to step back from a goal is that goal being a bad way to accomplish its supergoal, aka “too difficult”. Can’t believe I missed this, thanks for pointing it out.
I don’t think this changes the picture too much, besides increasing my estimate of how much optimization we’ll have to do to catch and prevent value-reflection. But a lot of muddy half-ideas came out of this that I’m interested in chewing on.
I’d be curious about why it isn’t changing the picture quite a lot, maybe after you’ve chewed on the ideas. From my perspective it makes the entire non-reflective-AI-via-training pathway not worth pursuing, at least for large-scale thinking.
It doesn’t change the picture a lot because the proposal for preventing misaligned goals from arising via this mechanism was to try and get control over when the AI does/doesn’t step back, in order to allow it in the capability-critical cases but disallow it in the dangerous cases. This argument means you’ll have more attempts at dangerous stepping back that you have to catch, but doesn’t break the strategy.
The strategy does break if, when we do this blocking, the AI piles on more and more effort trying to unblock it until it either succeeds or is rendered useless for anything else. There being more baseline attempts probably raises the chance of that, or of some other problem that makes prolonged censorship while maintaining capabilities impossible. But again, that just makes it harder; it doesn’t break it.
I don’t think you need to have that pile-on property to be useful. Consider MTTR(n), the mean time an LLM takes to realize it has made a mistake, parameterized by how far up the stack the mistake was made. By default you’ll want short MTTR for all n. But if you can get your MTTR short enough for small n, you can afford to have long MTTR for large n. Basically, this agent tends to get stuck/rabbit-holed/nerd-sniped, but only when the mistake that caused it to get stuck was made a long time ago.
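To make the metric concrete, here’s a minimal sketch of how MTTR(n) could be estimated, assuming you can log, for each recovered mistake, how far up the goal stack it was made and how long the agent took to notice it (the event format and units are placeholders, not a real logging setup):

```python
from collections import defaultdict

def mttr_by_depth(recovery_events):
    """Estimate MTTR(n): mean time to notice a mistake, grouped by how far
    up the goal stack the mistake was made.

    recovery_events: iterable of (mistake_depth, time_to_recover) pairs,
    where time_to_recover is whatever unit gets logged (steps, tokens, seconds).
    """
    totals = defaultdict(lambda: [0.0, 0])  # depth -> [sum of times, count]
    for depth, t in recovery_events:
        totals[depth][0] += t
        totals[depth][1] += 1
    return {depth: total / count for depth, (total, count) in sorted(totals.items())}

# Hypothetical logged data: shallow mistakes get caught fast, deep ones slowly.
events = [(0, 3), (0, 5), (1, 12), (1, 20), (3, 400)]
print(mttr_by_depth(events))  # {0: 4.0, 1: 16.0, 3: 400.0}
```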
Imagine a capabilities scheme where you train MTTR using synthetic data with an explicit stack and intentionally introduced mistakes. If you’re worried about this destabilization threat model, there’s a pretty clear recommendation: only train for small-n MTTR, treat large-n MTTR as a dangerous capability, and pay some alignment tax in the form of less efficient MTTR training and occasionally rebooting your agent when it does get stuck in a non-dangerous case.
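A minimal sketch of what generating that synthetic data could look like under the small-n restriction (everything here is hypothetical: the goal-stack format, the corrupt helper, and the labels are stand-ins for whatever a real training setup would use):

```python
import random

def corrupt(subgoal):
    # Placeholder: swap in a plausible-but-wrong variant of the subgoal.
    return subgoal + " (subtly wrong variant)"

def make_training_episode(goal_stack, max_mistake_depth=1):
    """One synthetic episode: take an explicit goal stack (outermost goal
    first), inject a mistake at a randomly chosen depth measured from the
    bottom of the stack, and record that depth as the supervision target.

    Capping max_mistake_depth at a small value is the proposed alignment
    tax: the agent is only ever trained to notice shallow (small-n) mistakes.
    """
    depth_from_bottom = random.randint(0, max_mistake_depth)
    idx = len(goal_stack) - 1 - depth_from_bottom
    corrupted = list(goal_stack)
    corrupted[idx] = corrupt(corrupted[idx])
    return {
        "stack": corrupted,
        "mistake_depth": depth_from_bottom,  # the n that MTTR gets trained on
        "target": f"step back to level {idx} and revise it",
    }

episode = make_training_episode(
    ["prove the theorem", "prove lemma 2", "bound the error term"],
    max_mistake_depth=1,
)
```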
Figured I should get back to this comment, but unfortunately the chewing continues. Hoping to get a short post out soon with my all-things-considered thoughts on whether this direction has any legs.
I think the scheme you’re describing caps the agent at moderate problem-solving capabilities. Not being able to notice past mistakes is a heck of a disability.
I want to flag this as an assumption that isn’t obvious. If this were true for the problems we care about, we could solve them by employing a lot of humans.
humans provides a pretty strong intuitive counterexample
Yup not obvious. I do in fact think a lot more humans would be helpful. But I also agree that my mental picture of “transformative human level research assistant” relies heavily on serial speedup, and I can’t immediately picture a version that feels similarly transformative without speedup. Maybe evhub or Ethan Perez or one of the folks running a thousand research threads at once would disagree.
I admit I think this is kind of a crux, but let me focus on this statement:
I want to flag this as an assumption that isn’t obvious. If this were true for the problems we care about, we could solve them by employing a lot of humans.
One big difference between a human-level AI and a real human is coordination costs: even without advanced decision theories like FDT/UDT/LDT, the ability to have millions of copies of an AI makes it possible for them all to have similar values, and divergences between them are more controllable in a virtual environment than in a physical one.
But my more substantive claim is that a lot of how progress gets made in the real world comes down to population growth: it allows for more complicated economies, more ability to specialize without losing essential skills, and simply more data for dealing with reality. Alignment, including strong alignment, is no different here.
Indeed, I’d argue that a lot more alignment progress happened in the 2022-2024 period than in the 2005-2015 period, and while I don’t credit it all to population growth among alignment researchers, I do think a reasonably significant amount of the progress happened because we got more people into alignment.
Intelligence/IQ is always good, but not a dealbreaker as long as you can substitute it with a larger population.
See these quotes from Carl Shulman here for why:
Yeah. In science the association with things like scientific output, prizes, things like that, there’s a strong correlation and it seems like an exponential effect. It’s not a binary drop-off. There would be levels at which people cannot learn the relevant fields, they can’t keep the skills in mind faster than they forget them. It’s not a divide where there’s Einstein and the group that is 10 times as populous as that just can’t do it. Or the group that’s 100 times as populous as that suddenly can’t do it. The ability to do the things earlier with less evidence and such falls off at a faster rate in Mathematics and theoretical Physics and such than in most fields.
Yes, people would have discovered general relativity just from the overwhelming data and other people would have done it after Einstein.
The link for these quotes is here: https://www.lesswrong.com/posts/BdPjLDG3PBjZLd5QY/carl-shulman-on-dwarkesh-podcast-june-2023#Can_we_detect_deception_
Intelligence/IQ is always good, but not a dealbreaker as long as you can substitute it with a larger population.
IMO this is pretty obviously wrong. There are some kinds of problem solving that scale poorly with population, just as there are some computations that scale poorly with parallelisation. E.g. Project Euler problems.
When I said “problems we care about”, I was referring to a cluster of problems that very strongly appear to not scale well with population. Maybe this is an intuitive picture of the cluster of problems I’m referring to.
When I said “problems we care about”, I was referring to a cluster of problems that very strongly appear to not scale well with population. Maybe this is an intuitive picture of the cluster of problems I’m referring to.
I think the problem identified here is in large part a demand problem: lots of AI people only wanted AI capabilities and didn’t care about AI interpretability at all, so once the scaling happened, a lot of the focus went purely to AI scaling.
(Which is an interesting example of Goodhart’s law in action, perhaps.)
See here: https://www.lesswrong.com/posts/gXinMpNJcXXgSTEpn/ai-craftsmanship#Qm8Kg7PjZoPTyxrr6
IMO this is pretty obviously wrong. There are some kinds of problem solving that scale poorly with population, just as there are some computations that scale poorly with parallelisation.
I definitely agree that there exist such problems where the scaling with population is pretty bad, but I’ll give two responses here:
1. The difference between a human-level AI and an actual human is the ability to coordinate and share ontologies better between millions of instances, so the common problems that arise when trying to factor problems into subproblems are greatly reduced.
2. While there are serial bottlenecks to a lot of problem solving in the real world, such that they prevent hyperfast outcomes, I don’t think serial bottlenecks are the dominating factor, because the parallelizable work, like good execution, is often far more valuable than the inherently serial work, like deep/original ideas (a rough illustration of this tradeoff is sketched below).
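To put rough numbers on the serial-vs-parallel tradeoff we’re both gesturing at, here’s the standard Amdahl’s-law arithmetic (an illustration with made-up fractions, not a claim about actual research workflows):

```python
def amdahl_speedup(serial_fraction, n_workers):
    """Amdahl's law: overall speedup when a fixed fraction of the work
    is inherently serial and the rest parallelizes perfectly."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_workers)

# Made-up split: 10% inherently serial "deep ideas", 90% parallelizable "execution".
for n in (1, 10, 100, 1000):
    print(n, round(amdahl_speedup(0.10, n), 1))
# -> 1: 1.0, 10: 5.3, 100: 9.2, 1000: 9.9 (saturates near 1/0.10 = 10x)
```

On these made-up numbers, a large population buys close to a 10x speedup before the serial fraction caps it; how big that serial fraction actually is for the research we care about is roughly where we disagree.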