I don’t think you can train an actress to simulate me, successfully, without her going dangerous. I think that’s over the threshold for where a mind starts reflecting on itself and pulling itself together.
I would not be surprised if the Eliezer simulators do go dangerous by default as you say.
But this is something we can study and work to avoid (which I view as my main job).
My point is just that preventing the early Eliezers from “going dangerous” (by which I mean from “faking alignment”) is the bulk of the problem humans need to address (and insofar as we succeed, the hope is that future Eliezer sims will prevent their Eliezer successors from going dangerous too).
I’ll discuss why I’m optimistic about the tractability of this problem in future posts.
So the “IQ 60 people controlling IQ 80 people controlling IQ 100 people controlling IQ 120 people controlling IQ 140 people until they’re genuinely in charge and genuinely getting honest reports and genuinely getting great results in their control of a government” theory of alignment?
I’d replace “controlling” with “creating,” but with that change, yes, that’s what I’m proposing.
So if it’s difficult to get amazing trustworthy work out of a machine actress playing an Eliezer-level intelligence doing a thousand years worth of thinking, your proposal to have AIs do our AI alignment homework fails on the first step, it sounds like?
I do not think that the initial humans at the start of the chain can “control” the Eliezers doing thousands of years of work in this manner (if you use control to mean “restrict the options of an AI system in such a way that it is incapable of acting in an unsafe manner”)
That’s because each step in the chain requires trust.
For the N-month Eliezer to scale to a 4N-month Eliezer, it first controls the 2N-month Eliezer while it does 2N-month tasks, but it trusts the 2N-month Eliezer to create the 4N-month Eliezer.
So the control property is not maintained. But my argument is that the trust property is. The humans at the start can indeed trust the Eliezers at the end to do thousands of years of useful work—even though the Eliezers at the end are fully capable of doing something else instead.
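To make the shape of that chain concrete, here is a minimal sketch of the control-then-trust bootstrapping loop I have in mind. Everything in it (the class, the horizon numbers, the audit step) is an illustrative placeholder, not a claim about how the training or oversight would actually be implemented:

```python
# A minimal, purely illustrative sketch of the control-then-trust chain.
# Every name, number, and method here is a hypothetical placeholder,
# not a description of any real training or oversight system.

class Eliezer:
    """Stand-in for an AI that can do coherent work over some time horizon."""

    def __init__(self, horizon_months: int):
        self.horizon_months = horizon_months

    def train_successor(self) -> "Eliezer":
        # The N-month Eliezer creates a 2N-month Eliezer.
        return Eliezer(self.horizon_months * 2)

    def audit_on_short_tasks(self, successor: "Eliezer") -> bool:
        # Control phase: run the successor only on tasks short enough for the
        # current Eliezer to check, watching for signs of alignment faking.
        # This stub always passes; in reality this is the hard empirical work.
        return True


def bootstrap(start: Eliezer, target_months: int) -> Eliezer:
    """Humans trust `start`; each link controls, audits, and then trusts the next."""
    current = start
    while current.horizon_months < target_months:
        successor = current.train_successor()
        if not current.audit_on_short_tasks(successor):
            raise RuntimeError("Successor failed audits; stop scaling the chain.")
        # From here on the relationship is trust, not control: the successor
        # could do something else, but the audits are the reason to expect it won't.
        current = successor
    return current


if __name__ == "__main__":
    # e.g. scale from a 1-month Eliezer toward "thousands of years" of work.
    final = bootstrap(Eliezer(horizon_months=1), target_months=12_000)
    print(final.horizon_months)
```

The only property the sketch tries to carry down the chain is the audit-backed trust at each step; nothing in it lets the humans at the start control the final link, which is exactly the concession above.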