This post from Gillen and Barnett, which I struggle to find every time I search for it, is a decent overview.
In retrospect the title is ridiculous; I have trouble remembering it. I apologise.
overinvesting in concepts like deceptive mesaoptimizers and recipes for ruin to create almost unfalsifiable, obscurantist shoggoth of the gaps arguments against neural gradient methods
[...]
The word “mesaoptimizer” does not appear anywhere in the book.
I think there’s a really common misunderstanding about deceptive mesaoptimizers that maybe you have. The kind of reasoning agent that the book describes is exactly a deceptive mesaoptimizer. It’s a mesaoptimizer because it does planning (a kind of optimization) and was created using gradient descent (the outer optimizer), and it is deceptive in the book scenario when it hides its nature from its creators.
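To spell out the nesting I mean, here is a toy sketch. Every name in it is hypothetical and gradient descent is swapped for hill-climbing, so it only illustrates the outer/inner relationship, not any real training setup:

```python
# Toy picture of the outer/inner distinction: the *outer* optimizer is the
# training loop adjusting parameters; the *mesa*-optimizer is the trained
# policy itself, which runs its own search (planning) at inference time.
# All names are hypothetical stand-ins.

import random

def learned_score(params, action):
    # Stand-in for whatever goal representation training instilled; nothing
    # guarantees it matches the outer training objective exactly.
    return params["w"] * action

def inner_planner(params, actions=range(-5, 6)):
    # Inner optimization: the policy searches for its highest-scoring action.
    return max(actions, key=lambda a: learned_score(params, a))

def training_objective(params):
    # The outer objective only sees the behaviour the planner produces.
    return inner_planner(params)

def outer_optimizer(steps=200):
    # Outer optimization: hill-climbing on parameters as a gradient-descent
    # stand-in, keeping whichever params make the planner behave better.
    params = {"w": random.uniform(-1.0, 1.0)}
    for _ in range(steps):
        candidate = {"w": params["w"] + random.uniform(-0.1, 0.1)}
        if training_objective(candidate) >= training_objective(params):
            params = candidate
    return params
```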
My guess is that you’re thinking of stuff like optimization daemons and the malign prior, where there’s an agent that shows up in a place where it wasn’t intended to show up. I think the similarities caused a bunch of mixed-up ideas on LessWrong and elsewhere.
I think when you say that the book brings the authors closer to your threat model, some of this might be that they were always closer to your threat model and you misunderstood them?
These thinkers gambled everything on a vast space of minds that doesn’t actually exist in practice and lost.
Like I don’t think they ever meant anything different by the “vast space of minds” than what is described in the book.
My meta-critique of the book would be that Yudkowsky already has an extensive corpus of writing about AGI ruin, much of it quite good. I do not just mean The Sequences, I am talking about his Arbital posts, his earlier whitepapers at MIRI like Intelligence Explosion Microeconomics, and other material which he has spent more effort writing than advertising and as a result almost nobody has read it besides me.
Yeah agreed, it’s bizarre how many alignment researchers haven’t read Arbital or old MIRI papers. They mostly do hold up well, imo, if you’re trying to understand the highest hurdles of alignment. I am a bit sad that the book didn’t go into Arbital-style alignment theory very deeply, but I can see why.
My guess is that you’re thinking of stuff like optimization daemons and the malign prior, where there’s an agent that shows up in a place where it wasn’t intended to show up. I think the similarities caused a bunch of mixed-up ideas on LessWrong and elsewhere.
I honestly just remember a lot of absurd posts spending their time thinking about daemons in the weights, based on a model of gradient descent as being evolution-like in ways that it is not. The absurdity of those posts absolutely contributed to the alignment winter by giving people the impression that they were blocked on impossible-seeming problems that don’t actually exist, and so they focused their attention somewhere else. The MIRI cluster very much contributed to this, and I consider this book, to the extent it’s talking about something with the same name, to be a retcon.
I agree that the strict literal words “deceptive mesaoptimizer” mean what you say they do, but also that is not really what people meant by it until fairly recently, when they had to retcon the embarrassing alien shoggoth stuff. It almost always meant a deceptive mesaoptimization daemon, a subset of the network undermining the training goal.
Like I don’t think they ever meant anything different by the “vast space of minds” than what is described in the book.
I am quite certain they did. In any case, there does exist a large space of output heads on top of the shared ontology, so it doesn’t really matter that much. I think the alienness of the minds involved is a total red herring; they could be very hominid-like, and it wouldn’t matter much if they include superintelligent planners doing argmax(p(problem_solved)).
I think the alienness of the minds involved is a total red herring; they could be very hominid-like, and it wouldn’t matter much if they include superintelligent planners doing argmax(p(problem_solved)).
Yeah I agree with this. Although I think focusing on argmax confused a lot of people (including me) and I’m glad they didn’t do that in the book. When I was new to the community, I thought that implementing soft optimization would solve the main problems. I didn’t grok how large the reflective instability and pointer problems were.
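For anyone reading along, a toy contrast between the two, with softmax sampling standing in for “soft optimization” (quantilizers and other versions differ) and a made-up score function standing in for p(problem_solved):

```python
# Toy contrast between hard argmax planning and one version of "soft"
# optimization. Everything here is a hypothetical stand-in: `score` plays
# the role of p(problem_solved | plan).

import math
import random

def hard_argmax(plans, score):
    # Pick the single highest-scoring plan: argmax over p(problem_solved).
    return max(plans, key=score)

def soft_optimize(plans, score, temperature=0.2):
    # Sample a plan with probability proportional to exp(score / T), so
    # merely-good plans still get chosen and extreme outliers dominate less.
    weights = [math.exp(score(p) / temperature) for p in plans]
    return random.choices(plans, weights=weights, k=1)[0]

# Example usage with made-up plans and scores:
plans = ["plan_a", "plan_b", "plan_c"]
score = {"plan_a": 0.6, "plan_b": 0.7, "plan_c": 0.9}.get
print(hard_argmax(plans, score))    # always the highest-scoring plan
print(soft_optimize(plans, score))  # usually, but not always, the same
```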
I honestly just remember a lot of absurd posts spending their time thinking about daemons in the weights, based on a model of gradient descent as being evolution-like in ways that it is not. The absurdity of those posts absolutely contributed to the alignment winter by giving people the impression that they were blocked on impossible-seeming problems that don’t actually exist, and so they focused their attention somewhere else.
Yeah, I agree that this happened. But if there was a retcon, then it would be in RFLO (Risks from Learned Optimization), not in the book, because RFLO defined mesaoptimization in a way that doesn’t match “daemons in the weights”. I think what happened was maybe closer to “lots of wild speculation about weird ways that overpowered optimizers might go wrong”, which, as people became less confused, was consolidated into something much more reasonable and less wild (namely RFLO). But then lots of people mentally attached the word mesaoptimizer to the older ideas.
I think the issue is exacerbated by the way that when people post about alignment, they often have a detailed AGI design in mind and are talking about alignment issues with that design, but the design isn’t described in much detail, or at all. And over the last two decades the AGI designs people have had in mind have varied wildly, and many of them have been pretty silly.
I think the issue is exacerbated by the way that when people post about alignment, they often have a detailed AGI design in mind and are talking about alignment issues with that design, but the design isn’t described in much detail, or at all. And over the last two decades the AGI designs people have had in mind have varied wildly, and many of them have been pretty silly.
I agree with this, and I don’t mind saying for future reference that my current AGI model is in fact a traditional RL agent with a planner and a policy, where the policy is some LLM-like foundation model and the planner is something MCTS-like over ReAct-like blocks. The agent rewards itself by taking motor actions and then checking whether each action succeeded, using evaluation actions that return a boolean result to assess subgoal completion.
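To make that less hand-wavy, here is a minimal sketch of the loop I have in mind. Everything in it is a hypothetical stand-in: the LLM policy, the environment, and the evaluation action are stubbed out, and the search is a bare-bones MCTS over ReAct-like blocks rather than anything tuned:

```python
# Minimal sketch: an LLM-like policy proposes ReAct-like blocks, an MCTS-like
# planner searches over them, and boolean evaluation actions supply the
# reward signal for subgoal completion. All names are hypothetical stubs.

import math
import random
from dataclasses import dataclass, field

def llm_policy(context):
    """Stand-in for the LLM-like foundation model: proposes candidate
    ReAct-style blocks (a thought plus a motor action) given the context."""
    return [f"(thought, action_{i})" for i in range(3)]

def evaluation_action(context):
    """Stand-in for an evaluation action: runs a check in the environment
    and returns a boolean for whether the subgoal was completed."""
    return random.random() < 0.5

@dataclass
class Node:
    context: str
    parent: "Node | None" = None
    children: list = field(default_factory=list)
    visits: int = 0
    value: float = 0.0

def select(node, c=1.4):
    # UCB-style selection: descend to an unexpanded leaf.
    while node.children:
        node = max(
            node.children,
            key=lambda n: (n.value / (n.visits + 1e-9))
            + c * math.sqrt(math.log(node.visits + 1) / (n.visits + 1e-9)),
        )
    return node

def expand(node):
    # The policy proposes ReAct-like blocks; each becomes a child node.
    for block in llm_policy(node.context):
        node.children.append(Node(context=node.context + " " + block, parent=node))
    return random.choice(node.children)

def backpropagate(node, reward):
    while node is not None:
        node.visits += 1
        node.value += reward
        node = node.parent

def plan(root_context, iterations=50):
    # MCTS-like loop: select, expand, score via an evaluation action, backprop.
    root = Node(context=root_context)
    for _ in range(iterations):
        leaf = select(root)
        child = expand(leaf)
        reward = 1.0 if evaluation_action(child.context) else 0.0
        backpropagate(child, reward)
    return max(root.children, key=lambda n: n.visits).context
```

The point is just the shape: the policy proposes blocks, the planner searches over them, and boolean evaluation actions provide the self-administered reward for subgoal completion.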
I agree with this, and I don’t mind saying for future reference that my current AGI model is in fact a traditional RL agent with a planner and a policy, where the policy is some LLM-like foundation model and the planner is something MCTS-like over ReAct-like blocks. The agent rewards itself by taking motor actions and then checking whether each action succeeded, using evaluation actions that return a boolean result to assess subgoal completion.
So, MuZero but with LLMs basically.