My takes on your comment:

I think this is possible, but I currently suspect the likely answer is more boring than that: getting to AGI with a labor-light, compute-heavy approach (as evolution did) means it isn't worth investing much in interpretable AIs, even if strong interpretable AIs could exist, and a similar condition holds in the modern era. But one effect of AIs that can replace humans is that they disproportionately boost labor relative to compute, which makes interpretable AIs more economical than before.

I probably agree with 2, so I’ll move on to your 3rd point:
The minimum conditions for Darwin to kick in are very hard to avoid: variation (from weight updates or fine tuning), copying, and differential success. Darwinism might occur at many different levels, from corporations deciding how to train the next model based on the commercial success of the previous model, to preferentially copying learning models that are good at some task.
This turns out to be negligible in practice, and it’s a good example of how thinking quantitatively will lead you to better results than thinking qualitatively.

This is the post version of why a nanodevice wouldn’t evolve in practice, but the main issue I see is that once you are past the singularity/intelligence explosion there will be very little variation, possibly none at all, in how fast things reproduce, since the near-optimal or optimal solutions are already known, and replication on digital computers is so reliable that waiting for mutations would take longer than we could live for even with maximally advanced technology (assuming we aren’t massively wrong about physics).

In the regime we actually care about, especially once we have AIs that can automate AI R&D but before AI takeover, evolution matters more, and in particular it can matter for scenarios where we depend on a small amount of niceness, so I’m not saying evolution is irrelevant to AI outcomes. But the big near-term issue is that digital replication is so reliable that random mutations are far harder to introduce, so any evolution has to look much closer to intelligent design. And the long-term problem for Darwinian selection is that there is little or no remaining ability to vary differential success (again assuming our physics models aren’t completely wrong), because we will be technologically mature and simply unable to push technology any further.

Also, corrigibility doesn’t depend on us having interpretable/understandable AIs (though it does help).
Just to clarify, when I mentioned evolution, I was absolutely not thinking of Drexlerian nanotech at all.
I am making a much more general argument about gradual loss of control. I could easily imagine that AI “reproduction” might literally be humans running `cp updated_agent_weights.gguf coding_agent_v4.gguf`.
There will be enormous, overwhelming corporate pressure for agents that learn as effectively as humans, on the order of trillions of dollars of demand. The agents that learn most effectively will be used as the basis for future agents, which will learn in turn. (Technically, I suppose it’s more likely to be Lamarckian evolution than Darwinian. But it gets you to the same place.) Replication, change, and differential success are all you need to create optimizing pressure.
(EDIT: And yes, I’m arguing that we’re going to be dumb enough to do this, just like we immediately hooked up current agents to a command line and invented “vibe coding” where humans brag about not reading the code.)
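A minimal toy sketch of this copy-vary-select loop, under loose assumptions: the filenames, the scoring function, and the noise-based "variation" are all stand-ins made up for illustration, not anyone's actual pipeline.

```python
# Toy sketch of the loop described above: copy the most successful agent's
# weights, let the copy vary, keep whichever scores better, repeat.
# "Weights" are just a list of floats in a file; "success" is an arbitrary
# stand-in score, not a real benchmark or revenue figure.
import json
import random
import shutil
import tempfile
from pathlib import Path

workdir = Path(tempfile.mkdtemp())

def save_weights(path, weights):
    path.write_text(json.dumps(weights))

def load_weights(path):
    return json.loads(path.read_text())

def success(weights):
    # Stand-in for differential success (commercial demand, task performance, ...).
    return -sum((w - 1.0) ** 2 for w in weights)

def vary(weights):
    # Stand-in for variation: further learning, fine-tuning, or weight updates.
    return [w + random.gauss(0, 0.1) for w in weights]

# Initial agent: random weights written to a file.
best = workdir / "coding_agent_v1.json"
save_weights(best, [random.gauss(0, 1) for _ in range(8)])

for generation in range(2, 20):
    # Replication: literally copying the current best agent's weights file.
    candidate = workdir / f"coding_agent_v{generation}.json"
    shutil.copy(best, candidate)

    # Variation: the copy keeps changing (here, just noise).
    save_weights(candidate, vary(load_weights(candidate)))

    # Differential success: the better performer becomes the basis for the next copy.
    if success(load_weights(candidate)) > success(load_weights(best)):
        best = candidate

print("surviving lineage:", best.name, "score:", round(success(load_weights(best)), 3))
```

The shutil.copy call mirrors the literal cp command above; swapping the noise for actual further training would make the loop Lamarckian rather than Darwinian, but, as noted, it lands in the same place.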
Also, corrigibility doesn’t depend on us having interpretable/understandable AIs (though it does help).
“We don’t understand the superintelligence, but we’re pretty sure it’s corrigible” does not seem like a good plan to me.