The basic problems with it were mentioned by Rohin Shah a long time ago:
I like what you call “complicated schemes” over “retarget the search” for two main reasons:
They don’t rely on the “mesa-optimizer assumption” that the model is performing retargetable search (which I think will probably be false in the systems we care about).
They degrade gracefully with worse interpretability tools, e.g. in debate, even if the debaters can only credibly make claims about whether particular neurons are activated, they can still say stuff like “look my opponent is thinking about synthesizing pathogens, probably it is hoping to execute a treacherous turn”, whereas “Retarget the Search” can’t use this weaker interpretability at all. (Depending on background assumptions you might think this doesn’t reduce x-risk at all; that could also be a crux.)
Nate Soares has written stuff before which touches on point 1 as well; I personally agree with the first paragraph below, and (contra Soares) I believe the second is mostly irrelevant for the purposes of AI safety:[1]
By default, the first minds humanity makes will be a terrible spaghetti-code mess, with no clearly-factored-out “goal” that the surrounding cognition pursues in a unified way. The mind will be more like a pile of complex, messily interconnected kludges, whose ultimate behavior is sensitive to the particulars of how it reflects and irons out the tensions within itself over time.
Making the AI even have something vaguely nearing a ‘goal slot’ that is stable under various operating pressures (such as reflection) during the course of operation, is an undertaking that requires mastery of cognition in its own right—mastery of a sort that we’re exceedingly unlikely to achieve if we just try to figure out how to build a mind, without filtering for approaches that are more legible and aimable.
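Neither quote pins down what a “goal slot” would look like mechanically, so here is a minimal toy sketch of the picture that “Retarget the Search” assumes (the names `general_purpose_search`, `successors`, `goal`, and `heuristic` are my own illustrative choices, not anything from the quoted authors): the search machinery is generic, and retargeting it just means handing it a different goal.

```python
# Toy sketch only: a General Purpose Search with an explicit, swappable "goal slot".
# "Retarget the Search" presumes something like this exists inside the model and can
# be located; the spaghetti-code worry in the quote above is that it may not.
import heapq
import itertools
from typing import Callable, Hashable, Iterable, Optional

def general_purpose_search(
    start: Hashable,
    successors: Callable[[Hashable], Iterable[tuple[Hashable, float]]],  # world model
    goal: Callable[[Hashable], bool],                                    # the "goal slot"
    heuristic: Callable[[Hashable], float] = lambda s: 0.0,
) -> Optional[list[Hashable]]:
    """Generic best-first search: only `goal` (and optionally `heuristic`)
    changes when the search is retargeted; the machinery is reused as-is."""
    tie = itertools.count()  # tie-breaker so states never need to be comparable
    frontier = [(heuristic(start), next(tie), 0.0, start, [start])]
    seen: set[Hashable] = set()
    while frontier:
        _, _, cost, state, path = heapq.heappop(frontier)
        if goal(state):
            return path
        if state in seen:
            continue
        seen.add(state)
        for nxt, step_cost in successors(state):
            if nxt not in seen:
                new_cost = cost + step_cost
                heapq.heappush(
                    frontier,
                    (new_cost + heuristic(nxt), next(tie), new_cost, nxt, path + [nxt]),
                )
    return None

# Retargeting = handing the same machinery a different goal:
#   plan_a = general_purpose_search(s0, successors, goal=is_goal_a)
#   plan_b = general_purpose_search(s0, successors, goal=is_goal_b)
```

The force of the Soares quote is precisely that real systems may contain no component this cleanly factored out.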
One area where I’ve changed my mind compared to my previous beliefs is that, for the systems we care about, I now think capabilities come either from a relatively expensive General Purpose Search (GPS) or from something like a mesa-optimizer that does the GPS indirectly, and I no longer believe it is highly probable that mostly imitating humans will be the largest source of AI capabilities. (I’d now put 35% on a non-mesa-optimizer or non-GPS model automating away, say, AI research with practical compute and data, compared to over 50% back in 2023.)
Part of this comes down to my belief that AI companies will pay the compute inefficiency to have models implement General Purpose Search more, and another part is that I’m much more bearish on pure LLM capabilities than I used to be. In particular, I think the reasons why current AIs aren’t mesa-optimizers (for the model-free version of RL) or don’t have a General Purpose Search (for the model-based version of RL) map pretty well onto the reasons why current AIs are much less performant than benchmarks imply: in-context learning is currently far too weak, and context windows so far have not been enough to let an LLM compound work over months or years of thinking, or to deal with problems that aren’t tag-teamable.
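As a rough back-of-the-envelope illustration of the context-window point (every number below is an assumption I’m picking for illustration, not a measurement):

```python
# Back-of-the-envelope only; all figures are assumptions, not measurements.
TOKENS_PER_MINUTE = 100      # assumed rate at which "thinking" serializes into tokens
HOURS_PER_DAY = 8            # assumed focused hours per working day
WORKING_DAYS = 250           # roughly a year of working days

year_of_thought_tokens = TOKENS_PER_MINUTE * 60 * HOURS_PER_DAY * WORKING_DAYS
context_window_tokens = 1_000_000  # assumed size of a large present-day context window

print(f"~{year_of_thought_tokens:,} tokens for a year of serialized thought vs "
      f"a {context_window_tokens:,}-token window "
      f"(~{year_of_thought_tokens / context_window_tokens:.0f}x over budget)")
# => ~12,000,000 tokens for a year of serialized thought vs a 1,000,000-token window (~12x over budget)
```

The gap only widens for the multi-year efforts that serious compounding would require.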
More generally, I expect more coherence in AIs than I used to, due to my view that labs will spend more on compute to get semi-reliable insights out of AIs.