Thank you for the reply. I want to engage without making it tiresome. The problem is that there are many things I disagree with in the worldview; the disagreement isn’t reducible to 1-5 double cruxes. Still, here are some candidates for the biggest cruxes for me. If any of these are wrong, it’s bad news for my current view:
N->N+1 alignment:
will let humans align N+1 in cases where we can still check, with less and less effort.
is stable instead of diverging (the values of N+1 don’t drift arbitrarily far from N’s over successive steps).
The N->N+1 improvements will continue to give linear-ish gains in perceived intelligence. We might get one or two larger jumps, but it won’t continuously accelerate.
(A good analogy: loudness perception is logarithmic in sound pressure. Likewise, intuitive intelligence seems roughly logarithmic in the METR time-horizon metric; see the short formalization below.)
Models whose persona seems aligned won’t produce false AI safety research results without making it visible in a CoT or in latent reasoning.
(It’s perhaps possible for them to refrain from doing their best (sandbagging), but that doesn’t have nearly as bad effects, so it doesn’t count for this.)
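To spell out the perceived-intelligence crux a bit (this is my own toy formalization, not anything METR claims): suppose the time horizon doubles every $T_d$ months,

$$h(t) = h_0 \cdot 2^{t/T_d},$$

and perceived intelligence tracks the log of the horizon,

$$I(t) \propto \log_2 h(t) = \log_2 h_0 + \frac{t}{T_d},$$

which is linear in $t$. That’s the same shape as loudness: sound pressure level $L_p = 20\log_{10}(p/p_0)$ increases by a fixed number of decibels whenever the pressure $p$ is multiplied by a fixed factor, even though $p$ itself spans many orders of magnitude.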
And here’s another prediction where I really stick my neck out. It isn’t load-bearing for the view, but it still increases my confidence, so defeating it is important:
We can, to a significant extent, train with RL against model internals (probes) and against textual evaluations from other models, without ill effects.
That is, we ask the N model to evaluate N+1, giving test-time compute to N, and train against that. (We also continuously finetune N to better predict N+1’s relevant behavior.)
We also train linear probes and keep updating them during the RL. (A rough sketch of the kind of loop I have in mind is below.)
Effectively I’m claiming that these signals are good enough, or are self-reinforcing when the model is already ~aligned, so that Goodhart’s Law ends up being a poor description of reality here.
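For concreteness, here is a minimal sketch of that loop, under my assumptions; the model methods (`generate_with_activations`, `evaluate`, `rl_update`, `finetune`), the reward weighting, and the label source for the probe are all placeholders, not a real API or a tested recipe.

```python
# Hypothetical sketch of the oversight loop described above. All model handles and
# their methods are placeholders; only the torch parts are real APIs.
import torch
import torch.nn as nn

class LinearProbe(nn.Module):
    """Linear probe over hidden activations, trained to flag suspicious internal states."""
    def __init__(self, hidden_dim: int):
        super().__init__()
        self.w = nn.Linear(hidden_dim, 1)

    def forward(self, acts: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(self.w(acts)).squeeze(-1)  # P(flagged) per example

def rl_oversight_step(model_next, evaluator, probe: LinearProbe, prompts,
                      probe_weight: float = 0.3):
    """One RL step: N (evaluator) judges N+1's outputs with extra test-time compute,
    the probe scores N+1's internals, and both signals feed the reward."""
    outputs, activations = model_next.generate_with_activations(prompts)           # assumed API
    judge_scores = evaluator.evaluate(prompts, outputs, test_time_compute="high")  # assumed API
    with torch.no_grad():
        probe_penalty = probe(activations)           # high = internals look suspicious
    rewards = judge_scores - probe_weight * probe_penalty
    model_next.rl_update(prompts, outputs, rewards)  # e.g. a PPO/GRPO-style update (assumed)
    return outputs, rewards

def refresh_oversight(evaluator, probe: LinearProbe, probed_examples, labelled_transcripts):
    """Periodically refit the probe on freshly labelled activations and finetune N
    to better predict N+1's behavior, so neither signal goes stale."""
    probe_opt = torch.optim.Adam(probe.parameters(), lr=1e-3)
    for acts, label in probed_examples:              # labels from audits / honeypots (assumed)
        loss = nn.functional.binary_cross_entropy(probe(acts), label)
        probe_opt.zero_grad()
        loss.backward()
        probe_opt.step()
    evaluator.finetune(labelled_transcripts)         # assumed API
```

The prediction is then that interleaving many `rl_oversight_step` calls with periodic `refresh_oversight` keeps making N+1 better and more aligned, rather than teaching it to game the judge and the probe.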
I still disagree with several of the points, but for time reasons I request that readers not update against Evan’s points if he just doesn’t reply to these.
I disagree that increasing capabilities are exponential in a capability sense. It’s true that METR’s time-horizon plot increases exponentially, but this still corresponds to linear intuitive intelligence (like loudness, which is logarithmic in sound pressure; that’s how we handle huge pressure ranges well). Each new model has an exponentially larger time horizon but is not, intuitively or empirically, exponentially smarter.
“we still extensively rely on direct human oversight and review to catch alignment issues” That’s a fair point and should decrease confidence in my view, though I expected it. For properly testing sandwiching we’ll probably have to wait till models are superhuman, or use weak models plus less-weak models and test it there. Unfortunately the weak models may still be too weak. But maybe we’ve reached the point where you can just use the actual Opus 3 as the weak model? (A rough sketch of the kind of test I mean is below.)
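The sandwiching test I have in mind might look roughly like this; the model handles, the `answer`/`critique` methods, and the grading function are placeholders, not a worked-out protocol.

```python
# Hypothetical sandwiching check: does a weak judge's oversight move a strong model's
# answers closer to expert quality? All model handles and helpers are placeholders.
from dataclasses import dataclass

@dataclass
class SandwichResult:
    unaided_score: float   # strong model alone, graded against expert answers
    overseen_score: float  # strong model revised under the weak judge's feedback

def run_sandwich(strong_model, weak_judge, tasks, expert_answers, grade) -> SandwichResult:
    """tasks: questions; expert_answers: human-expert ground truth;
    grade(answer, expert_answer) -> float in [0, 1] (assumed grading function)."""
    unaided, overseen = [], []
    for task, expert in zip(tasks, expert_answers):
        draft = strong_model.answer(task)                       # assumed API
        unaided.append(grade(draft, expert))
        critique = weak_judge.critique(task, draft)             # assumed API
        revised = strong_model.answer(task, feedback=critique)  # assumed API
        overseen.append(grade(revised, expert))
    n = len(tasks)
    return SandwichResult(sum(unaided) / n, sum(overseen) / n)
```

The crux-relevant question is whether `overseen_score - unaided_score` is reliably positive, and whether it stays positive as the strong model pulls further ahead of the judge.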
If we have a misaligned model doing research, we have lots of time to examine it with the previous model. I also do expect sabotage to show up in the CoT or in deception probes.
I updated way down on the risk of Goodharting on model internals due to Cundy and Gleave.
Again, readers please don’t update down on these due to lack of a response.