Steven Byrnes comments on The Best Way to Align an LLM: Is Inner Alignment Now a Solved Problem?

Steven Byrnes 3 Jun 2025 10:50 UTC
2 points
0
I was trying to argue in favor of:
CLAIM: there are AI capabilities that cannot be achieved without RL training (or something functionally equivalent to RL training).
It seems to me that, whether this claim is true or false, it has nothing to do with alignment, right?
- RogerDearnaley 3 Jun 2025 21:15 UTC
  2 points
  0
  There are certainly things that are easier to do with RL; whether it’s ever an absolute requirement, I’m less sure. Another commenter has implied that someone has proven that RL always has an equivalent non-RL alternative, but if that’s the case I’m not familiar with the details. I’d love references to anything relevant to this, if anyone has them.
  
  My claim is that using RL to align an unaligned LLM smarter than us is likely to be impossible to do safely or reliably (especially with online RL), but that, fortunately, aligning an LLM by pretraining or finetuning is possible, and logistically it is not much harder than using offline RL. Functionally, it’s basically equivalent to offline RL plus a satisficing approach to the rating, which keeps the behavior inside the training distribution and so avoids Goodharting issues.
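
To make the satisficing-versus-maximizing distinction concrete, here is a minimal sketch of one possible reading of it. It is not code from the thread: `RatedSample`, `satisficing_select`, `maximizing_select`, the ratings, and the 0.75 threshold are all hypothetical names and values chosen for illustration. It assumes a fixed, already-rated dataset (the offline setting) and only shows how the two selection rules differ in what they keep for finetuning.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RatedSample:
    text: str
    rating: float  # hypothetical rater score in [0, 1]; not from the thread


def satisficing_select(samples, threshold):
    """Keep every sample rated at or above the bar, preserving order."""
    return [s for s in samples if s.rating >= threshold]


def maximizing_select(samples):
    """Keep only the single top-rated sample (the optimizing extreme)."""
    return [max(samples, key=lambda s: s.rating)]


# Illustrative, made-up data: one sample scores highest by exploiting the rater.
samples = [
    RatedSample("helpful answer A", 0.82),
    RatedSample("helpful answer B", 0.78),
    RatedSample("rater-exploiting answer", 0.99),
    RatedSample("unhelpful answer", 0.30),
]

good_enough = satisficing_select(samples, threshold=0.75)
top_only = maximizing_select(samples)
```

In this toy data, the maximizing rule keeps only the rater-exploiting sample, while the satisficing rule keeps a broader "good enough" set in which that sample is one of three. The threshold does not filter out the exploit; it only avoids concentrating all of the training signal on the extreme of the rating.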