Can you name any way to solve [chess but with rooks and bishops not being able to move more than four squares at a time] without RL (or something functionally equivalent to RL)?
This isn’t even hard. Just take a pre-2017 chess engine, and edit the rules code so that rooks and bishops can move at most four squares. You’re probably already done: the core minimax search still works, α–β pruning still works, quiescence still works, &c. To be fair, the heuristic evaluation function won’t be correct, but you could just … make bishops and rooks worth 2.5 and 3.5 points respectively, instead of the traditional 3 and 5? Even if my guess at those point values is wrong, that should still be easily superhuman with 2017 algorithms on 2017 hardware. (Stockfish didn’t incorporate neural networks until 2020.)
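As a purely illustrative sketch of what that rule edit looks like, here is the move restriction and the guessed piece values expressed with the python-chess library rather than inside an actual engine; a real implementation would also need the engine’s attack/check logic and search to respect the same restriction:

```python
# Illustrative sketch only, not any actual engine's rules code.
import chess

MAX_SLIDE = 4  # rooks and bishops may move at most four squares in this variant

def variant_legal_moves(board: chess.Board):
    """Standard legal moves, minus rook/bishop moves longer than four squares."""
    for move in board.legal_moves:
        piece = board.piece_at(move.from_square)
        if piece.piece_type in (chess.ROOK, chess.BISHOP):
            # Along a rank, file, or diagonal, Chebyshev distance equals the
            # number of squares travelled.
            if chess.square_distance(move.from_square, move.to_square) > MAX_SLIDE:
                continue
        yield move

# Guessed evaluation weights for the variant (other pieces keep their usual values).
VARIANT_PIECE_VALUES = {
    chess.PAWN: 1.0,
    chess.KNIGHT: 3.0,
    chess.BISHOP: 2.5,
    chess.ROOK: 3.5,
    chess.QUEEN: 9.0,
}
```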
Incidentally, there are a great many variant versions of chess with different piece-move rules (collectively sometimes called “fairy chess”), and I think even quite a lot of collected games for some of the more popular rule variants. Training an AI to play many types of fairy chess, and even arbitrary new just-invented ones, might be an interesting project that covers some aspects of generalizing out-of-distribution and positive transfer. A suitably-edited-for-the-variant version of Stockfish makes a pretty strong baseline for this. Using AlphaZero per variant is another obvious baseline.
Hmm, you’re probably right.
But I think my point would have worked if I had suggested a modified version of Go rather than chess?
There’s not a lot of scope for aligned/unaligned behavior in Go (or chess): it’s a zero-sum game, so I don’t see how any Go plays could be labeled as aligned or unaligned. How about some complex tactical or simulation game that actually has scope for aligned/unaligned, or at least moral/immoral, behavior? Ideally one where you are roleplaying as an AI, so aligned behavior is appropriate, or at least doing some sort of resource-management or strategy task that might get assigned to an AI.
I was trying to argue in favor of:
CLAIM: there are AI capabilities things that cannot be done without RL training (or something functionally equivalent to RL training).
It seems to me that, whether this claim is true or false, it has nothing to do with alignment, right?
There are certainly things that it’s easier to do with RL — whether it’s ever an absolute requirement I’m less sure. One other commenter has implied that someone has proven that RL always has non-RL equivalent alternatives, but if that’s the case I’m not familiar with the details — I’d love references to anything relevant to this, if anyone has them.
My claim is that using RL to align an unaligned LLM smarter than us is likely to be impossible to do safely/reliably (and especially so for online RL), but that fortunately, aligning an LLM by pretraining or finetuning is possible, and logistically is not very different in difficulty from using offline RL. Functionally, it’s basically equivalent to offline RL plus a satisficing approach to the rating, which keeps the behavior inside the training distribution and so avoids Goodharting issues.
Suppose we lived in a spatially-finite universe with simple deterministic laws of physics that we have fully colonized, in which we can run a computation for any finite number of steps that we can specify. (For example, everyone agrees to hibernate until it’s done.) Let’s use it to play Go.
Run all ~2^(2^33) programs (“contestants”) that fit in a gigabyte (2^33 bits) against each other from all ~3^(19^2) possible positions. Delete all contestants that use more than 2^(2^(2^(2^(2^100)))) CPU cycles on any one move. For every position from which some contestant wins every match, delete every contestant that doesn’t win every match.
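Spelled out as a toy sketch (illustrative only, and obviously not runnable at this scale), the selection rule is something like the following, where `beats_everyone[p][c]` is an assumed precomputed table of whether contestant `c` wins every match started from position `p`:

```python
# Illustrative only: the filtering rule from the thought experiment.
# `beats_everyone[p][c]` is True iff contestant c, within the cycle budget,
# wins every match from position p against every other budget-respecting contestant.

def surviving_contestants(contestants, positions, beats_everyone):
    survivors = set(contestants)
    for p in positions:
        # A position only eliminates contestants if someone wins every match from it.
        if any(beats_everyone[p][c] for c in contestants):
            survivors &= {c for c in contestants if beats_everyone[p][c]}
    return survivors
```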
This enforces ~perfect play. Is it safe to pick a surviving contestant pseudorandomly? Not clearly: consider the following reasonably common kind of contestant.
1. Most of it is written in a custom programming language. This means it’ll also need to contain an interpreter for that language, but overall this is probably more efficient than working in whatever language we picked. As a side effect, it knows most of its own source code C.
2. Given input I, for each possible output O, it makes use of the logical consequences of “Source code C, given input I, produces output O.” For example, it might return the O for which it can prove the most consequences.
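As a toy sketch of that decision rule (my own illustration, with `prove_consequences` standing in for whatever bounded theorem prover the contestant actually contains):

```python
# Toy illustration of rule 2.: pick the output whose "C on I outputs O" statement
# yields the most provable consequences. `prove_consequences` is a stand-in.

def choose_output(C, I, candidate_outputs, prove_consequences):
    def score(O):
        premise = f"Source code {C}, given input {I}, produces output {O}."
        return len(prove_consequences(premise))
    return max(candidate_outputs, key=score)
```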
What logical consequences might it prove? “1=1” for one, but that will count towards every O. “Source code C, given input I, produces output O.” for another, but that’s a pretty long one. If it would be the survivor in line to be pseudorandomly selected, most consequences of its decision are via the effects on our universe!
So if it predicts that it would be selected[1], it will output perfect play to survive, and then keep being consequentialist about any choice between two winning strategies—for example, it might spell out a message if we would watch the winner play, or it could steer our experiment’s statistics to inspire a follow-up experiment that will, due to a bug, run forever rather than ever waking us up from hibernation.
[1] Or, by some tweaking of 2., if it assumes that it would be selected, because otherwise the choice of O doesn’t matter.