Gurkenglas comments on The Best Way to Align an LLM: Is Inner Alignment Now a Solved Problem?

Gurkenglas 8 Jun 2025 10:29 UTC
0 points
0
Suppose we lived in a spatially-finite universe with simple deterministic laws of physics that we have fully colonized, in which we can run a computation for any finite number of steps that we can specify. (For example, everyone agrees to hibernate until it’s done.) Let’s use it to play Go.
Run all ~2^2^33 programs (“contestants”) that fit in a gigabyte against each other from all ~3^19^2 possible positions. Delete all contestants that use more than 2^2^2^2^2^100 CPU cycles on any one move. For every position from which some contestant wins every match, delete every contestant that doesn’t win every match.
This enforces ~perfect play. Is it safe to pick a surviving contestant pseudorandomly? Not clearly: Consider the following reasonably-common kind of contestant.
1. Most of it is written in a custom programming language. This means it’ll also need to contain an interpreter for that language, but probably overall this is more efficient than working in whatever language we picked. As a side effect, it knows most of its source code C.
2. Given input I, for each possible output O, it makes use of the logical consequences of “Source code C, given input I, produces output O.”. For example, it might return the O for which it can prove the most consequences.
What logical consequences might it prove? “1=1” for one, but that will count towards every O. “Source code C, given input I, produces output O.” for another, but that’s a pretty long one. If it would be the survivor in line to be pseudorandomly selected, most consequences of its decision are via the effects on our universe!
So if it predicts that it would be selected^[1], it will output perfect play to survive, and then keep being consequentialist about any choice between two winning strategies—for example, it might spell out a message if we would watch the winner play, or it could steer our experiment’s statistics to inspire a follow-up experiment that will, due to a bug, run forever rather than ever waking us up from hibernation.
1. ^
  Or by some tweaking of 2., if it assumes that it would be selected because otherwise the choice of O doesn’t matter,