Steven Byrnes comments on The Best Way to Align an LLM: Is Inner Alignment Now a Solved Problem?

Steven Byrnes 2 Jun 2025 14:00 UTC
3 points
1
Let me give you a detailed presciption…
For example, people want AIs that can autonomously come up with a new out-of-the-box innovative business plan, and found the company, and grow it to $1B/year revenue, over the course of years, all with literally zero human intervention.
People are trying to build such AIs as we speak, and I don’t expect them to quit until they succeed (or until we all die from their attempt).
And it’s possible—if human brains (or groups of human brains) can do this, so can AI algorithms. But human brains involve (model-based) RL. It’s an open question whether there exists a non-RL algorithm that can also do that. (LLMs as of today obviously cannot.)
I think the issue here is: “some aspect of the proposed input would need to not be computable/generatable for us”.
If the business is supposed to be new and out-of-the-box and innovative, then how do you generate on-distribution data? It’s gonna be something that nobody has ever tried before; “out-of-distribution” is part of the problem description, right?
Can you (or anyone) explain to me why there could be a problem that we can only solve using RL on rated examples, and could not do via SGD on labeled examples?
Not all RL is “RL on [human] rated examples” in the way that you’re thinking of it. Jeff Bezos’s brain involves (model-based) RL, but it’s not like he tried millions of times to found millions of companies, and his brain gave a reward signal for the companies that grew to $1B/year revenue, and that’s how he wound up able to found and run Amazon. In fact Amazon was the first company he ever founded.
Over the course of my lifetime I’ve had a billion or so ideas pass through my head. My own brain RL system was labeling these ideas as good or bad (motivating or demotivating), and this has led to my learning over time to have more good ideas (“good” according to certain metrics in my own brain reward function). If a future AI was built like that, having a human hand-label the AI’s billion-or-so “thoughts” as good or bad would not be viable. (Futher discussion in §1.1 here). For one thing, there’s too many things to label. For another thing, the ideas-to-be-rated are inscrutable from the outside.
I’m also still curious how you think about RLVR. Companies are using RLVR right now to make their models better at math. Do you have thoughts on how they can make their models equally good at math without using RLVR, or any kind of RL, or anything functionally equivalent to RL?
Also, here’s a challenge which IMO requires RL [Update: oops, bad example, see Zack’s response]. I have just invented a chess variant, Steve-chess. It’s just like normal chess except that the rooks and bishops can only move up to four spaces at a time. I want to make a computer play that chess variant much better than any unassisted human ever will. I only want to spend a few person-years of R&D effort to make that happen (which rules out laborious hand-coding of strategy rules).
That’s the Steve-chess challenge. I can think of one way to solve the Steve-chess challenge: the AlphaZero approach. But that involves RL. Can you name any way to solve this same challenge without RL (or something functionally equivalent to RL)?
- Zack_M_Davis 2 Jun 2025 16:10 UTC
  8 points
  5
  Parent
  
  Can you name any way to solve [chess but with rooks and bishops not being able to move more than four squares at a time] without RL (or something functionally equivalent to RL)?
  
  This isn’t even hard. Just take a pre-2017 chess engine, and edit the rules code so that rooks and bishops can only move four spaces. You’re probably already done: the core minimax search still works, α–β pruning still works, quiescence still works, &c. To be fair, the heuristic evaluation function won’t be correct, but you could just … make bishops and rooks be respectively worth 2.5 and 3.5 points instead of the traditional 3 and 5? Even if my guess at those point values is wrong, that should still be easily superhuman with 2017 algorithms on 2017 hardware. (Stockfish didn’t incorporate neural networks until 2020.)
  - RogerDearnaley 3 Jun 2025 2:47 UTC
    2 points
    0
    Parent
    Incidentally, there are a great many variant versions of chess with different piece-move rules (collectively sometimes called “fairy chess”), and I think even quite a lot of collected games for some of the more popular rule variants. Training an AI to play many types of fairy chess, and even arbitrary new just-invented ones, might be an interesting project that covers some aspects of generalizing out-of-distribution and positive transfer. A suitably-edited-for-the-variant version of Stockfish makes a pretty strong baseline for this. Using AlphaZero per variant is another obvious baseline.
  - Steven Byrnes 2 Jun 2025 16:33 UTC
    2 points
    0
    Parent
    Hmm, you’re probably right.
    But I think my point would have worked if I had suggested a modified version of Go rather than chess?
    - RogerDearnaley 3 Jun 2025 1:22 UTC
      2 points
      0
      Parent
      There’s not a lot of scope for aligned/unaligned behavior in Go (or chess): it’s a zero-sum game, so I don’t see how any Go plays could be labeled as aligned or unaligned. How about some complex tactical or simulation game that actually has a scope for aligned/unaligned or at least moral/immoral behavior? Ideally one where you are roleplaying as an AI, so aligned behavior is appropriate, or at least doing some sort of resource management or strategy task that might get assigned to an AI.
      - Steven Byrnes 3 Jun 2025 10:50 UTC
        2 points
        0
        Parent
        I was trying to argue in favor of:
        CLAIM: there are AI capabilities things that cannot be done without RL training (or something functionally equivalent to RL training).
        It seems to me that, whether this claim is true or false, it has nothing to do with alignment, right?
        RogerDearnaley 3 Jun 2025 21:15 UTC
        2 points
        0
        Parent
        There are certainly things that it’s easier to do with RL — whether it’s ever an absolute requirement I’m less sure. One other commenter has implied that someone has proven that RL always has non-RL equivalent alternatives, but if that’s the case I’m not familiar with the details — I’d love references to anything relevant to this, if anyone has them.
        
        My claim is that using RL to align an unaligned LLM smarter than us is likely to be impossible to do safely/reliably (and especially so for online RL), but that fortunately, aligning an LLM by pretraining or finetuning is possible, and logistically is not very different in difficulty from using offline RL. Functionally, it’s basically equivalent to offline RL plus a satisficing approach to the rating that keeps the behavior inside the training distribution so avoids Goodharting issues.
      - Gurkenglas 8 Jun 2025 10:29 UTC
        0 points
        0
        Parent
        Suppose we lived in a spatially-finite universe with simple deterministic laws of physics that we have fully colonized, in which we can run a computation for any finite number of steps that we can specify. (For example, everyone agrees to hibernate until it’s done.) Let’s use it to play Go.
        Run all ~2^2^33 programs (“contestants”) that fit in a gigabyte against each other from all ~3^19^2 possible positions. Delete all contestants that use more than 2^2^2^2^2^100 CPU cycles on any one move. For every position from which some contestant wins every match, delete every contestant that doesn’t win every match.
        This enforces ~perfect play. Is it safe to pick a surviving contestant pseudorandomly? Not clearly: Consider the following reasonably-common kind of contestant.
        Most of it is written in a custom programming language. This means it’ll also need to contain an interpreter for that language, but probably overall this is more efficient than working in whatever language we picked. As a side effect, it knows most of its source code C.
        Given input I, for each possible output O, it makes use of the logical consequences of “Source code C, given input I, produces output O.”. For example, it might return the O for which it can prove the most consequences.
        What logical consequences might it prove? “1=1” for one, but that will count towards every O. “Source code C, given input I, produces output O.” for another, but that’s a pretty long one. If it would be the survivor in line to be pseudorandomly selected, most consequences of its decision are via the effects on our universe!
        So if it predicts that it would be selected^[1], it will output perfect play to survive, and then keep being consequentialist about any choice between two winning strategies—for example, it might spell out a message if we would watch the winner play, or it could steer our experiment’s statistics to inspire a follow-up experiment that will, due to a bug, run forever rather than ever waking us up from hibernation.
        ^
        Or by some tweaking of 2., if it assumes that it would be selected because otherwise the choice of O doesn’t matter,
- RogerDearnaley 3 Jun 2025 1:19 UTC
  3 points
  0
  Parent
  Suppose we had a CoT-style transcript of every thought, word, email and action by the founder of a successful startup over the course of several years of its founding, and used this for RL: then we’d get a reward signal every time they landed a funding round, sales went up significantly, a hire they made or contract they signed clearly worked out well, and so forth — not enough training data by itself for RL, but perhaps a useful contribution.
  
  The SGD safety pretraining equivalent would be to include that transcript in the pretraining dataset (or, since such data is very rare and useful/high quality, perhaps an entrepreneurship-specific fine-tuning dataset). So far, very similar. You would also (likely AI-assisted) look through all of the transcript, and if you located any portions where the behavior was less wise or less moral/aligned than the behavior we’d like to see from an aligned AI-entrepreneur, label that potion with <|unaligned|> tags (or whatever), and perhaps also supplement it with commentary on subject like why it is less wise/moral/aligned than the standards for an aligned AI, what should have been done instead, and speculations around the likely results of those counterfactual actions.
  - Steven Byrnes 3 Jun 2025 10:53 UTC
    3 points
    0
    Parent
    Suppose we had a CoT-style transcript of every thought, word, email and action by the founder of a successful startup over the course of several years of its founding, and used this for RL: then we’d get a reward signal every time they landed a funding round, sales went up significantly, a hire they made or contract they signed clearly worked out well, and so forth — not enough training data by itself for RL, but perhaps a useful contribution.
    I don’t think this approach would lead to an AI that can autonomously come up with a new out-of-the-box innovative business plan, and found the company, and grow it to $1B/year revenue, over the course of years, all with literally zero human intervention.
    …So I expect that future AI programmers will keep trying different approaches until they succeed via some other approach.
    And such “other approaches” certainly exist—for example, Jeff Bezos’s brain was able to found Amazon without training on any such dataset, right?
    (Such datasets don’t exist anyway, and can’t exist, since human founders can’t write down every one of their thoughts, there are too many of them and they are not generally formulated in English.)
    - RogerDearnaley 8 Jun 2025 3:39 UTC
      3 points
      0
      Parent
      It’s unclear to me how one could fine-tune high quality automated-CEO AI without such training sets (which I agree are impractical to gather — that was actually part of my point, though one might have access to, say, a CEO’s email logs, diary, and meeting transcripts). Similarly, to train one using RL, one would need an accurate simulation environment that simulates a startup and all its employees, customers, competitors, and other world events — which also sounds rather impractical.
      In practice, I suspect we’ll first train an AI assistant/advisor to CEOs. and then use that to gather the data to train an automated CEO model. Or else we’ll train something so capable that it can generalize from more tractable training tasks to being a CEO, and do a better job than a human even on a task it hasn’t been specifically trained on.