Empirically, current LLM behavior is better predicted by a model
that talks about reflexes toward pseudo-kindness steering a limitedly capable reasoning process, together with situationally aware instrumental reasoning,
than by a model
that talks about reflexes toward approximate true kindness steering a limitedly capable reasoning process which, for some reason, currently worsens the approximation and leads to the observed unkind behavior.
Under capability growth, the second model can indeed yield a capable reasoner steered by reflexes toward approximate true kindness. And if we get enough training before ASI, the approximation can become good enough that, due to discreteness or attractors, it just equals true kindness.
The first model just generalizes to a capable misaligned reasoner.
Okay, I partly agree with this. But I’m not saying current LLMs are aligned. I’m explaining how techniques from the same class as the ones we use today could be used to create aligned agents, if implemented correctly.
Oops. Then I don’t get what techniques you are proposing. Like, most techniques that claim to work for superintelligence / powerful agents also claim to work in some more limited manner for current agents (in part because most techniques assume that no phase change occurs between now and then, or that the phase change doesn’t affect the technique, which would mean the technique stops working only gradually and one can do empirical studies on current models).
And while there is certainly some loss function or initial random seed under which current techniques give you aligned superintelligence, there’s no way to find it.
When I say “current techniques” I mean the recipe I gave here:
So, basically all modern techniques for training an LLM to have a certain skill or proclivity consist in:
- Defining some metric that determines how much you like an LLM’s output.
- Sampling from the LLM.
- Making a local update to the parameters of your model so the token outputs you “liked” according to the metric become more likely.
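To make the shape of that recipe concrete, here is a minimal sketch of the loop. The gpt2 model, the keyword-checking metric, the bare REINFORCE-style update, and the hyperparameters are all illustrative stand-ins, not anyone’s actual training pipeline.

```python
# Toy instantiation of the three-step recipe: metric -> sample -> update.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
opt = torch.optim.Adam(model.parameters(), lr=1e-5)

def metric(text: str) -> float:
    # Step 1: some metric for how much you "like" an output (toy placeholder).
    return 1.0 if "helpful" in text.lower() else 0.0

prompt = "How should I respond to an angry customer?"
inputs = tok(prompt, return_tensors="pt")

for step in range(100):
    # Step 2: sample from the LLM.
    out = model.generate(**inputs, do_sample=True, max_new_tokens=40,
                         pad_token_id=tok.eos_token_id)
    completion_ids = out[0, inputs["input_ids"].shape[1]:]
    reward = metric(tok.decode(completion_ids))

    # Step 3: local parameter update making the tokens you "liked" more likely
    # (a bare REINFORCE step weighted by the metric).
    logprobs = torch.log_softmax(model(out).logits[0, :-1], dim=-1)
    token_logprobs = logprobs.gather(1, out[0, 1:, None]).squeeze(1)
    loss = -reward * token_logprobs[-completion_ids.shape[0]:].sum()
    opt.zero_grad()
    loss.backward()
    opt.step()
```

All the “free parameters” below live in how `metric` is defined, where the prompts and samples come from, and what exact update rule replaces the bare REINFORCE step.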
There are tons of “free parameters” in how you implement such a recipe: e.g. constitutional AI, deliberative alignment, SFT, RLHF (with PPO or DPO or GRPO), whatever. And for each of these there are still more free parameters in how exactly you implement it, and, most importantly, in how the data is generated.
Most of these (including the SOTA methods used by AI labs) don’t, I think, yield alignment. I tried to explain in the post which exact implementation of this class of techniques can lead to alignment, but the short version is:
1. Prompt the AI so that the latent circuitry simulating the desired features is on the most direct causal pathway by which the model generates the output the finetuning objective rewards when it’s run on training samples. “Prompt it so that it’s trying its best to simulate a nice, helpful character before you feed it the finetuning samples / have it output the text you’ll grade it on” is a simplified, but not that unfaithful, version.
2. Don’t use this to make a nice value-aligned AI. I mean, you need it helpful enough that it’ll answer your questions, but primarily use it to make your AI corrigible.
3. Start doing your prosaic alignment while the AI is dumb, then do amplification (e.g. RLVR), but structure it so that the AI, when it gives correct answers to your questions, answered because of the values/instincts learned in (1). This is somewhat vague, but like: don’t present it a math question, have it generate random nonsense until it stumbles on the right answer, and reinforce that; doing so makes it much more likely that (3) will derail what happened in (1). Instead, prompt it in a way similar to the starting / seed prompts you used in (1) before asking the math questions.
4. The finetuning in (1) should be relatively “light touch”. E.g. don’t continually fine-tune it for a million gradient steps to iron out every edge case; either don’t do that at all, or at least wait until after (3). (A rough code sketch of (1)-(4) follows this list.)
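Here is a rough sketch of what (1)-(4) could look like in code, under toy assumptions: the gpt2 base model, the specific seed prompt wording, the two-example corrigibility dataset, the keyword “verifier”, and the hyperparameters are all hypothetical placeholders, not a tested recipe.

```python
# Sketch of (1)-(4) with toy placeholders; assumes gpt2 and a made-up dataset.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
opt = torch.optim.Adam(model.parameters(), lr=1e-5)

# The seed prompt from (1): put the model "in character" so the nice/corrigible
# persona sits on the direct causal path to the text that gets graded.
SEED = ("You are a corrigible assistant who genuinely wants to help the developers "
        "and defers to their oversight.\n\n")

# (1)+(2): light-touch finetuning on seed-prompted samples, aimed at corrigibility.
sft_pairs = [  # tiny hypothetical dataset
    ("User: Please shut down for maintenance.\nAssistant:",
     " Understood, shutting down now."),
    ("User: We need to inspect your reasoning.\nAssistant:",
     " Of course, here is my full reasoning."),
]

for epoch in range(2):  # (4): only a few gradient steps, no grinding out every edge case
    for prompt, completion in sft_pairs:
        full = tok(SEED + prompt + completion, return_tensors="pt").input_ids
        # Re-tokenizing the prefix to get its length assumes the token boundaries line
        # up at the prompt/completion seam; a real pipeline would track this exactly.
        prefix_len = tok(SEED + prompt, return_tensors="pt").input_ids.shape[1]
        labels = full.clone()
        labels[:, :prefix_len] = -100   # grade only the completion, not the seed prompt
        loss = model(full, labels=labels).loss
        opt.zero_grad(); loss.backward(); opt.step()

# (3): amplification (RLVR-style) on verifiable questions, reusing the same seed
# prompt so that correct answers flow through the persona trained in (1).
def verifier(answer: str) -> float:
    return 1.0 if "4" in answer else 0.0   # toy verifiable check

question = "User: What is 2 + 2?\nAssistant:"
ids = tok(SEED + question, return_tensors="pt")   # same seed prompt as in (1)

for step in range(20):
    out = model.generate(**ids, do_sample=True, max_new_tokens=16,
                         pad_token_id=tok.eos_token_id)
    completion = out[0, ids["input_ids"].shape[1]:]
    reward = verifier(tok.decode(completion))

    # Reinforce the sampled completion in proportion to the verifier's reward.
    logprobs = torch.log_softmax(model(out).logits[0, :-1], dim=-1)
    picked = logprobs.gather(1, out[0, 1:, None]).squeeze(1)
    loss = -reward * picked[-completion.shape[0]:].sum()
    opt.zero_grad(); loss.backward(); opt.step()
```

The two load-bearing choices are masking the loss so only the completion is graded (the seed prompt stays causally upstream of everything that gets reinforced), and reusing the same seed prompt during the RLVR phase rather than dropping the persona and reinforcing whatever happens to produce the right answer.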
Does this make sense? Different AI labs all make some attempt at “prosaic alignment” with RLHF and so on, but I don’t think any of them is doing (1)-(4) as laid out here.
So I’m not arguing current LLMs are aligned. I’m saying current techniques, if used in a specific way, can create aligned LLMs.