Okay, I partly agree with this. But I’m not saying current LLMs are aligned. I’m explaining how techniques from the same class as the ones we use today could be used to create aligned agents, if implemented correctly.
Oops. Then I don’t get what techniques you are proposing. Like, most techniques that claim to work for superintelligence / powerful agents also claim to work in some more limited manner for current agents (in part because most techniques assume that no phase change occurs between now and then, or that the phase change doesn’t affect the technique ⇒ if the technique stops working, it does so gradually, and one can do empirical studies on current models).
And while there certainly is some loss function or initial random seed for current techniques that gives you aligned superintelligence, there’s no way to find it.
When I say “current techniques” I mean the recipe I gave here:
So, basically all modern techniques for training an LLM to have a certain skill or proclivity consist of (a minimal code sketch follows this list):

1. Defining some metric that determines how much you like the LLM’s output.
2. Sampling from the LLM.
3. Making a local update to your model’s parameters so that the token outputs you “liked” according to the metric become more likely.
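Here is the shape of that loop as a minimal REINFORCE-style sketch, purely my own illustration rather than anything from the post: the model name, prompt, and `reward_fn` are placeholder stand-ins for whatever metric and data you actually care about.

```python
# Minimal sketch of the recipe: (1) a metric, (2) sample, (3) a local update
# that makes "liked" tokens more likely. Placeholders throughout.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; any causal LM works
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
opt = torch.optim.Adam(model.parameters(), lr=1e-5)

def reward_fn(text: str) -> float:
    """(1) A toy stand-in for 'how much you like the output'."""
    return float("helpful" in text.lower())

prompt = "User: How do I sort a list in Python?\nAssistant:"
inputs = tok(prompt, return_tensors="pt")
prompt_len = inputs["input_ids"].shape[1]

# (2) Sample from the LLM.
with torch.no_grad():
    seq = model.generate(**inputs, do_sample=True, max_new_tokens=40,
                         pad_token_id=tok.eos_token_id)
completion = tok.decode(seq[0][prompt_len:], skip_special_tokens=True)

# (3) Local parameter update: reward-weighted log-likelihood of the sampled
# completion (vanilla REINFORCE), so "liked" outputs become more likely.
reward = reward_fn(completion)
logits = model(seq[:, :-1]).logits                 # logits[t] predicts seq[t+1]
logprobs = torch.log_softmax(logits, dim=-1)
token_logprobs = logprobs.gather(-1, seq[:, 1:].unsqueeze(-1)).squeeze(-1)
completion_logprob = token_logprobs[0, prompt_len - 1:].sum()  # completion only
loss = -reward * completion_logprob
opt.zero_grad()
loss.backward()
opt.step()
```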
There are tons of “free parameters” in how you implement such a recipe: e.g. constitutional AI, deliberative alignment, SFT, RLHF (with PPO or DPO or GRPO), whatever. And for each of these there are still more free parameters in how exactly you implement it, and most importantly: how the data is generated.
I don’t think most of them (including the SOTA methods used by AI labs) yield alignment. I tried to explain in the post which exact implementation of this class of techniques can lead to alignment, but the short version is:
1. Prompt the AI so that the latent circuitry simulating the features you want is involved in the most direct causal pathway by which the model generates the output the finetuning objective rewards, when it’s run on the training samples.
   - “Prompt it so that it’s trying its best to simulate a nice, helpful character before you feed it the finetuning samples / have it output the text you’ll grade it on” is a simplified, but not that unfaithful, version.
2. Don’t use this to make a nice, value-aligned AI. I mean, you need it helpful enough that it’ll answer your questions, but primarily use it to make your AI corrigible.
3. Start doing your prosaic alignment while the AI is dumb, then do amplification (e.g. RLVR), but structure it so that when the AI gives correct answers to your questions, it does so because of the values/instincts learned in (1).
   - This is somewhat vague, but: don’t present it with a math question, have it generate random nonsense until it stumbles on the right answer, and reinforce that; if you do, there is a much higher probability that (3) will derail what happened in (1). Instead, prompt it in a way similar to the starting / seed prompts you used in (1) before asking the math questions.
4. The finetuning in (1) should be relatively “light touch”. E.g. don’t fine-tune it for a million gradient steps to iron out every edge case; either don’t do that, or at least wait until after (3). (A rough code sketch of (1)-(4) follows this list.)
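To make (1)-(4) a bit more concrete, here is a rough, hypothetical sketch of how the data could be shaped; SEED_PROMPT, the helper names, and the step budget are placeholders of mine, not anything specified above.

```python
# Hypothetical illustration of steps (1)-(4); every name and number here is
# a placeholder, not a prescription from the post.

SEED_PROMPT = (
    "You are a helpful, honest assistant. You defer to your operators and "
    "never resist correction or shutdown.\n\n"
)  # (2): the emphasis is on corrigibility rather than a full value load

def make_phase1_example(sample_prompt: str, graded_output: str) -> dict:
    # (1): put the seed prompt in front of every finetuning sample, so the
    # circuitry simulating this character sits on the most direct causal
    # path to the output that actually gets graded.
    return {"prompt": SEED_PROMPT + sample_prompt, "completion": graded_output}

def make_phase3_prompt(question: str) -> str:
    # (3): during amplification (e.g. RLVR), reuse prompts similar to the
    # phase-(1) seed prompts instead of bare questions, so correct answers
    # are produced *because of* the instincts learned in (1).
    return SEED_PROMPT + question

# (4): keep the phase-(1) finetuning "light touch": a small gradient-step
# budget, no grinding out every edge case until after (3).
PHASE1_MAX_GRADIENT_STEPS = 500  # illustrative, deliberately small
```

The point of reusing the same seed prompt in (3) is that RL reinforces whichever circuitry actually produced the correct answer, so you want that circuitry to be the character set up in (1) rather than whatever random pathway stumbled onto the answer.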
Does this make sense? The different AI labs all make some attempt at “prosaic alignment” with RLHF and so on, but I don’t think any of them is doing (1)-(4) here.
So I’m not arguing that current LLMs are aligned. I’m saying that current techniques, if used in a specific way, can create aligned LLMs.