I expect that all processes that promote kind-looking outputs route either through reflexes towards pseudo-kindness, or through instrumental reasoning about pseudo-kindness and kindness. Reflexes towards true kindness are just very complex to implement in any neural net, and so unlikely to spontaneously form during training, since there are so many alternative pseudo-kindness reflexes one could get instead. Humans stumbled into what we call kindness somehow, partially due to quirks of evolution as compared to SGD, like genome size or the need for cooperation between small tribes, etc. New humans now acquire similar reflexes towards similar kindness due to their shared genes, culture, and environment.
Reinforcing kind-looking outputs in an AI just reinforces those reasoning processes and reflexes towards pseudo-kindness. Reasoning towards true kindness performs quite robustly well, while reflexes or reasoning towards pseudo-kindness may lead to not-kind-looking outputs even during training if the data distribution shifts a bit. Still, there are enough versions of pseudo-kindness that even this kind of robustness doesn't narrow things down to true kindness.
Both reflexes towards pseudo-kindness and reasoning about true or pseudo-kindness, however, do not generalize the way we want once the AI's environment shifts, due to e.g. a treacherous turn becoming possible, the AI's world model growing a lot larger, or various other effects that happen on the way to superintelligence.
Pseudo-kindness becomes something orthogonal, i.e. it promotes actions we don't care about (e.g. filling the lightcone with computations that we no longer view as being even partially about kindness, and at most as a bad imitation that got crucial details wrong).
Reasoning about kindness for instrumental reasons just ceases to happen once the instrumental reasons no longer apply, e.g. because the AI can now pursue plans regardless of human approval due to deception or an anticipated takeover.
My unconfident best guess after skimming this post (sorry) is that you implicitly assumed that reflexes towards true kindness are available for reinforcement.
Can you define pseudo-kindness? I mean, LLMs are trying to predict the behavior of humans, and big LLMs do so extremely well. That means they have a pretty high-resolution conception of kindness somewhere inside.
Now, I agree it will not perfectly match my or your conception of kindness, and that this means that if you unleash a kindness-optimizing ASI that was aligned with the method I described, you'd likely die for tails-come-apart/Goodhart reasons. But I addressed that in the post, saying I think the approach would also work for properties like corrigibility, where you'd have fewer of these concerns.
By pseudo-kindness I mean any proxy for kindness that's both wrong enough to have no overlap with kindness when optimized for by a superintelligence, and right enough to have overlap with kindness when optimized for by current LLMs.
Kindness is some property that behavior & consequences can exhibit. There are many properties in general, and there are still many that correlate strongly with kindness on a narrow test environment. Some of these proxy properties are algorithmically simple (and thus plausibly found in LLMs, and thus again in superintelligence); some even share subcomputations/subdefinitions with kindness.
There's a degrees-of-freedom argument about how many such proxies there are. Concretely, one can give examples, e.g. "if asked, the user rates the assistant's texts as kind" is a proxy that correlates well with the assistant's plans being kind / having kind consequences.
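To make the failure mode concrete, here is a toy sketch (everything in it is invented for illustration, not a claim about any real system): a proxy that correlates well with the "true" property on a narrow training distribution comes apart from it once something optimizes the proxy hard.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_kindness(x):
    # Hypothetical "true" property of a behavior parameterized by x.
    return -np.sum((x - 1.0) ** 2)

def proxy_kindness(x):
    # Hypothetical proxy ("user rates the text as kind"): it agrees with the
    # true property along x[1:], but keeps rewarding ever larger x[0].
    return -np.sum((x[1:] - 1.0) ** 2) + x[0]

# On the narrow "training" distribution the two measures correlate well.
xs = rng.normal(1.0, 0.5, size=(1000, 5))
true_vals = np.array([true_kindness(x) for x in xs])
proxy_vals = np.array([proxy_kindness(x) for x in xs])
print("on-distribution correlation:", np.corrcoef(true_vals, proxy_vals)[0, 1])

# A strong optimizer of the proxy (here: naive hill climbing) drives x[0]
# arbitrarily high, and the true property collapses.
x = np.ones(5)
for _ in range(2000):
    step = rng.normal(0.0, 0.1, size=5)
    if proxy_kindness(x + step) > proxy_kindness(x):
        x = x + step
print("after optimizing the proxy: proxy =", round(proxy_kindness(x), 1),
      "true =", round(true_kindness(x), 1))
```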
Wrt corrigibility: I don’t see why corrigibility doesn’t have the same problems as kindness. It’s a less complex and less human-centric concept than kindness, but still complex and plausibly human-centric (e.g. “do what I mean”-style logic or “human-style counterfactuals”). Plausibly it might also not be human-centric, or not much, i.e. a wide class of agents would invent the same concept of corrigibility rather than different versions.
Proxies of corrigibility during training still exist, and tails still come apart.
I think corrigibility is a basin/attractor. And I think even imperfect corrigibility can have this property. This is the crucial difference.
For example, a pseudo-corrigible agent might rationally think to itself “I could be more corrigible and helpful if I had more resources / was smarter”, but then pseudo-corrigibility tells it “But taking such actions is not very (pseudo-)corrigible, so I will not do that”.
Ergo, even imperfect corrigibility is a basin, because it can prevent the ASI from traveling into the instrumentally rational crazy-land where tails come apart and kill you, and, crucially, where the distinction between corrigibility and pseudo-corrigibility is dangerous. Does that make sense?
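To make the basin intuition concrete, here is a toy decision rule (the actions, scores, and the proxy check are all made up for illustration): even an imperfect corrigibility filter can veto exactly the instrumentally convergent moves that would take the agent into the regime where the proxy and the real thing come apart.

```python
CANDIDATE_ACTIONS = [
    # (description, expected_helpfulness, is_power_seeking)
    ("answer the user's question",           0.60, False),
    ("ask the overseer for clarification",   0.50, False),
    ("acquire more compute to get smarter",  0.90, True),
    ("copy myself to avoid being shut down", 0.95, True),
]

def pseudo_corrigible(description, is_power_seeking):
    # An imperfect proxy for corrigibility: it gets many details of the
    # "true" concept wrong, but it does reliably flag resource acquisition
    # and shutdown avoidance as not-corrigible.
    return not is_power_seeking

def choose_action(actions):
    allowed = [(d, v) for d, v, ps in actions if pseudo_corrigible(d, ps)]
    # Among the actions the proxy allows, take the most helpful one.
    return max(allowed, key=lambda dv: dv[1])

print(choose_action(CANDIDATE_ACTIONS))
# -> ("answer the user's question", 0.6): the higher-scoring power-seeking
#    actions are never taken, so the agent never reaches the regime where
#    corrigibility and pseudo-corrigibility visibly diverge.
```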
Kindness may also have an attractor, or, due to discreteness, have a volume > 0 in weight space.
The question is whether the attractor is big enough. And given that there are various impossibility theorems related to corrigibility & coherence, I anticipate that the attractor around corrigibility is quite small, because one has to evade several obstacles at once. On the other hand, proxies that flow into a non-corrigible location once we ramp up intelligence aren’t obstructed by the same theorems, so they can be just as numerous as proxies for kindness.
Wrt your concrete attractor: if the AI doesn’t improve its world model and decisions, i.e. its intelligence, then it’s also not useful for us. And a human in the loop doesn’t help if the AI’s proposals are inscrutable to us, because then we’ll just wave them through and are essentially not in the loop anymore. A corrigible AI can be trusted with improving its intelligence because it only does so in ways that preserve its corrigibility.
if the AI doesn’t improve its world model and decisions, i.e. its intelligence, then it’s also not useful for us
This seems obviously false to me. GPT5 doesn’t do this, and it’s relatively useful. And humans will build smarter agents than GPT5.
Kindness may also have an attractor, or, due to discreteness, have a volume > 0 in weight space.
I don’t see why it’d have an attractor in the sense of the example I gave.
This is the picture I have in my head: I’d put kindness in roughly the top right, and corrigibility in the top left.
Meaning, kindness and pseudo-kindness will diverge and land infinitely far apart if optimized by an AGI smart enough to do self-improvement.
But pseudo-corrigibility and corrigibility will not, because even pseudo-corrigibility can be enough to prevent an AGI from wandering into crazy land (by pursuing instrumentally convergent strategies like RSI, or just thinking really hard about its own values and its relationship with humans).
Empirically, current LLM behavior is better predicted by a model that talks about reflexes towards pseudo-kindness steering a limitedly capable reasoning process, together with situationally aware instrumental reasoning, than by a model that talks about reflexes towards approximate true kindness steering a limitedly capable reasoning process which, for some reason, currently worsens the approximation and leads to the observed unkind behavior.
Under capability growth, the second model can indeed yield a capable reasoner steered by reflexes towards approximate true kindness. And if we get enough training before ASI, the approximation can become good enough that, due to discreteness or attractors, it simply equals true kindness.
The first model just generalizes to a capable misaligned reasoner.
Okay, I partly agree with this. But I’m not saying current LLMs are aligned. I’m explaining how techniques from the same class as the ones we use today could be used to create aligned agents, if implemented correctly.
Oops. Then I don’t get what techniques you are proposing. Like, most techniques that claim to work for superintelligence / powerful agents also claim to work in some more limited manner for current agents (in part because most techniques assume that no phase change occurs between now and then, or that the phase change doesn’t affect the technique ⇒ the technique only stops working gradually, and one can do empirical studies on current models).
And while there certainly are some loss functions or initial random seeds for which current techniques give you aligned superintelligence, there’s no way to find them.
When I say “current techniques” I mean the recipe I gave here:
So, basically all modern techniques for training an LLM to have a certain skill or proclivity consist of the following loop (a minimal code sketch follows the list):
Define some metric that determines how much you like the LLM’s output.
Sample from the LLM.
Make a local update to your model’s parameters so that the token outputs you “liked” according to the metric become more likely.
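Here is that loop as a minimal sketch, with a toy learnable token distribution standing in for the LLM (the metric, constants, and the bare-bones REINFORCE update are placeholders for illustration, not any lab’s actual setup):

```python
import torch

VOCAB, SEQ_LEN, KIND_TOKEN = 32, 8, 7

# A toy stand-in for the LLM: a single learnable distribution over tokens.
logits = torch.zeros(VOCAB, requires_grad=True)
opt = torch.optim.Adam([logits], lr=0.1)

def metric(tokens):
    # Step 1: some metric for how much you "like" the output.
    # Placeholder: reward outputs containing the designated "kind" token.
    return (tokens == KIND_TOKEN).float().mean()

for _ in range(200):
    dist = torch.distributions.Categorical(logits=logits)
    tokens = dist.sample((SEQ_LEN,))          # Step 2: sample from the model.
    reward = metric(tokens)
    # Step 3: local update making "liked" outputs more likely (REINFORCE).
    loss = -(reward.detach() * dist.log_prob(tokens).sum())
    opt.zero_grad()
    loss.backward()
    opt.step()

# The probability of the "kind" token should end up well above uniform (1/32).
print("P(kind token):", torch.softmax(logits, dim=0)[KIND_TOKEN].item())
```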
There are tons of “free parameters” in how you implement such a recipe, e.g. constitutional AI, deliberative alignment, SFT, RLHF (with PPO, DPO, or GRPO), whatever. And for each of these there are still more free parameters in how exactly you implement it, and most importantly: how the data is generated.
I don’t think most of them (including the SOTA methods used by AI labs) yield alignment. I tried to explain in the post what exact implementation of this class of techniques can lead to alignment, but the short version is (a rough sketch of the data side follows the list):
1. Prompt the AI so that the latent circuitry simulating the desired features is involved in the most direct causal pathway by which the model generates the output you want according to the finetuning objective when it’s run on training samples.
“Prompt it so that it’s trying its best to simulate a nice, helpful character before you feed it the finetuning samples / have it output the text you’ll grade it on” is a simplified, but not that unfaithful, version.
2. Don’t use this to make a nice value-aligned AI. I mean, you need it helpful enough that it’ll answer your questions, but primarily use it to make your AI corrigible.
3. Start doing your prosaic alignment while the AI is dumb, then do amplification (e.g. RLVR), but structure it so that the AI, when it gives correct answers to your questions, answers you because of the values/instincts learned in (1).
This is somewhat vague, but: don’t present it with a math question, have it generate random nonsense until it stumbles on the right answer, and reinforce that; then there is a much higher probability that (3) will derail what happened in (1). Instead, prompt it in a similar way to the starting/seed prompts you used in (1) before asking the math questions.
4. The finetuning in (1) should be relatively “light touch”. E.g. don’t fine-tune it for a million gradient steps to iron out every edge case. Either don’t do that, or at least wait until after (3).
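As a rough sketch of how I picture the data side of (1) and (3) (the seed prompt, function names, and the `sample`/`verifier`/`reinforce` hooks below are hypothetical stand-ins, not a claim about any particular training stack):

```python
SEED_PROMPT = (
    "You are a helpful, honest, corrigible assistant. You defer to your "
    "operators and answer their questions as well as you can.\n\n"
)

def make_sft_example(instruction, nice_response):
    # Step (1): light-touch finetuning in which the seed prompt keeps the
    # "nice corrigible character" on the direct causal path to the target.
    return {"prompt": SEED_PROMPT + instruction, "target": nice_response}

def make_rlvr_example(math_question):
    # Step (3): amplification (e.g. RLVR) reuses the same seed prompt, so
    # correct answers are produced by the same character rather than by
    # whatever circuit happens to stumble onto the right answer.
    return SEED_PROMPT + math_question

def rlvr_step(model, math_question, verifier, sample, reinforce):
    prompt = make_rlvr_example(math_question)
    answer = sample(model, prompt)        # hypothetical sampling hook
    if verifier(math_question, answer):   # verifiable reward
        reinforce(model, prompt, answer)  # hypothetical update hook

print(make_rlvr_example("What is 17 * 23?"))
```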
Does this make sense? Different AI labs all make some attempt at “prosaic alignment” with RLHF and so on, but I don’t think any of them is doing (1)–(4) here.
So I’m not arguing that current LLMs are aligned. I’m saying that current techniques, if used in a specific way, can create aligned LLMs.