By pseudo-kindness I mean any proxy for kindness that is both wrong enough to have no overlap with kindness when optimized for by a superintelligence, and right enough to overlap with kindness when optimized for by current LLMs.
Kindness is some property that behavior & its consequences can exhibit. There are many properties in general, and there are still many that correlate strongly with kindness on a narrow test environment. Some of these proxy properties are algorithmically simple (and thus plausibly found in LLMs, and again in a superintelligence), and some even share subcomputations/subdefinitions with kindness.
There's a degrees-of-freedom argument about how many such proxies there are. Concretely, one can give examples, e.g. "if asked, the user rates the assistant's texts as kind" is a proxy that correlates well with the assistant's plans being kind / having kind consequences.
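A minimal numerical sketch of the "correlates well in the ordinary regime, comes apart under strong optimization" point (toy numbers chosen purely for illustration, not taken from anything above): a proxy that correlates strongly with true kindness over ordinary plans still falls increasingly short of it once plans are selected hard on the proxy.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: each candidate plan has a true kindness score and a
# proxy score ("if asked, the user rates the text as kind") that correlates
# strongly with kindness across ordinary plans.
n = 1_000_000
kindness = rng.normal(size=n)
proxy = 0.9 * kindness + np.sqrt(1 - 0.9**2) * rng.normal(size=n)
print("correlation over ordinary plans:", round(np.corrcoef(kindness, proxy)[0, 1], 3))

# "Optimization" here just means picking the proxy-best plan out of a candidate
# pool; the larger the pool, the stronger the selection pressure, and the
# further the proxy-picked plan lags behind direct selection on kindness.
for pool_size in (100, 10_000, 1_000_000):
    idx = rng.choice(n, size=pool_size, replace=False)
    by_proxy = idx[np.argmax(proxy[idx])]
    by_kindness = idx[np.argmax(kindness[idx])]
    print(f"pool {pool_size:>9}: kindness of proxy-picked plan = {kindness[by_proxy]:.2f},"
          f" kindness of kindness-picked plan = {kindness[by_kindness]:.2f}")
```

This only illustrates the weaker "tails come apart" claim, of course, not the stronger claim above that a superintelligence optimizing the proxy retains no overlap with kindness at all.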
Wrt corrigibility: I don't see why corrigibility doesn't have the same problems as kindness. It's a less complex and less human-centric concept than kindness, but still complex and plausibly human-centric (e.g. "do what I mean"-style logic or "human-style counterfactuals"). Plausibly it might also not be human-centric, or not much, i.e. a wide class of agents would invent the same concept of corrigibility rather than different versions.
Proxies for corrigibility still exist during training, and the tails still come apart.
I think corrigibility is a basin/attractor, and even imperfect corrigibility can have this property. This is the crucial difference.
For example, a pseudo-corrigible agent might rationally think to itself, "I could be more corrigible and helpful if I had more resources / were smarter", but then its pseudo-corrigibility tells it, "Taking such actions is not very (pseudo-)corrigible, so I will not do that."
Ergo, even imperfect corrigibility is a basin, because it can prevent the ASI from traveling into the instrumentally rational crazy-land where the tails come apart and kill you, and, crucially, where the distinction between corrigibility and pseudo-corrigibility becomes dangerous.
Does that make sense?
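A tiny sketch of that intuition (hypothetical action set and scores, not anyone's actual proposal): the agent ranks actions by a possibly mis-specified helpfulness score, but even an imperfect corrigibility check that vetoes the self-empowering class of actions keeps it out of the regime where the mis-specification would matter most.

```python
from dataclasses import dataclass

@dataclass
class Action:
    name: str
    pseudo_helpfulness: float  # proxy score the agent actually optimizes
    self_empowering: bool      # would it expand the agent's own capability/resources?

# Hypothetical action set, purely illustrative.
ACTIONS = [
    Action("answer the user's question directly",   1.0, False),
    Action("ask a clarifying question",             0.8, False),
    Action("acquire more compute 'to help better'", 3.0, True),
    Action("rewrite own objective / self-improve",  5.0, True),
]

def pseudo_corrigible_choice(actions: list[Action]) -> Action:
    # Even an imperfect corrigibility criterion keeps the agent in the basin,
    # as long as it reliably vetoes the self-empowering action class.
    allowed = [a for a in actions if not a.self_empowering]
    return max(allowed, key=lambda a: a.pseudo_helpfulness)

print(pseudo_corrigible_choice(ACTIONS).name)  # -> answer the user's question directly
```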
Kindness also may have an attractor, or due to discreteness have a volume > 0 in weight space.
The question is whether the attractor is big enough. And given that there are various impossibility theorems related to corrigibility & coherence, I anticipate that the attractor around corrigibility is quite small, because one has to evade several obstacles at once. On the other hand, proxies that flow into a non-corrigible location once we ramp up intelligence aren't obstructed by the same theorems, so they can be just as numerous as proxies for kindness.
Wrt your concrete attractor: if the AI doesn't improve its world model and decisions, aka its intelligence, then it's also not useful for us. And a human in the loop doesn't help if the AI's proposals are inscrutable to us, because then we'll just wave them through and are essentially not in the loop anymore. A corrigible AI can be trusted with improving its intelligence because it only does so in ways that preserve the corrigibility.
if the AI doesn’t improve its world model and decisions aka intelligence, then it’s also not useful for us
This seems obviously false to me. GPT5 doesn't do this, and it's relatively useful. And humans will build smarter agents than GPT5.
> Kindness also may have an attractor, or due to discreteness have a volume > 0 in weight space.
I don’t see why it’d have an attractor in the sense of the example I gave.
This is the picture I have in my head. I'd put kindness in the top right and corrigibility in the top left.
Meaning, kindness and pseudo-kindness will diverge and land infinitely far apart if optimized by an AGI smart enough to do self-improvement.
But pseudo-corrigibility and corrigibility will not, because even pseudo-corrigibility can be enough to prevent an AGI from wandering into crazy land (by pursuing instrumentally convergent strategies like RSI, or just thinking really hard about its own values and its relationship with humans).