ryan_greenblatt comments on ryan_greenblatt’s Shortform

ryan_greenblatt 8 May 2026 17:28 UTC
LW: 77 AF: 31
28
AF
We don’t know how AIs are aligned.

A somewhat crazy aspect of the current situation is that we have very little confirmed public information about why frontier AIs end up being apparently behaviorally aligned. And more generally, we don’t know what factors in training are most relevant for (behavioral) alignment. Like, what interventions in training result in Anthropic AIs following the constitution or make OpenAI AIs follow the spec? What factors tend to make them follow the constitution/spec less (or cause various specific misaligned behaviors)? It’s not that hard to get an OK sense of what is roughly going on based on speculation, rumors, and non-public info, but this situation results in a much worse public understanding of alignment. I think AI companies should be much more transparent about this. (It’s presumably not in their commercial interest to do this unilaterally, and it’s not obvious that an idealized altruistic AI company should unilaterally release this information if they couldn’t get other AI companies to do the same.)
What links here?
- The current bottleneck is political will, not research by Charbel-Raphaël (11 Jul 2026 21:56 UTC; 301 points)
- ryan_greenblatt 8 May 2026 18:15 UTC
  LW: 34 AF: 15
  10
  AF Parent
  Amusingly, just after I posted this, Anthropic released “Teaching Claude why”, which has a bunch of information on how they behaviorally align their AIs (or at least how they iterate on particular behaviors/properties). (Though this doesn’t seem close to sufficient for a reasonably complete understanding.)
  - Ben Livengood 9 May 2026 3:10 UTC
    3 points
    0
    Parent
    LLM agents are sorta good at tasks when given a lot of examples of those tasks, including some related out-of-distribution (OOD) tasks. Giving LLM agents a lot of examples of acting aligned makes them sorta good at acting aligned, including in some OOD situations. My takeaway is that treating alignment as a task is partially effective. Unfortunately, I don’t think that yields any more mechanistic explanation of why LLM agents are sorta good at tasks, or tells us whether alignment is a safe or suitable thing to treat as a task. Maybe we just need a METR-style measurement of how long LLMs stay aligned, and to make sure that graph stays above the task-duration graph (somehow)?
    - Haiku 9 May 2026 23:13 UTC
      1 point
      0
      Parent
      That sounds appealing if almost nothing ever changes, but we should expect discontinuities due to any or all of the following:
      
      The emergence of a new architecture / training regime
      Gaining the ability to self-modify
      The results of self-modification
      Gaining the ability to permanently elude oversight
      The consequences of acting at length with no oversight
      Gaining the ability to disempower humanity
      Achieving a deep and unrecognizable understanding of any one of many concepts important to humans (ontological shift)
      
      We just don’t understand any of this stuff. Even a perfectly benevolent superintelligent mind may be one digital prion from wiping us out, and it would be great to not get into that situation in the first place.
- Matrice Jacobine 10 May 2026 14:13 UTC
  3 points
  0
  Parent
  It sure would be great if there was a well-funded “AI safety” nonprofit with the goal to be as open as possible about AI research. Anyone considered that?
