Amusingly, just after I posted this, Anthropic released “Teaching Claude why,” which has a bunch of information on how they behaviorally align their AIs (or at least on how they iterate on particular behaviors/properties). (Though this doesn’t seem close to sufficient for a reasonably complete understanding.)
LLM-agents are sorta good at tasks when given a lot of examples of those tasks, including on some related out-of-distribution (OOD) tasks. Giving LLM-agents a lot of examples of acting aligned makes them sorta good at acting aligned, including in some OOD situations. My takeaway is that treating alignment as a task is partially effective. Unfortunately, I don’t think that yields any more mechanistic explanation of why LLM-agents are sorta good at tasks, or of whether alignment is a safe or suitable task. Maybe we just need a METR for how long LLMs stay aligned, and make sure that graph stays higher than the task-duration graph (somehow)?
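To make that comparison concrete, here is a minimal sketch of what the check might look like. Everything in it is hypothetical: the model names, the numbers, and the idea of an "aligned horizon" measured in hours are placeholders for whatever an actual METR-style evaluation would produce. The only point is the comparison itself: flag any release where the alignment horizon no longer exceeds the task-duration horizon.

```python
# Illustrative sketch only: all numbers and names below are made up,
# not real METR data. The idea is to track two time-horizon curves per
# model release and flag any release where the "stays aligned" horizon
# drops to or below the "can complete tasks" horizon.

# Hypothetical horizons in hours, keyed by model release.
task_horizon = {"model-v1": 0.5, "model-v2": 2.0, "model-v3": 8.0}
aligned_horizon = {"model-v1": 4.0, "model-v2": 6.0, "model-v3": 7.0}

def alignment_margin(release: str) -> float:
    """Hours of headroom the alignment horizon has over the task horizon."""
    return aligned_horizon[release] - task_horizon[release]

for release in task_horizon:
    margin = alignment_margin(release)
    status = "OK" if margin > 0 else "WARNING: task horizon exceeds alignment horizon"
    print(f"{release}: margin = {margin:+.1f} h ({status})")
```

On these made-up numbers the third release would trip the warning, which is exactly the crossover the "keep that graph higher" suggestion is meant to rule out.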
That sounds appealing if almost nothing ever changes, but we should expect discontinuities from any of the following:
The emergence of a new architecture / training regime
Gaining the ability to self-modify
The results of self-modification
Gaining the ability to permanently elude oversight
The consequences of acting at length with no oversight
Gaining the ability to disempower humanity
Achieving a deep and unrecognizable understanding of any one of many concepts important to humans (ontological shift)
We just don’t understand any of this stuff. Even a perfectly benevolent superintelligent mind may be one digital prion away from wiping us out, and it would be great not to get into that situation in the first place.