I think you’re getting a lot of pushback because current models do, pretty clearly, sometimes fail to follow either user or developer intent.
To be clear I’m glad for the pushback!! Thank you for engaging.
It’s pretty clear how it misinterpreted (or was mistrained on) dev intent, and then misinterpreted my intent based on that training.
Ouch. And you think it was genuine misinterpretation (so capabilities failure) or specification failure?
I really think nobody has much clue about this at this point, and we should collectively try to get one.
Yeah, I guess I agree there.
This video with Neel Nanda goes into the logic.
which one?
I think they trained it in a way that crippled its capabilities in that area. This was probably the result of rushing and unforeseen consequences of what they chose to focus on for training targets. So some of both.
This video:
https://www.lesswrong.com/posts/4mtqQKvmHpQJ4dgj7/daniel-tan-s-shortform?commentId=RD2BKixd9PbyDnBEc
I got tired and forgot to find it and add the link before sending.