Quick take: it’s focused on interpretability as a way to solve prosaic alignment, ignoring the fact that prosaic alignment clearly does not scale to the types of systems they are actively planning to build. (And it seems to actively embrace the fact that interpretability is a capabilities advantage in the short term, while presenting it as a safety measure, as if the two were not at odds with each other while the field is caught up in racing dynamics.)
Why do you believe this?
(FWIW I think it’s foolish that all (?) frontier companies are all-in on prosaic alignment, but I am not convinced that it “clearly” won’t work.)
Because they are all planning to build agents subject to optimization pressure, and RL-type failures apply whenever you build RL systems, even when they sit on top of LLMs.