Seth Herd comments on (Seemingly) Aligned AI is the most dangerous AI

Seth Herd 23 Apr 2026 15:17 UTC
2 points
1
I got that. I’m pointing out that you’re probably being downvoted because your title is quite inaccurate.
- Shivam 23 Apr 2026 17:30 UTC
  1 point
  0
  Thanks for pointing that out and for the engagement despite that. I have changed the title and added a short note on the edit.
  
  What really bothers me is the trajectory we seem to have chosen: continue to scale the models and monitor them for misalignment. This plan has some obvious flaws:
  
  1. We have no verifiable way to know whether the result is an aligned AI or a merely seemingly aligned one, since evaluations can’t distinguish between the two.
  
  2. White-box techniques seem pretty limited currently, and it is uncertain whether we will get distinguishable signals if the most reliable techniques are only developed after models pass a certain (unknown) capability mark.
  
  3. If we keep moving towards more automated pipelines because we feel it is safe, we won’t be able to limit catastrophes.
  
  I don’t see much pushback against this trajectory from a technical perspective. For people starting out in technical AI safety work, I don’t see many suggestions that challenge this trajectory and propose alternatives. There is Scientist AI from Yoshua Bengio, but it doesn’t seem to be discussed as much. There is also theoretical work with Simplex and SLT, but it still seems to be at an early stage.
  
  And since most beginner work is about replicating and extending current work, this creates a chain reaction: the majority of the work ends up following the same trajectory.