Seth Herd comments on (Seemingly) Aligned AI is the most dangerous AI

Seth Herd 22 Apr 2026 22:24 UTC
2 points
0
I think you mean that seemingly-aligned AI is the most dangerous kind? Big difference. This issue is discussed frequently, although I don’t have a really good reference for it.
- Shivam 23 Apr 2026 13:17 UTC
  1 point
  0
  Yes and no. Yes in principle, but no in practice. The point I am making is that, with regard to all possible measures, seemingly aligned and aligned models are indistinguishable. I agree that many people are aware of this and that it has been talked about for a long time on this forum, but I am surprised that evidence of apparent alignment is still considered progress towards the latter by a lot of people, whereas I see measurable evidence of the latter as almost impossible to obtain.
  - Seth Herd 23 Apr 2026 15:17 UTC
    2 points
    1
    I got that. I’m pointing out that you’re probably being downvoted because your title is quite inaccurate.
    - Shivam 23 Apr 2026 17:30 UTC
      1 point
      0
      Thanks for pointing that out and for the engagement despite that. I have changed the title and added a short note on the edit.
      
      What really bothers me is the trajectory we seem to have chosen: continue to scale the models and monitor them for misalignment. This plan has some obvious flaws:
      
      1. There is no verifiable way to know whether we got an aligned or a merely seemingly aligned AI as a result, since evaluations can’t distinguish between the two.
      
      2. White-box techniques seem to be pretty limited currently, and it is uncertain whether we will get distinguishable signals if the most reliable techniques are developed only past a certain (unknown) capability mark.
      
      3. If we continue moving towards more automated pipelines, because we feel it is safe, we won’t be able to limit catastrophes.
      
      I don’t see much pushback from a technical perspective against this trajectory. For people starting out in AI safety from a technical perspective, I don’t see many suggestions that challenge this trajectory and propose alternatives. I see there is Scientist AI from Yoshua Bengio, but it doesn’t seem to be discussed as much. I see theoretical work with Simplex and SLT, but it still seems to be at an early stage.
      
      And since most beginner work is about replicating and extending current work, it creates a chain reaction, and as a result the majority of the work ends up following the same trajectory.