The outcome-steering effect of a future AI will lose any discernible “flavor” or “personality” leaking from the details of its implementation that once explained why the implementation wasn’t robustly superhuman. Even attempts to probe the AI’s “mechanistically interpretable conditional behavior” will offer no predictive power beyond modeling the AI as an outcome-steering engine.
This isn’t a good argument. For one thing, humans would still have personalities if they could think at 1000x speed. Now, it’s true that Stockfish has less personality than human chess players because it’s optimized for the single goal of winning the game. But even if we assume superintelligences have fixed goals like this, their conditional behavior will still give us clues about what those goals are, just as, if we ran a version of Stockfish that wanted to never lose its queen, we would observe it being extremely protective of its queen (a toy version of this probe is sketched below).
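To make that concrete, here is a minimal sketch of such a probe. It assumes the python-chess library, a toy 3-ply minimax standing in for Stockfish’s real search, and a hand-constructed probe position (the FEN below is illustrative, not from any real game). The two “engines” share the same search and differ only in their evaluation function; the variant penalized for losing its queen should visibly refuse a queen trade that the pure-material variant plays.

```python
import chess

# Values used by the toy evaluator (the king has no exchange value).
PIECE_VALUES = {chess.PAWN: 1, chess.KNIGHT: 3, chess.BISHOP: 3,
                chess.ROOK: 5, chess.QUEEN: 9, chess.KING: 0}


def material(board: chess.Board) -> int:
    """Material balance from White's point of view."""
    score = 0
    for piece in board.piece_map().values():
        value = PIECE_VALUES[piece.piece_type]
        score += value if piece.color == chess.WHITE else -value
    return score


def queen_protective(board: chess.Board) -> int:
    """Material balance plus a huge penalty whenever White's queen is gone."""
    has_queen = any(p.piece_type == chess.QUEEN and p.color == chess.WHITE
                    for p in board.piece_map().values())
    return material(board) + (0 if has_queen else -1000)


def minimax(board: chess.Board, depth: int, evaluate) -> float:
    """Plain fixed-depth minimax: White maximizes, Black minimizes `evaluate`."""
    if board.is_checkmate():
        return -10_000 if board.turn == chess.WHITE else 10_000
    if depth == 0 or board.is_stalemate():
        return evaluate(board)
    scores = []
    for move in list(board.legal_moves):
        board.push(move)
        scores.append(minimax(board, depth - 1, evaluate))
        board.pop()
    return max(scores) if board.turn == chess.WHITE else min(scores)


def best_move(board: chess.Board, depth: int, evaluate):
    """Best first move for White under the given evaluation function."""
    best, best_score = None, float("-inf")
    for move in list(board.legal_moves):
        board.push(move)
        score = minimax(board, depth - 1, evaluate)
        board.pop()
        if score > best_score:
            best, best_score = move, score
    return best, best_score


# Constructed probe position: White's queen is attacked, and 1.Qxd8+ Qxd8
# 2.Rxd8+ trades the queens off to win a whole rook.
probe = chess.Board("3r2k1/4qppp/8/8/3Q4/8/5PPP/2KR4 w - - 0 1")

for name, evaluate in [("pure material", material),
                       ("queen-protective", queen_protective)]:
    move, score = best_move(probe, depth=3, evaluate=evaluate)
    print(f"{name:17s} -> {probe.san(move)}  (score {score})")

# The material engine should cash in with Qxd8+; the queen-protective engine
# should instead retreat its attacked queen, forgoing the win of a rook.
```

The point isn’t the chess: it’s that a purely behavioral probe like this recovers the difference in objectives from the agent’s conditional behavior.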
If alignment science keeps up with capability advances, observing whether GPT-8 prefers to save birds or humans would give us more leverage in steering it, because we would know such preferences reflect consistent goals rather than quirks. There will be challenges in keeping up, both in understanding GPT-8 and in getting it to do what we want, but OP hasn’t explained at all how these relate to dealing with a “level separation”.
Humans at 1000x speed would still retain many properties of immature goal engines, full of silly, abstraction-breaking quirks, the same way ENIAC at 1000x speed could still get a literal bug (like a moth) in it. The direction of progress after ENIAC did not point toward ENIAC at 1000x speed.
P.S. Thanks for being the only one so far to engage with my claim.