I expect misalignment rates to be locally linear in intelligence [1]
I disagree! I think treacherous turn things will generally mean it’s very hard to measure misalignment rates before capabilities exceed pivotal thresholds. I honestly have no idea how I would measure misalignment in current models at all (and think current “misalignment” measures tend to be if anything anticorrelated with attributes I care about). I do think this will get a bit easier around critical thresholds, but there is definitely a decent chance we will cross a critical threshold in a single model release.
That said, in many worlds it is still local, and it’s better than nothing to check for local stuff. I just don’t think it’s a good thing to put tons of weight on.
I’d rather not get into the treacherous turn thing, but I agree there are lots of problems with the current state of alignment evals.