RogerDearnaley comments on The corrigibility basin of attraction is a misleading gloss

RogerDearnaley 30 Jan 2026 23:02 UTC
2 points
“What kind of legible architecture would be enough to give me optimism? The most bare-bones would be interpretability into the beliefs and desires of an AI, and the structural knowledge to verify that those beliefs and desires are the true beliefs and desires of the AI.”
Yes — exactly. What makes you think people aren’t already working on it?