Vladimir_Nesov comments on Tim Hua’s Shortform

Vladimir_Nesov 20 Dec 2025 9:37 UTC
9 points
Seems pragmatically like a form of misalignment: a propensity for dangerous behavior, including behavior with consequences that are not immediately apparent. Should be easier to fix than misalignment proper, because it’s centrally a capability issue, and so instrumentally convergent to fix for most purposes. The long tail makes it hard to get training signal in both cases, but at least in principle calibration is self-correcting, where values are not. Maintaining overconfidence is like maintaining a lie: all the data from the real world seeks to thwart this regime.

Humans would have a lot of influence on which dangerous projects early transformative AIs get to execute, and human overconfidence or misalignment won’t get fixed with further AI progress. So at some point AIs would get more cautious and prudent than humanity, with humans in charge insisting on more reckless plans than AIs would naturally endorse (this is orthogonal to misalignment on values).