This feels like the kind of failure that would show up in evals before the AI takes catastrophic actions.

I guess the AI could be "aligned" in some sense but not corrigible or truthful, instrumentally hide its overconfidence, and then take catastrophic actions.