This feels like the kind of failure that would show up in evals before the AI takes catastrophic actions.

I guess the AI could be "aligned" in some sense but not corrigible or truthful, instrumentally hide its overconfidence, and then take catastrophic actions.