I’m also fairly worried about the case where the initial attempts at “hit them over the head with RL until they learn to be very biased against accurate assessments of risk from scaling capabilities” actually work.
Then is there any difference between your case and the one described in the AI-2027 goals forecast?