I think it’s confusing because we mostly care about the outcome ("we mistakenly think the system is aligned, deploy it, and get killed"), not about the particular mechanism that produces it.

Dumb example: suppose we train a system to report on its own activity. Human raters consistently assign higher reward to more polite reports. In the end, the system learns to produce reports so polite and smooth that human raters have a hard time catching any signs of misalignment in them, and so mistake it for an aligned system.

So we have, on the one hand, a system that is superhumanly good at giving the impression of being aligned, while on the other hand it’s not as if it’s especially strategically aware.