One thing I forgot to mention is that there are reasons to expect “we are likely to build smart consequentialists (that e.g. maximize sum_t V_t0(s_t))” to be true that are stronger than “look at current AIs” / “this is roughly aligned with commercial incentives”, such as the ones described by Evan Hubinger here.
TL;DR: alignment faking may be more sample efficient / easier to learn / more efficient at making loss go down than internalizing what humans want, so AIs that fake alignment may be selected for.
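For concreteness, here is one way to write out the objective I gestured at above, under my own gloss of the notation (this is my reading, not Hubinger's formalism): $s_t$ is the world state at time $t$, $V_{t_0}$ is the agent's value function as fixed at some reference time $t_0$, and $\pi$ is the agent's policy.

$$\max_{\pi} \; \mathbb{E}_{\pi}\!\left[\sum_{t \ge t_0} V_{t_0}(s_t)\right]$$

The point of writing it this way is that the values being summed are evaluated with the value function frozen at $t_0$, i.e. the agent pursues its current values into the future rather than deferring to whatever values it (or its overseers) would later endorse.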
Carlsmith is a good review of the “Will AIs be smart consequentialists?” arguments up to late 2023. I think the conversation has progressed a little since then, but not massively.