To some extent, but keep in mind that in another sense, the behavioural objective of maximising paperclips is totally consistent with playing along with the base objective for a while and then defecting. So I’m not sure the behaviour/mesa- distinction alone does the work you want it to do even in that case.
To some extent, but keep in mind that in another sense, the behavioural objective of maximising paperclips is totally consistent with playing along with the base objective for a while and then defecting. So I’m not sure the behaviour/mesa- distinction alone does the work you want it to do even in that case.
Agreed (haven’t thought about that).