I prepared a response to your ideas in a separate post. I am not sure that the actual reason humans don't become schemers is the interpretability-based Approval Reward itself; it may instead be some combination of humans starting out with Approval Reward-like primitive circuitry (e.g. valuing smiles) and learning to genuinely cooperate with others because they have similar capabilities (e.g. due to being embodied). As for the field of Reward Function Design, I suspect it requires not interpretability, but something like Max Harms' CAST agenda.