A metacognitive agent is not really an RL algorithm in the usual sense. To a first approximation, you can think of it as metalearning a policy that approximates (infra-)Bayes-optimality on a simplicity prior. The actual thing is more sophisticated and less RL-ish, but this is enough to understand how it can avoid wireheading.
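To make the "Bayes-optimality on a simplicity prior" picture a bit more concrete, here is a minimal toy sketch. It is not the metacognitive agent framework itself. Everything in it is an assumption made for illustration: the two-hypothesis class `HYPOTHESES`, the made-up description lengths, the prior weight `2^-bits`, and the one-step (myopic) choice of action, where genuine Bayes-optimality would plan over the whole horizon.

```python
from fractions import Fraction

# Hypothetical toy hypothesis class (an assumption for illustration):
# name -> (description length in bits, {action: probability of reward 1}).
HYPOTHESES = {
    "left_good": (1, {"left": Fraction(3, 4), "right": Fraction(1, 4)}),
    "right_good": (2, {"left": Fraction(1, 4), "right": Fraction(3, 4)}),
}


def simplicity_prior(hypotheses):
    """Weight each hypothesis by 2^-(description length), then normalize."""
    weights = {name: Fraction(1, 2 ** bits) for name, (bits, _) in hypotheses.items()}
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}


def posterior(hypotheses, history):
    """Bayesian update of the simplicity prior on (action, reward) pairs."""
    post = simplicity_prior(hypotheses)
    for action, reward in history:
        for name, (_, model) in hypotheses.items():
            p = model[action]
            post[name] *= p if reward == 1 else 1 - p
    total = sum(post.values())
    return {name: w / total for name, w in post.items()}


def bayes_optimal_action(hypotheses, history, actions=("left", "right")):
    """Pick the action with the highest posterior-expected one-step reward."""
    post = posterior(hypotheses, history)

    def expected_reward(action):
        return sum(post[name] * hypotheses[name][1][action] for name in hypotheses)

    return max(actions, key=expected_reward)


if __name__ == "__main__":
    print(bayes_optimal_action(HYPOTHESES, []))
    print(bayes_optimal_action(HYPOTHESES, [("right", 1), ("right", 1)]))
```

With no data, the simpler hypothesis dominates and determines the action. After two rewarded pulls of `right`, the posterior shifts and the choice flips. The point of the toy is only that, before the data has washed the prior out, behavior is determined by the prior.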
Many alignment failure modes (wireheading, self-modification, inner misalignment...) can be framed as “traps”, since they hinge on the non-asymptotic properties of generalization, i.e. on the “prior” or “inductive bias”. Therefore, frameworks that explicitly impose a prior (such as metacognitive agents) are useful for understanding and avoiding these failure modes. (But this has little to do with the OP, the way I see it.)