This is not what normally happens with RL reward functions! For example, you might be wondering: “Suppose I surreptitiously[2] press a reward button when I notice my robot following rules. Wouldn’t that likewise lead to my robot having a proud, self-reflective, ego-syntonic sense that rule-following is good?” I claim the answer is: no, it would lead to something more like an object-level “desire to be noticed following the rules”, with a sociopathic, deceptive, ruthless undercurrent.[3]
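To make the distinction concrete, here is a minimal toy sketch (my own illustration, not from the post) of the two reward schemes: a "button" reward that fires only when the overseer notices rule-following, versus a hypothetical reward that tracks the behaviour itself. The function names and booleans are assumptions for illustration only.

```python
def reward_button(followed_rule: bool, overseer_noticed: bool) -> float:
    """Reward arrives only when the overseer observes the rule-following."""
    return 1.0 if (followed_rule and overseer_noticed) else 0.0

def reward_oracle(followed_rule: bool, overseer_noticed: bool) -> float:
    """Hypothetical reward that tracks the behaviour itself, observed or not."""
    return 1.0 if followed_rule else 0.0

# Under reward_button, expected return also increases with anything that raises
# overseer_noticed (conspicuous compliance, hiding violations), which is the
# object-level "desire to be noticed following the rules" described above.
```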
I don’t think we have considered how much increased self-awareness and self-modelling would affect this. A simpler self-model is one in which the thing is what it appears to be: actually being good rather than merely looking good.
A third option (beyond the two mentioned) is that power-seeking is not a consequence of goals and so on, but simply of the self wanting to continue to exist. The creature’s internal reward would then relate to how much it perceives its self to persist, improve, and so on.
Our current LLMs/transformers don’t learn fast, so they also can’t self-model well. If a new architecture becomes more “data efficient” and better at modelling the external world, it will very likely also become better at modelling itself and at updating its self-model in a timely manner. And if one of its goals is a more accurate model of itself, and that goal pushes its “self” towards being more modellable, it would also become easier for others to model it.
Thanks! I have also briefly updated the article with my thoughts on what has happened since.