Here is a shard-theory intuition about humans, followed by an idea for an ML experiment that could serve as a proof of concept for its application to RL:

Let’s say I’m a guy who cares a lot about studying math well, studies math every evening, and doesn’t know much about drugs and their effects. Somebody hands me some ketamine and recommends that I take it this evening. I take the ketamine before I sit down to study math, and the math study goes terribly intellectually, but since I am on ketamine I’m having a good time, and credit gets assigned to the ‘taking ketamine before I sit down to study math’ computation. So my policy network gets updated to increase the probability of the computation ‘take ketamine before I sit down to study math.’

HOWEVER my world-model also gets updated, acquiring the new knowledge ‘taking ketamine before I sit down to study math makes math-study go terribly intellectually.’ And if I have a strong enough ‘math study’ value shard, then in light of this new knowledge the ‘math study’ value shard is going to forbid taking ketamine before I sit down to study math. So my ‘take ketamine before sitting down to study math’ exploration resulted in me developing an overall disposition **against** taking ketamine before sitting down to study math, even though the computation ‘take ketamine before sitting down to study math’ was directly **reinforced**! (Because the same act of exploration also resulted in a world-model update that associated the computation ‘take ketamine before sitting down to study math’ with implications that an already-powerful shard opposes.)

This is important, I think, because it shows that an agent can explore relatively freely without being super vulnerable to value-drift, and that you don’t necessarily need complicated reflective reasoning to have (at least very basic) anti-value-drift mechanisms. Since reinforcement is a pretty gradual thing, you can often try an action you don’t know much about, and if it turns out that this action has high reward but also direct implications that your already-existing powerful shards oppose, then the weak shard formed by that single reinforcement pass will be powerless against them.

Now the ML experiment idea:

A game where the agent gets rewarded for (e.g.) jumping high. After the agent gets somewhat trained, we continue training but introduce various ‘powerups’ the agent can pick up that increase or decrease the agent’s jumping capacity. We train a little more, and now we introduce (e.g.) green potions that decrease the agent’s jumping capacity but increase the reward multiplier (so that, on balance, taking one is net positive for expected reward).
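To make the ‘net positive on balance’ condition concrete, here is a minimal Python sketch of the reward structure. Everything in it is an assumption made for illustration: the `Potion` class, the `step_reward` function, the choice of a reward that is simply the multiplier times the jump height, and the specific numbers are all hypothetical and not taken from any existing environment.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Potion:
    """Hypothetical pickup that rescales jump capacity and the reward multiplier."""
    jump_scale: float     # < 1 shrinks jump height, > 1 boosts it
    reward_scale: float   # < 1 shrinks the reward multiplier, > 1 boosts it


def step_reward(base_jump: float,
                reward_multiplier: float,
                potion: Optional[Potion] = None) -> float:
    """Reward for one jump, assumed to be multiplier * height reached."""
    if potion is not None:
        base_jump *= potion.jump_scale
        reward_multiplier *= potion.reward_scale
    return reward_multiplier * base_jump


# Illustrative numbers only: green halves the jump but triples the multiplier.
GREEN = Potion(jump_scale=0.5, reward_scale=3.0)

baseline = step_reward(base_jump=2.0, reward_multiplier=1.0)
with_green = step_reward(base_jump=2.0, reward_multiplier=1.0, potion=GREEN)
```

With these made-up numbers the agent jumps half as high on a green potion yet collects more reward per jump, which is the tension the experiment is meant to probe: the reward signal favours the potion while the thing the agent was originally trained on (jump height) gets worse.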

My weak hypothesis is that even though trying green potions gets a reinforcement event, the agent will avoid green potions after trying them. This is because there’d be a strong ‘avoid things that decrease jumping capacity’ shard already in place that will take charge once the agent learns to associate taking green potions with a decrease in jumping capacity. (Though maybe it’s more complicated: maybe there will be a kind of race between ‘taking green potions’ getting reinforced and the association between taking green potions and a decrease in jumping capacity forming and activating the ‘avoid things that decrease jumping capacity’ shard.)
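The ‘race’ can be caricatured with a tiny toy model. This is only a sketch under strong assumptions of my own, not a claim about how a real RL agent works: the disposition to take a green potion is a single logit, direct reinforcement adds to a `habit` term, the world-model association is a scalar `belief` in [0, 1], and the established shard subtracts `shard_strength * belief` from the logit. Both updates are scaled by the current probability of taking the potion, since an agent that avoids the potion gets neither the reinforcement nor the evidence. All names and parameters are hypothetical.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def simulate_race(trials: int,
                  reinforce_step: float,
                  belief_rate: float,
                  shard_strength: float) -> list:
    """Return P(take green potion) after each trial in the toy two-process model."""
    habit = 0.0    # logit contribution from direct reinforcement of potion-taking
    belief = 0.0   # learned association 'green potion -> lower jump', in [0, 1]
    probs = []
    for _ in range(trials):
        p_take = sigmoid(habit - shard_strength * belief)
        # Expected updates: both only happen to the extent the potion is taken.
        habit += p_take * reinforce_step
        belief += p_take * belief_rate * (1.0 - belief)
        probs.append(sigmoid(habit - shard_strength * belief))
    return probs


# Fast world-model update, strong existing shard, weak reinforcement per try.
shard_wins = simulate_race(trials=20, reinforce_step=0.1,
                           belief_rate=0.5, shard_strength=5.0)

# Slow world-model update, weak existing shard, strong reinforcement per try.
habit_wins = simulate_race(trials=20, reinforce_step=1.0,
                           belief_rate=0.05, shard_strength=2.0)
```

In the first setting the probability of taking the potion falls over the simulated trials, and in the second it rises. Which regime a trained agent would actually land in is exactly what the experiment would have to tell us. The toy model only shows that the outcome can depend on the relative speeds of the two updates and on how strong the pre-existing shard is.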

Another interesting question: what will happen if we introduce (e.g.) red potions that increase the agent’s jumping capacity but decrease the reward multiplier (so that, on balance, taking one is net negative for expected reward)? It seems clear that as the agent takes red potions over and over, the reinforcement process will eventually remove the disposition to take red potions, but would this also start to push the agent towards forming some kind of mental representation of ‘reward’ to model what’s going on? If we introduce red potions first, then do some training, and then introduce green potions, would the experience with red potions make the agent respond differently (perhaps more like a reward maximiser) to trying green potions?

Having a go at extracting some mechanistic claims from this post:

A value x is a policy-circuit, and this policy circuit may sometimes respond to a situation by constructing a plan-grader and a plan-search.

The policy-circuit executing value x is trained to construct <plan-grader, plan-search> pairs that are ‘good’ according to the value x, and this excludes pairs that are predictably going to result in the plan-search Goodharting the plan-grader.

Normally, nothing is trying to argmax value x’s goodness criterion for <plan-grader, plan-search> pairs. Value x’s goodness criterion for <plan-grader, plan-search> pairs is normally just implicit in x’s method for constructing <plan-grader, plan-search> pairs.

Value x may sometimes explicitly search over <plan-grader, plan-search> pairs in order to find pairs that score high according to a grader-proxy to value x’s goodness criterion. However, here too value x’s goodness criterion will be implicitly expressed in the policy-execution level as a disposition to construct a pair <grader-proxy to value x’s goodness criterion, search over pairs> that doesn’t Goodhart the grader-proxy to value x’s goodness criterion.

The crucial thing is that the true, top level ‘value x’s goodness criterion’ is a property of an actor, not a critic.