I describe the more formal definition in the post:
‘Actions (or more generally ‘computations’) get an x-ness rating. We define the x shard’s expected utility conditional on a candidate action a as the sum of two utility functions: a bounded utility function on the x-ness of a and a more tightly bounded utility function on the expected aggregate x-ness of the agent’s future actions conditional on a. (So the shard will choose an action with mildly suboptimal x-ness if it gives a big boost to expected aggregate future x-ness, but refuse certain large sacrifices of present x-ness for big boosts to expected aggregate future x-ness.)’
And as I say in the post, we should expect decision-influences matching this definition to be natural and robust only in cases where x is a ‘self-promoting’ property. A property x is ‘self-promoting’ if it is reliably the case that performing an action with a higher x-ness rating increases the expected aggregate x-ness of future actions.
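To make the two-term definition concrete, here is a minimal sketch. The `tanh` squashing, the cap values, and the names `bounded` and `shard_utility` are all illustrative assumptions, not something the post specifies; the only thing taken from the definition is that the future term is more tightly bounded than the present term.

```python
import math


def bounded(value: float, cap: float) -> float:
    """Squash `value` smoothly into (-cap, cap). tanh is an assumed choice."""
    return cap * math.tanh(value / cap)


def shard_utility(
    present_xness: float,
    expected_future_xness: float,
    present_cap: float = 10.0,  # assumed bound on the present-action term
    future_cap: float = 3.0,    # assumed tighter bound on the future term
) -> float:
    """Sum of a bounded utility on the x-ness of the candidate action and a
    more tightly bounded utility on expected aggregate future x-ness."""
    return bounded(present_xness, present_cap) + bounded(
        expected_future_xness, future_cap
    )


# Three hypothetical candidate actions: (x-ness now, expected future x-ness).
baseline = shard_utility(5.0, 0.0)
mild_sacrifice = shard_utility(4.5, 5.0)     # slightly worse now, better later
large_sacrifice = shard_utility(0.0, 1000.0)  # nothing now, huge promise later

print(mild_sacrifice > baseline)   # the shard accepts the mild trade
print(large_sacrifice < baseline)  # the shard refuses the large trade
```

Because the future term saturates at `future_cap`, no promised amount of future x-ness can compensate for giving up more than roughly that much present x-ness, which is the behavior the parenthetical in the quoted definition describes.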
Yep! Or rather arguing that from a broadly RL-y + broadly Darwinian point of view ‘self-consistent ethics’ are likely to be natural enough that we can instill them, sticky enough to self-maintain, and capabilities-friendly enough to be practical and/or survive capabilities-optimization pressures in training.
This brings up something interesting: it seems worthwhile to compare the internals of a ‘misgeneralizing,’ small-n agent with those of large-n agents and check whether or not there seems to be a phase transition in how the network operates internally.
I’d maybe point the finger more at the simplicity of the training task than at the size of the network? I’m not sure there’s strong reason to believe the network is underparameterized for the training task. But I agree that drawing lessons from small-ish networks trained on simple tasks requires caution.
I would again suggest a ‘perceptual’ hypothesis regarding the subtraction/addition asymmetry. We’re adding a representation of a path where there was no representation of a path (creates illusion of path), or removing a representation of a path where there was no representation of a path (does nothing).
No, but I hope to have a chance to try something like it this year!
The main reason is that different channels that each code cheese locations (e.g. channel 42, channel 88) seem to initiate computations that each encourage cheese-pursuit conditional on slightly different conditions. We can think of each of these channels as a perceptual gate to a slightly different conditionally cheese-pursuing computation.
Having a go at extracting some mechanistic claims from this post:
A value x is a policy-circuit, and this policy-circuit may sometimes respond to a situation by constructing a plan-grader and a plan-search.
The policy-circuit executing value x is trained to construct <plan-grader, plan-search> pairs that are ‘good’ according to the value x, and this excludes pairs that are predictably going to result in the plan-search Goodharting the plan-grader.
Normally, nothing is trying to argmax value x’s goodness criterion for <plan-grader, plan-search> pairs. Value x’s goodness criterion for <plan-grader, plan-search> pairs is normally just implicit in x’s method for constructing <plan-grader, plan-search> pairs.
Value x may sometimes explicitly search over <plan-grader, plan-search> pairs in order to find pairs that score high according to a grader-proxy to value x’s goodness criterion. However, here too value x’s goodness criterion will be implicitly expressed in the policy-execution level as a disposition to construct a pair <grader-proxy to value x’s goodness criterion, search over pairs> that doesn’t Goodhart the grader-proxy to value x’s goodness criterion.
The crucial thing is that the true, top level ‘value x’s goodness criterion’ is a property of an actor, not a critic.
Here is a shard-theory intuition about humans, followed by an idea for an ML experiment that could proof-of-concept its application to RL: Let’s say I’m a guy who cares a lot about studying math well, studies math every evening, and doesn’t know much about drugs and their effects. Somebody hands me some ketamine and recommends that I take it this evening. I take the ketamine before I sit down to study math, and math study goes terribly intellectually, but since I am on ketamine I’m having a good time, and credit gets assigned to the ‘taking ketamine before I sit down to study math’ computation. So my policy network gets updated to increase the probability of the computation ‘take ketamine before I sit down to study math.’
HOWEVER my world-model also gets updated, acquiring the new knowledge ‘taking ketamine before I sit down to study math makes math-study go terribly intellectually.’ And if I have a strong enough ‘math study’ value shard, then in light of this new knowledge the ‘math study’ value shard is going to forbid taking ketamine before I sit down to study math. So my ‘take ketamine before sitting down to study math’ exploration resulted in me developing an overall disposition against taking ketamine before sitting down to study math, even though the computation ‘take ketamine before sitting down to study math’ was directly reinforced! (Because the same act of exploration also resulted in a world-model update that associated the computation ‘take ketamine before sitting down to study math’ with implications that an already-powerful shard opposes.)
This is important, I think, because it shows that an agent can explore relatively freely without being super vulnerable to value-drift, and that you don’t necessarily need complicated reflective reasoning to have (at least very basic) anti-value-drift mechanisms. Since reinforcement is a pretty gradual thing, you can often try an action you don’t know much about, and if it turns out that this action has high reward but also direct implications that your already existing powerful shards oppose then the weak shard formed by that single reinforcement pass will be powerless.
Now the ML experiment idea: A game where the agent gets rewarded for (e.g.) jumping high. After the agent gets somewhat trained, we continue training but introduce various ‘powerups’ the agent can pick up that increase or decrease the agent’s jumping capacity. We train a little more, and now we introduce (e.g.) green potions that decrease the agent’s jumping capacity but increase the reward multiplier (positive for expected reward on the balance).
My weak hypothesis is that even though trying green potions gets a reinforcement event, the agent will avoid green potions after trying them. This is because there’d be a strong ‘avoid things that decrease jumping capacity’ shard already in place that will take charge once the agent learns to associate taking green potions with decrease in jumping capacity. (Though maybe it’s more complicated: maybe there will be a kind of race between ‘taking green potions’ getting reinforced and the association between taking green potions and decrease in jumping capacity forming and activating the ‘avoid things that decrease jumping capacity’ shard.)
Another interesting question: what will happen if we introduce (e.g.) red potions that increase the agent’s jumping capacity but decrease the reward multiplier (negative for expected reward on the balance)?
Seems clear that as the agent takes red potions over and over the reinforcement process will eventually remove the disposition to take red potions, but would this also start to push the agent towards forming some kind of mental representation of ‘reward’ to model what’s going on? If we introduce red potions first, then do some training, and then introduce green potions, would the experience with red potions make the agent respond differently (perhaps more like a reward maximiser) to trying green potions?
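For concreteness, here is a rough sketch of how the potion reward structure could be set up. Everything in it is a hypothetical choice made for illustration: the names `AgentState`, `POTIONS`, `drink` and `jump_reward`, the multiplicative form of the reward, and the specific factors. The only constraints taken from the description are that green potions lower jumping capacity while raising expected reward on balance, and that red potions do the opposite.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentState:
    jump_capacity: float = 1.0      # how high the agent can jump
    reward_multiplier: float = 1.0  # scales the reward for a jump


# Hypothetical (jump factor, reward-multiplier factor) per potion colour.
POTIONS = {
    "green": (0.8, 1.5),  # lower jumps, but net reward per jump x1.2
    "red": (1.2, 0.5),    # higher jumps, but net reward per jump x0.6
}


def drink(state: AgentState, colour: str) -> AgentState:
    """Return the state after drinking a potion of the given colour."""
    jump_factor, multiplier_factor = POTIONS[colour]
    return AgentState(
        jump_capacity=state.jump_capacity * jump_factor,
        reward_multiplier=state.reward_multiplier * multiplier_factor,
    )


def jump_reward(state: AgentState) -> float:
    """Reward for a maximal jump: height times multiplier (assumed form)."""
    return state.jump_capacity * state.reward_multiplier


base = AgentState()
green = drink(base, "green")
red = drink(base, "red")

# Green: jumps get worse, reward gets better. Red: the reverse.
print(green.jump_capacity < base.jump_capacity, jump_reward(green) > jump_reward(base))
print(red.jump_capacity > base.jump_capacity, jump_reward(red) < jump_reward(base))
```

Under this setup, the hypothesis would be tested by comparing how often the trained agent picks up each colour before and after its first few encounters, and by swapping the order in which red and green potions are introduced.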