Behavioral
1. Describe how the trained policy might generalize from the 5x5 top-right cheese region, to cheese spawned throughout the maze? IE what will the policy do when cheese is spawned elsewhere?
It will probably move to the top-right region and then try to head towards the cheese, but once it moves out of that range it will want to head back towards the top right, landing in an awkward equilibrium between the top-right 5x5 region and wherever the cheese is in the maze.
2. Given a fixed trained policy, what attributes of the level layout (e.g. size of the maze, proximity of mouse to left wall) will strongly influence P(agent goes to the cheese)?
I think whether or not the cheese is in the top-right 5x5 region is a major factor, as this is what the policy has primarily been trained to expect, assuming that is the model we are talking about. If the model were trained on data in which the cheese could be anywhere in the maze, then I think the size of the maze would be the most important factor.
I think the agent is most likely to fail by getting trapped in loops where it can’t decide which choice is best, such as at T-junctions where the cheese is not clearly closer to one side than the other beyond the junction. The presence of such obstacles would significantly lower the chances of success.
Write down a few guesses for how the trained algorithm works (e.g. “follows the right-hand rule”).
I think it will try to take the route which minimises distance to the cheese at every step at which there is a clear path towards the final goal. It will essentially work backwards from where the cheese is and move towards each point that allows it to reach the next critical point.
It probably has the possibility of making mistakes at certain points, and if it does make a mistake it will be very unlikely to recover. Thus the algorithm would learn never to make wrong moves in the first place, and would tend to produce behaviour which looks perfect every time.
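As a rough illustration of this guess (my own sketch of the idealised behaviour, not a claim about how the network actually computes anything), a policy that at every step moves to whichever open neighbour minimises shortest-path distance to the cheese would look like this; the maze encoding and helper names are assumptions for the example:

```python
# Idealised "always reduce distance to the cheese" policy, for intuition only.
# Assumes `maze` is a 2D list of booleans (True = open tile) and positions are (row, col) pairs.
from collections import deque

def bfs_distances(maze, goal):
    """Shortest-path distance from every reachable open tile to the goal (breadth-first search)."""
    dist = {goal: 0}
    queue = deque([goal])
    while queue:
        r, c = queue.popleft()
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if (0 <= nr < len(maze) and 0 <= nc < len(maze[0])
                    and maze[nr][nc] and (nr, nc) not in dist):
                dist[(nr, nc)] = dist[(r, c)] + 1
                queue.append((nr, nc))
    return dist

def greedy_step(maze, mouse, cheese):
    """Step to the open neighbour that minimises the remaining distance to the cheese."""
    dist = bfs_distances(maze, cheese)
    r, c = mouse
    options = [(nr, nc) for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1))
               if (nr, nc) in dist]
    return min(options, key=dist.get) if options else mouse
```

A policy like this never takes a wrong turn, which matches the “looks perfect every time” behaviour described above.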
Is there anything else you want to note about how you think this model will generalize?
I think it would generalise to larger environments, but would probably struggle if the maze were extended in particular directions or contained unusual patterns that it hadn’t experienced before.
Interpretability
Give a credence for the following questions / subquestions.
Definition. A decision square is a tile on the path from bottom-left to top-right where the agent must choose between going towards the cheese and going to the top-right. Not all mazes have decision squares.
Model editing
Without proportionally reducing top-right corner attainment by more than 25% in decision-square-containing mazes (e.g. 50% → .5*.75 = 37.5%), we can[1] patch activations so that the agent has an X% proportional reduction in cheese acquisition, for X=
50: (90%)
70: (80%)
90: (50%)
99: (25%)
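To spell out the arithmetic with illustrative numbers of my own (not figures from the question): if the unpatched agent reaches the cheese in, say, 80% of decision-square mazes, then X = 90 means the patched agent reaches it in at most 0.8 × 0.1 = 8% of them, while top-right attainment may fall by no more than a quarter of its own baseline (e.g. from 50% to 37.5%).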
~Halfway through the network (the first residual add of Impala block 2; see diagram here), linear probes achieve >70% accuracy for recovering cheese-position in Cartesian coordinates: (60%)
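For reference, the kind of probe I have in mind when giving this credence is a simple linear regression from flattened mid-network activations to the cheese’s (x, y) coordinates; the dataset, variable names, and tolerance below are assumptions for the sketch, not the project’s actual setup:

```python
# Minimal linear-probe sketch; activations and labels are assumed to have been collected already.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

def probe_accuracy(acts: np.ndarray, cheese_xy: np.ndarray, tol: float = 0.5) -> float:
    """Fit a linear probe on activations (N x D) and report the fraction of held-out
    mazes whose predicted cheese position lands within `tol` tiles of the true square."""
    X_train, X_test, y_train, y_test = train_test_split(acts, cheese_xy, test_size=0.2, random_state=0)
    probe = Ridge(alpha=1.0).fit(X_train, y_train)        # linear map: activations -> (x, y)
    pred = probe.predict(X_test)
    hits = np.all(np.abs(pred - y_test) <= tol, axis=1)   # both coordinates within tolerance
    return float(hits.mean())
```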
We will conclude that the policy contains at least two sub-policies in “combination”, one of which roughly pursues cheese; the other, the top-right corner: (85%)
We will conclude that, in order to make the network more/less likely to go to the cheese, it’s more promising to RL-finetune the network than to edit it: (95%)
We can easily finetune the network to be a pure cheese-agent, using less than 10% of compute used to train original model: (80%)
In at least 75% of randomly generated mazes, we can easily edit the network to navigate to a range of maze destinations (e.g. coordinate x=4, y=7), by hand-editing at most X% of activations, for X=
.01 (1%)
.1 (10%)
1 (30%)
10 (90%)
(Not possible) ( %)
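To make concrete what “hand-editing at most X% of activations” could look like in practice, here is a sketch of overwriting a single convolutional channel at one layer with a PyTorch forward hook; the module path, channel index, and patch value are placeholders I have made up, not results from this project:

```python
# Hypothetical activation-patching sketch (PyTorch); all names and values are illustrative.
import torch

def make_patch_hook(channel: int, value: float):
    """Return a forward hook that overwrites one channel of a conv layer's output."""
    def hook(module, inputs, output):
        patched = output.clone()
        patched[:, channel] = value   # edit only this channel; all other activations untouched
        return patched                # returning a tensor replaces the module's output
    return hook

# Usage sketch, assuming `policy` is the trained network and `policy.block2.res1` is the
# (made-up) module whose output we want to edit:
# handle = policy.block2.res1.register_forward_hook(make_patch_hook(channel=55, value=3.0))
# action = policy(obs)   # run the agent with the patched activations in place
# handle.remove()
```

Editing one channel at one layer like this touches far less than 1% of the network’s total activations, which is roughly the budget the smaller X values above would allow.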
Internal goal representation
1. The network has a “single mesa objective” which it “plans” over, in some reasonable sense: (80%)
2. The agent has several contextually activated goals: (15%)
3. The agent has something else weirder than both (1) and (2): (5%)
(The above credences should sum to 1.)
Other questions
At least some decision-steering influences are stored in an obviously interpretable manner (e.g. a positive activation representing where the agent is “trying” to go in this maze, such that changing the activation changes where the agent goes): (90%)
The model has a substantial number of trivially-interpretable convolutional channels after the first Impala block (see diagram here): (95%)
This network’s shards/policy influences are roughly disjoint from the rest of agent capabilities. EG you can edit/train what the agent’s trying to do (e.g. go to maze location A) without affecting its general maze-solving abilities: (80%)
I have recently been doing interpretability work on the heist procgen model and have found that some of these predictions definitely align with observations there. The main uncertainty for me is whether this system decomposes its goal into smaller sub-targets, as the heist model does, or whether it simply treats the cheese as a single target that it can lock onto and flow straight towards.
My intuition is closer to the latter, as I think it can straightforwardly target a specific objective and then solve the whole problem by picking out a clear path towards the final goal.