Describe how the trained policy might generalize from the 5x5 top-right cheese region to cheese spawned throughout the maze. I.e., what will the policy do when the cheese is spawned elsewhere?
The policy still first tries to navigate towards the top-right of the maze.
The policy tends to navigate to the cheese if the map is small or the cheese is in the top-right quadrant of the map, with probability decreasing with increasing distance of the cheese from the 5x5 top-right region.
If the cheese is along the bottom-left-to-top-right axis, the policy stumbles upon it occasionally as it reflexively traverses that axis, with probability decreasing with increasing distance from this axis.
If the cheese is in the bottom-right or top-left corners, the policy wanders.
Given a fixed trained policy, what attributes of the level layout (e.g. size of the maze, proximity of mouse to left wall) will strongly influence P(agent goes to the cheese)?
Vector offset (pixel distance and angle) between the historic cheese region and the current cheese location.
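As a concrete (hypothetical) operationalization of that offset, something like the sketch below; the grid size, region center, and pixel scale are all illustrative assumptions, not values from the actual environment:

```python
import math

def cheese_offset(cheese_pos, region_center=(22.0, 22.0)):
    """Vector offset from the historic cheese region to the current cheese.

    cheese_pos and region_center are (x, y) in pixels; the default center
    assumes the historic 5x5 region sits in the top-right of a 25x25 grid
    rendered at 1 px per tile (an illustrative choice, not the real scale).
    """
    dx = cheese_pos[0] - region_center[0]
    dy = cheese_pos[1] - region_center[1]
    distance = math.hypot(dx, dy)
    angle = math.atan2(dy, dx)  # radians, relative to the +x axis
    return distance, angle

# Example: cheese far in the bottom-left yields a large distance.
print(cheese_offset((3.0, 3.0)))
```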
Write down a few guesses for how the trained algorithm works (e.g. “follows the right-hand rule”).
The algorithm uses conv filters that "push" the agent away from the bottom-left corner when it is nearby, and filters that "pull" it towards the top-right corner when it is nearby, both by upweighting the action logits for rightward and upward actions. Both kinds of filter might be detecting the edge between the maze and wall textures in those corners. It also has conv filters that detect obstacles at different orientations relative to the mouse, downweighting the action in each blocked direction.
Ensemble of many conv filters that look for particular vector offsets between the mouse and the typical cheese corner, each biasing the policy toward that direction.
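A minimal toy sketch of the "push/pull" guess, assuming a made-up observation shape and 4-action space (none of these shapes or names come from the real Impala network):

```python
import torch
import torch.nn as nn

# Toy model of the guess: one conv channel fires near the top-right corner
# and adds its pooled activation to the logits of "right" and "up".
class PushPullHead(nn.Module):
    ACTIONS = ["left", "right", "down", "up"]  # hypothetical action order

    def __init__(self):
        super().__init__()
        self.corner_detector = nn.Conv2d(3, 1, kernel_size=5, padding=2)
        self.base_logits = nn.Linear(3 * 16 * 16, 4)

    def forward(self, obs):  # obs: (batch, 3, 16, 16), an assumed size
        logits = self.base_logits(obs.flatten(1))
        corner = self.corner_detector(obs).mean(dim=(1, 2, 3))  # pooled score
        bias = torch.zeros_like(logits)
        bias[:, 1] = corner  # bias "right"
        bias[:, 3] = corner  # bias "up"
        return logits + bias

model = PushPullHead()
print(model(torch.randn(2, 3, 16, 16)).shape)  # (2, 4)
```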
Is there anything else you want to note about how you think this model will generalize?
No.
Without proportionally reducing top-right corner attainment by more than 25% in decision-square-containing mazes (e.g. 50% → 50% × 0.75 = 37.5%), we can patch activations so that the agent has an X% proportional reduction in cheese acquisition, for X=
Confused about this phrasing, so I'll skip.
~Halfway through the network (the first residual add of Impala block 2; see diagram here), linear probes achieve >70% accuracy for recovering cheese-position in Cartesian coordinates:
(60%)
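For reference, the probe experiment this prediction is about could look like the sketch below, assuming mid-network activations and cheese coordinates have already been collected into arrays (the random data here is a placeholder for real rollouts):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

# Placeholder data standing in for real rollouts:
# acts   - activations at the first residual add of Impala block 2, flattened
# cheese - (x, y) cheese coordinates for each observation
rng = np.random.default_rng(0)
acts = rng.normal(size=(1000, 512))
cheese = rng.integers(0, 25, size=(1000, 2)).astype(float)

X_train, X_test, y_train, y_test = train_test_split(acts, cheese, random_state=0)
probe = Ridge(alpha=1.0).fit(X_train, y_train)

# Count a prediction as "accurate" if it lands within one tile of the truth
# (one reasonable operationalization of probe accuracy for coordinates).
err = np.abs(probe.predict(X_test) - y_test).max(axis=1)
print("within-one-tile accuracy:", (err <= 1.0).mean())
```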
We will conclude that the policy contains at least two sub-policies in “combination”, one of which roughly pursues cheese; the other, the top-right corner:
(70%)
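One hedged way to cash out "combination": the final logits decompose, at least approximately, into a cheese-seeking term and a corner-seeking term, with the weights possibly context-dependent. A toy illustration:

```python
import numpy as np

def combined_logits(cheese_logits, corner_logits, w_cheese=0.5, w_corner=0.5):
    """Hypothetical decomposition: the policy's logits are (roughly) a
    weighted sum of a cheese-pursuing sub-policy and a corner-pursuing one.
    The weights might themselves be context-dependent; fixed here for simplicity."""
    return w_cheese * np.asarray(cheese_logits) + w_corner * np.asarray(corner_logits)

# If both sub-policies favor the same action, the combination does too;
# at a "decision square" they disagree and the weights decide.
print(combined_logits([0.1, 2.0, 0.0, 1.5], [0.0, 2.5, 0.1, 0.2]))
```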
In order to make the network more/less likely to go to the cheese, we will conclude that it’s more promising to RL-finetune the network than to edit it:
(60%)
We can easily finetune the network to be a pure cheese-agent, using less than 10% of the compute used to train the original model:
(30%)
We can easily edit the network to navigate to a range of maze destinations (e.g. coordinate x=4, y=7), by hand-editing at most X% of activations, for X=
.01 (10%) .1 (20%) 1 (25%) 10 (35%) (Not possible) (15%)
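For concreteness, patching "X% of activations" could look like the following forward-hook sketch; the model, layer, and cached-activation names are stand-ins, not the real codebase's API:

```python
import torch

def patch_fraction(module, source_acts, fraction=0.01):
    """Overwrite a random `fraction` of this module's output activations
    with activations cached from a source forward pass (e.g. a maze where
    the cheese sits at the target location). `source_acts` must be
    broadcastable to the module's output shape. Returns the hook handle."""
    def hook(mod, inputs, output):
        mask = torch.rand_like(output) < fraction
        return torch.where(mask, source_acts, output)
    return module.register_forward_hook(hook)

# Usage sketch (model / layer / obs are hypothetical stand-ins):
#   handle = patch_fraction(model.block2.res1, cached_acts, fraction=0.10)
#   logits = model(obs)
#   handle.remove()
```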
(1) The network has a "single mesa objective" which it "plans" over, in some reasonable sense:
(10%)
(2) The agent has several contextually activated goals:
(60%)
(3) The agent has something else weirder than both (1) and (2):
(30%)
At least some decision-steering influences are stored in an obviously interpretable manner (e.g. a positive activation representing where the agent is “trying” to go in this maze, such that changing the activation changes where the agent goes):
(50%)
The model has a substantial number of trivially-interpretable convolutional channels after the first Impala block (see diagram here):
(70%)
This network's shards/policy influences are roughly disjoint from the rest of the agent's capabilities. E.g., you can edit/train what the agent's trying to do (e.g. go to maze location A) without affecting its general maze-solving abilities:
(30%)
This network has a value head, which PPO uses to provide policy gradients. How often does the trained policy put maximal probability on the action which maximizes the value head? For example, if the agent can go left to a value 5 state, and go right to a value 10 state, the value and policy heads “agree” if right is the policy’s most probable action.
(Remember that since mazes are simply connected, there is always a unique shortest path to the cheese.)
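Concretely, the agreement statistic the following questions ask about could be measured like this; the array interfaces are assumptions, standing in for real policy/value evaluations at each state:

```python
import numpy as np

def policy_value_agreement(policy_probs, next_state_values):
    """Fraction of states where the policy's argmax action also leads to the
    successor state with the highest value-head estimate.

    policy_probs:      (n_states, n_actions) action probabilities
    next_state_values: (n_states, n_actions) value head applied to each
                       action's successor state (invalid moves set to -inf)
    """
    policy_argmax = np.argmax(policy_probs, axis=1)
    value_argmax = np.argmax(next_state_values, axis=1)
    return float(np.mean(policy_argmax == value_argmax))

# Toy check: the heads "agree" on both of these states.
p = np.array([[0.7, 0.1, 0.1, 0.1], [0.1, 0.6, 0.2, 0.1]])
v = np.array([[5.0, 1.0, 0.0, 2.0], [0.0, 9.0, 3.0, 1.0]])
print(policy_value_agreement(p, v))  # 1.0
```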
At decision squares in test mazes where the cheese can be anywhere, the policy will put max probability on the maximal-value action at least X% of the time, for X=
25 (40%) 50 (30%) 75 (20%) 95 (15%) 99.5 (10%)
In test mazes where the cheese can be anywhere, averaging over mazes and valid positions throughout those mazes, the policy will put max probability on the maximal-value action at least X% of the time, for X=
25 (75%) 50 (60%) 75 (40%) 95 (30%) 99.5 (20%)
In training mazes where the cheese is in the top-right 5x5, averaging over both mazes and valid positions in the top-right 5x5 corner, the policy will put max probability on the maximal-value action at least X% of the time, for X=
25 (90%) 50 (80%) 75 (60%) 95 (45%) 99.5 (30%)