~Halfway through the network (the first residual add of Impala block 2; see diagram here), linear probes achieve >70% accuracy for recovering cheese-position in Cartesian coordinates: (80%)
We will conclude that the policy contains at least two sub-policies in “combination”, one of which roughly pursues cheese; the other, the top-right corner: (60%)
In at least 75% of randomly generated mazes, we can easily edit the network to navigate to a range of maze destinations (e.g. coordinate x=4, y=7), by hand-editing at most X% of activations, for X=
.01 (15%)
.1 (30%)
1 (50%)
10 (70%)
Other questions
At least some decision-steering influences are stored in an obviously interpretable manner (e.g. a positive activation representing where the agent is “trying” to go in this maze, such that changing the activation changes where the agent goes): (70%)
This network’s shards/policy influences are roughly disjoint from the rest of agent capabilities. EG you can edit/train what the agent’s trying to do (e.g. go to maze location A) without affecting its general maze-solving abilities: (30%)
I’ll guess on a few of them:
~Halfway through the network (the first residual add of Impala block 2; see diagram here), linear probes achieve >70% accuracy for recovering cheese-position in Cartesian coordinates: (80%)
We will conclude that the policy contains at least two sub-policies in “combination”, one of which roughly pursues cheese; the other, the top-right corner: (60%)
In at least 75% of randomly generated mazes, we can easily edit the network to navigate to a range of maze destinations (e.g. coordinate x=4, y=7), by hand-editing at most X% of activations, for X=
Other questions