PeterMcCluskey comments on Predictions for shard theory mechanistic interpretability results

PeterMcCluskey 8 Mar 2023 3:45 UTC
3 points
I’ll guess on a few of them:

~Halfway through the network (the first residual add of Impala block 2; see diagram here), linear probes achieve >70% accuracy for recovering cheese-position in Cartesian coordinates: (80%)
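For concreteness, here is a minimal sketch of what a linear probe for cheese position could look like. It is not the method used in the original experiments: the activations are synthetic stand-ins, and the names `fit_linear_probe` and `probe_accuracy`, the 25x25 grid, the 64-dimensional activations, and the "both rounded coordinates must match" accuracy criterion are all assumptions made for illustration.

```python
import numpy as np

def fit_linear_probe(acts, coords):
    """Fit a least-squares linear map (with bias) from activations to (x, y)."""
    X = np.hstack([acts, np.ones((acts.shape[0], 1))])  # append a bias column
    W, *_ = np.linalg.lstsq(X, coords, rcond=None)
    return W

def probe_accuracy(W, acts, coords):
    """Fraction of samples where both rounded coordinates are exactly right."""
    X = np.hstack([acts, np.ones((acts.shape[0], 1))])
    pred = np.rint(X @ W)
    return float(np.mean(np.all(pred == coords, axis=1)))

# Synthetic stand-in data (assumption): activations that linearly encode
# the cheese position plus a little noise. Real activations would come
# from the network layer being probed.
rng = np.random.default_rng(0)
n_samples, n_features, grid_size = 2000, 64, 25
coords = rng.integers(0, grid_size, size=(n_samples, 2)).astype(float)
mixing = rng.normal(size=(2, n_features))
acts = coords @ mixing + 0.1 * rng.normal(size=(n_samples, n_features))

# Fit on the first 1500 samples, evaluate on the held-out 500.
W = fit_linear_probe(acts[:1500], coords[:1500])
acc = probe_accuracy(W, acts[1500:], coords[1500:])
print(f"held-out probe accuracy: {acc:.2%}")
```

On this toy data the probe succeeds almost by construction, since the position is linearly encoded. The prediction above is about whether the real mid-network activations behave similarly.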

We will conclude that the policy contains at least two sub-policies in “combination”, one of which roughly pursues cheese; the other, the top-right corner: (60%)

In at least 75% of randomly generated mazes, we can easily edit the network to navigate to a range of maze destinations (e.g. coordinate x=4, y=7), by hand-editing at most X% of activations, for X=
0.01 (15%)

0.1 (30%)

1 (50%)

10 (70%)
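To give a feel for how tight these budgets are, here is a small sketch that converts X% into a number of editable activations and measures what percentage an edit actually touched. The tensor shape `(128, 16, 16)` is a hypothetical example, not the real network's layer size, and the helper names `edit_budget` and `edited_percent` are assumptions. Whether the percentage is counted per layer or over the whole network is not specified above.

```python
import numpy as np

def edit_budget(n_activations, percent):
    """Largest whole number of activations that fits within `percent` of the total."""
    return int(n_activations * percent / 100)

def edited_percent(original, edited):
    """Percentage of activations whose value differs after an edit."""
    return 100.0 * float(np.mean(original != edited))

# Hypothetical activation tensor shape (channels, height, width).
shape = (128, 16, 16)
n_activations = int(np.prod(shape))

for percent in (0.01, 0.1, 1, 10):
    print(f"X={percent}%: up to {edit_budget(n_activations, percent)} "
          f"of {n_activations} activations")

# Example edit: overwrite a single channel and check the fraction changed.
original = np.zeros(shape)
edited = original.copy()
edited[0] = 1.0  # hand-edit one channel out of 128
print(f"edited {edited_percent(original, edited):.3f}% of activations")
```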
Other questions
At least some decision-steering influences are stored in an obviously interpretable manner (e.g. a positive activation representing where the agent is “trying” to go in this maze, such that changing the activation changes where the agent goes): (70%)

This network’s shards/policy influences are roughly disjoint from the rest of agent capabilities. EG you can edit/train what the agent’s trying to do (e.g. go to maze location A) without affecting its general maze-solving abilities: (30%)