Interpreting a Maze-Solving Network

Mechanistic interpretability on a pretrained policy network from the paper "Goal Misgeneralization in Deep Reinforcement Learning".

Predictions for shard theory mechanistic interpretability results

Understanding and controlling a maze-solving policy network

Maze-solving agents: Add a top-right vector, make the agent go to the top-right

Behavioural statistics for a maze-solving agent