Interpreting a Maze-Solving Network

20 Apr 2023 22:36 UTC

Mechanistic interpretability of a pretrained policy network from "Goal Misgeneralization in Deep Reinforcement Learning".

Predictions for shard theory mechanistic interpretability results

TurnTrout, Ulisse Mini and peligrietzer

1 Mar 2023 5:16 UTC

105 points

10 comments, 5 min read

Understanding and controlling a maze-solving policy network

TurnTrout, peligrietzer, Ulisse Mini, Monte M and David Udell

11 Mar 2023 18:59 UTC

312 points

22 comments, 23 min read

Maze-solving agents: Add a top-right vector, make the agent go to the top-right

TurnTrout, peligrietzer and lisathiergart

31 Mar 2023 19:20 UTC

101 points

17 comments, 11 min read

Behavioural statistics for a maze-solving agent

peligrietzer and TurnTrout

20 Apr 2023 22:26 UTC

44 points

11 comments, 10 min read