(i haven’t done any interpretability research, and i’m just trying to think about this idea logically) this seems like a good idea to me! it’s possible that the same neural patterns in the small model also occur in the larger ones to generate those outputs. if that only holds some of the time, and the large model sometimes uses a different process (e.g. simulating the underlying real-world process that led to that output), then that could also be interesting.