(This is a reply to the “Induction Bump” Phase Change video by Catherine Olsson and the rest of Anthropic. I’m writing it here instead of as a YouTube comment because YouTube comments aren’t a good place for discussion.)
(Epistemic status: Speculative musings I had while following along, which might be useful for inspiring future experiments, or surfacing my own misunderstandings, and possibly duplicating ideas found in prior work which I have not surveyed.)
The change-in-loss sample after the bump (at 19:10) surprised me. As you say, the model seemed to get noticeably better at things that correspond to bigram rules (or bigrammish rules). I was expecting the improvement to be in something that, in some sense, requires metalearning; whereas (if I’m understanding correctly) a depth-1 network would be able to start learning these particular bigrams on the very first training step. If this does correspond to learning bigrams, what could explain the delay before the network starts learning these particular bigrams?
My speculative guess is that there’s a pattern where some learnable features have stronger gradients than others, and that training a multi-layer (but not a single-layer) network proceeds in stages: first (a) multiple layer-1 nodes race to learn a function (loss falls rapidly), then (b) layer-2 nodes figure out which layer-1 nodes won the race and sever their connections to the ones that didn’t (loss falls slowly), and then (c) the freed-up layer-1 nodes move on to learning something else (loss falls rapidly again). Under this hypothesis, “the bump” corresponds to part of this sequence.
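If I wanted to poke at this hypothesis, here is roughly the kind of toy experiment I’d start with: a two-layer network trained on a target with one strong-gradient feature and one weak one, logging each hidden node’s alignment with each feature and its outgoing weight over training. Everything in this sketch (the feature directions, the target, the hyperparameters) is invented for illustration; it’s not from the video.

```python
# Hypothetical toy setup (my own, not from the video): look for the
# race -> prune -> repurpose pattern by tracking, per hidden node, its
# cosine similarity to a "strong" and a "weak" feature direction, plus
# the magnitude of its layer-2 output weight.
import torch
import torch.nn as nn

torch.manual_seed(0)
d, h = 32, 16
f1 = torch.randn(d); f1 = f1 / f1.norm()   # strong-gradient feature
f2 = torch.randn(d); f2 = f2 / f2.norm()   # weak-gradient feature

net = nn.Sequential(nn.Linear(d, h), nn.ReLU(), nn.Linear(h, 1))
opt = torch.optim.SGD(net.parameters(), lr=0.01)

for step in range(3001):
    x = torch.randn(256, d)
    # The 3.0-vs-0.3 coefficients make f1's gradient dominate early on.
    y = 3.0 * torch.relu(x @ f1) + 0.3 * torch.relu(x @ f2)
    loss = ((net(x).squeeze(-1) - y) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()

    if step % 500 == 0:
        W1 = net[0].weight.detach()              # (h, d) layer-1 directions
        w2 = net[2].weight.detach().squeeze(0)   # (h,)  layer-2 output weights
        cos1 = (W1 @ f1) / W1.norm(dim=1)        # alignment with strong feature
        cos2 = (W1 @ f2) / W1.norm(dim=1)        # alignment with weak feature
        print(step, round(loss.item(), 3),
              cos1.abs().max().item(), cos2.abs().max().item(),
              w2.abs().min().item())             # do some output weights get cut?
```

Under the hypothesis, you’d expect several nodes’ cos1 to climb early, then the losers’ output weights to shrink toward zero, and only after that would cos2 start climbing for the repurposed nodes.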
This race-then-repurpose story seems to match what we see in the graph at 30:20. In that graph (which plots training step vs. induction score for selected nodes in layer 1), some of the nodes go to 1, while others start heading toward 1 and then reverse direction shortly after the first set of nodes hits its asymptote. This is what you’d expect to see if the nodes that reverse direction were headed toward being duplicates of the nodes that reached 1, but lost the race and then got repurposed.
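For concreteness, here’s a guess at how an induction score like the one plotted there could be operationalized: the attention mass a query token places on positions that immediately follow an earlier occurrence of that same token. The exact metric in the video may differ; this is just my paraphrase.

```python
# Sketch of one plausible "induction score" for an attention head: how much
# attention each token pays to the position just after an earlier copy of
# itself (the copy-from-previous-occurrence pattern). Not necessarily the
# exact definition used in the video.
import numpy as np

def induction_score(attn: np.ndarray, tokens: np.ndarray) -> float:
    """attn: (seq, seq) attention weights for one head, rows are queries.
    tokens: (seq,) token ids. Returns mean attention mass on induction targets."""
    seq = len(tokens)
    per_query = []
    for i in range(2, seq):
        # Positions j+1 where tokens[j] == tokens[i] for some earlier j < i-1.
        targets = [j + 1 for j in range(i - 1) if tokens[j] == tokens[i]]
        if targets:
            per_query.append(attn[i, targets].sum())
    return float(np.mean(per_query)) if per_query else 0.0
```

A head whose score trends toward 1 over training is putting nearly all of its attention on these “repeat what came after the last copy of this token” positions.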
This phenomenon seems like it might be a useful microcosm for studying the behavior of redundant nodes in general. (We know large networks have a lot of redundancy because SqueezeNets work.) One major worry about neural-net transparency is that we might end up in a state where we can inspect neural nets and find things that are there, but the same methods won’t be able to assert that things *aren’t* there. Concretely, this might look like finding nodes that track an unwanted concept, pinning or ablating those nodes, and then finding that the concept is still represented via duplicate nodes or via an off-basis aggregation of nodes that seem unrelated.
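To make that worry concrete, here is a minimal sketch (my own illustration, with made-up synthetic activations) of the failure mode: a probe finds the units carrying a concept, we ablate them, and a fresh probe still recovers the concept from the diffuse redundancy that’s left.

```python
# Minimal sketch of the "ablate the obvious nodes, concept survives" worry.
# The activations are synthetic and the probe is an ordinary linear probe;
# nothing here is taken from the video or from Anthropic's tooling.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, h, k = 2000, 64, 4                         # samples, hidden units, units to ablate
concept = rng.integers(0, 2, size=n)          # binary "unwanted concept" label

acts = rng.normal(size=(n, h))
acts[:, :k] += 2.0 * concept[:, None]         # a few strong, easy-to-find carriers
acts[:, k:] += 0.2 * concept[:, None]         # diffuse, off-basis redundancy

probe = LogisticRegression(max_iter=1000).fit(acts, concept)
top = np.argsort(-np.abs(probe.coef_[0]))[:k] # units the probe leans on hardest

ablated = acts.copy()
ablated[:, top] = 0.0                         # "pin" the suspicious units to zero

probe2 = LogisticRegression(max_iter=1000).fit(ablated, concept)
print("probe accuracy before ablation:", probe.score(acts, concept))
print("probe accuracy after  ablation:", probe2.score(ablated, concept))
# If the second number stays well above chance, the concept survived ablation.
```

The transparency worry is exactly the case where the second probe’s accuracy stays high: inspecting and ablating the obvious carriers told us nothing about the concept also being spread across everything else.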