Ofer comments on Using GPT-N to Solve Interpretability of Neural Networks: A Research Agenda

Ofer 6 Sep 2020 13:59 UTC
1 point
When I said “we need GPT-N to learn a distribution over strings...” I was referring to the implicit distribution that the model learns during training. We need that distribution to assign more probability to the string [a modular NN specification followed by a prompt followed by a natural language description of the modules] than to [a modular NN specification followed by a prompt followed by an arbitrary string]. My concern is that there may be no prompt for which this requirement is satisfied.
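
To make the requirement concrete, here is a minimal sketch of the comparison I have in mind. Everything in it is an assumption for illustration: `requirement_holds` and `toy_log_prob` are hypothetical names, and the toy scorer is a crude stand-in for the model's learned distribution, not anything GPT-N actually computes. It also checks only a finite list of alternatives, whereas the real requirement is about arbitrary strings.

```python
from typing import Callable, Iterable


def requirement_holds(
    log_prob: Callable[[str], float],
    spec: str,
    prompt: str,
    description: str,
    alternatives: Iterable[str],
) -> bool:
    """Return True if spec + prompt + description scores strictly higher
    than spec + prompt + alt for every alternative continuation given."""
    target = log_prob(spec + prompt + description)
    return all(target > log_prob(spec + prompt + alt) for alt in alternatives)


def toy_log_prob(s: str) -> float:
    # Hypothetical stand-in for the model's implicit distribution:
    # rewards mentions of "module" and mildly penalizes length.
    return 2.0 * s.count("module") - 0.01 * len(s)


spec = "layer1->layer2;"
prompt = " Explanation:"
description = " module A detects edges"
alternatives = [" lorem ipsum", " xyzzy"]

print(requirement_holds(toy_log_prob, spec, prompt, description, alternatives))
```

The question in my comment is whether any choice of `prompt` makes this check pass when `log_prob` is the distribution GPT-N actually learned, rather than a scorer built to favor descriptions.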

Re “curating enough examples”: this assumes humans are already able* to describe the modules of a sufficiently powerful language model (one powerful enough to yield such descriptions).

*Able in practice, not just in theory.