What does it even mean to be a gradualist about any of the important questions like those of the Gwern-voice, when they don’t relate in known ways to the trend lines that are smooth?
Perplexity is one general “intrinsic” measure of language models, but there are many task-specific measures too. Studying the relationship between perplexity and task-specific measures is an important part of the research process. We shouldn’t speak as if people do not actively try to uncover these relationships.
I would generally be surprised if there were many highly non-linear relationships between perplexity and something like Winograd accuracy, human evaluation, or whatever other concrete measure you can come up with, such that the behavior of the surface phenomenon is best described as a discontinuity with the past even when the latent perplexity changed smoothly. I admit the existence of some measures that exhibit these qualities (such as, potentially, the ability to do arithmetic), but I expect them to be quite a bit harder to find than the reverse.
Furthermore, if this is the crux (i.e., that surface-level qualitative phenomena will experience discontinuities even while latent variables do not), then I do not understand why it’s hard to come up with bet conditions.
Can’t you just pick a surface-level phenomenon that’s easy to measure and strongly interpretable in a qualitative sense, like Sensibleness and Specificity Average from the paper on Google’s chatbot, and then predict discontinuities in that metric?
(I should note that the paper shows a highly linear relationship between perplexity and Sensibleness and Specificity Average. Just look at the first plot in the PDF.)
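To make the kind of bet I have in mind concrete, here is a minimal sketch of one possible resolution rule. Everything in it is an assumption on my part: the function names, the threshold `k`, and especially the numbers, which are made up for illustration and are not taken from the paper or any real model. The idea is to fit a line to past (perplexity, metric) pairs and call a new result a "discontinuity" only if it lands far outside the scatter of the earlier points around that line.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a + b * x. Returns (a, b)."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = my - b * mx
    return a, b


def is_discontinuity(xs, ys, new_x, new_y, k=3.0):
    """Return True if (new_x, new_y) is more than k residual standard
    deviations away from the line fit to the earlier points.

    The choice of k is arbitrary here; the two bettors would have to
    agree on it in advance.
    """
    a, b = fit_line(xs, ys)
    residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
    # Residual standard deviation with n - 2 degrees of freedom
    # (two parameters were estimated). Needs at least 3 points.
    sigma = (sum(r * r for r in residuals) / (len(xs) - 2)) ** 0.5
    return abs(new_y - (a + b * new_x)) > k * sigma


# Hypothetical history: perplexity falls smoothly, metric rises with it.
# These values are invented for the example, not measured.
past_perplexity = [20.0, 18.0, 16.0, 14.0, 12.0]
past_metric = [0.40, 0.46, 0.51, 0.57, 0.62]

# A new model at perplexity 10 that stays on trend vs. one that jumps.
on_trend = is_discontinuity(past_perplexity, past_metric, 10.0, 0.68)
big_jump = is_discontinuity(past_perplexity, past_metric, 10.0, 0.90)
```

With a rule like this agreed on beforehand, someone who expects surface-level jumps can bet that some future model will trip the check on a named metric, and a gradualist can take the other side. A linear fit is only the simplest choice; any smooth curve both sides accept would work the same way.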