“Everyone chose it post hoc after seeing that it worked and better BPC = better models.”
I realize your comment is in the context of a comment I also disagree with, and I think I agree with most of what you’re saying, but I want to challenge the framing you have at the end.
BPC is at its core a continuous generalization of the Turing Test, aka the imitation game. It is not an exact translation, but it preserves all the key difficulties, and therefore keeps most of the same strengths, and it does this while extrapolating to weaker models in a useful and modelable way. We might only have started caring viscerally about the numbers that BPC gives, or associating them directly with things of huge importance, around the advent of GPT, but that’s largely just a situational byproduct of our understanding. Turing understood the importance of the imitation game back in 1950, enough to write a paper on it, and certainly that paper didn’t go unnoticed.
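To make the quantity concrete, here is a minimal sketch of what BPC measures: the average number of bits (negative log2 probability) a model spends on each next character of real text. The two toy models and all the function names below are my own illustrative assumptions, not anything from a real benchmark or library.

```python
import math

def bits_per_character(text, next_char_prob):
    """Average -log2 p(char | preceding text) over every character in text."""
    if not text:
        raise ValueError("text must be non-empty")
    total_bits = 0.0
    for i, ch in enumerate(text):
        p = next_char_prob(text[:i], ch)  # model's probability for the true next char
        total_bits += -math.log2(p)
    return total_bits / len(text)

def uniform_model(alphabet):
    """Hypothetical baseline: every character in the alphabet is equally likely."""
    def prob(context, ch):
        return 1.0 / len(alphabet)
    return prob

def alternating_model(context, ch):
    """Hypothetical better model: expects each character to differ from the last."""
    if not context:
        return 0.5
    return 0.99 if ch != context[-1] else 0.01

def perplexity_per_character(bpc):
    """Per-character perplexity is 2 raised to the bits-per-character."""
    return 2.0 ** bpc

text = "abababab"
baseline_bpc = bits_per_character(text, uniform_model("ab"))
better_bpc = bits_per_character(text, alternating_model)
print(baseline_bpc, better_bpc, perplexity_per_character(baseline_bpc))
```

The model that has actually picked up the structure of the text pays fewer bits per character than the uniform baseline. That is the sense in which the score is continuous: every bit of improvement is a bit less of mismatch between the model's predictions and the text a human actually wrote.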
Nor can I see the core BPC:Turing Test correspondence as something purely post hoc. If people didn’t give it much thought, that’s probably because there was no scaling law then, and no expectation that you could just take your hacky grammar-infused Markov chain and extrapolate it out to capture more than surface-level syntax. Even among earlier neural models, what’s the point of looking at extrapolations of a generalized Turing Test when the models are still figuring out surface-level syntactic details? Like, is it really an indictment of BPC, to say that when we saw
the meaning of life is that only if an end would be of the whole supplier. widespread rules are regarded as the companies of refuses to deliver. in balance of the nation’s information and loan growth associated with the carrier thrifts are in the process of slowing the seed and commercial paper
we weren’t asking, ‘gee, I wonder how close this is to passing the Turing Test, by some generalized continuous measure’?
I think it’s quite surprising, and importantly so, that it has turned out to be a relevant question: that performance on this original datapoint does bear some continuous mathematical relationship with models for which mere grammar is a been-there-done-that, and whose world models we now regularly test for strength. And I get the dismissal: that it’s no proven law that the trend goes this far before stopping, rather than some other stretch, or that it gives no concrete conclusions about what happens at each 0.01 perplexity increment. But I look at my other passion with a straight line, hardware, and I see exactly the same argument applied to almost the same arrow-straight trendline, and I think I’d still much rather trust the person willing to look at the plot and say, gosh, those transistors will be absurdly cheap.
Would that person have predicted today, back at the start? Hell no. Knowing transistor scaling laws doesn’t directly tell you all that much about the discontinuous changes in how computation is done. You can’t look at a graph and say “at a transistor density of X, there will be the iPhone, and at a transistor density of Y, microcontrollers will get so cheap that they will start replacing simple physical switches.” It certainly will not tell you when people will start using the technology to print out tiny displays they will stick inside your glasses, or build MEMS accelerometers, nor can it tell you all of the discrete and independent innovations that overcame the challenges that got us here.
But yet, but yet, lines go straight. Moore’s Law pushed computing forward not because of these concrete individual predictions, but because it told us there was more of the same surprising progress to come, and that the well has yet to run dry. That too is why I think seeing GPT-3’s perplexity is so important. I agree with you, it’s not that we need the perplexity to tell us what GPT-3 can do. GPT-3 will happily tell us that itself. And I think you will agree with me when I say that what’s most important about these trends is that they’re saying there’s more to come, that the next jump will be just as surprising as the last.
Where we maybe disagree is that I’m willing to say these lines can stand by themselves; that you don’t need to see anything more of GPT-3 than its perplexity to know that its capabilities must be impressive, even if you might need to see them to feel it emotionally. You don’t even need to know anything about neural networks or their output samples to see a straight line of bits-per-character that threatens to go so low and forecast that something big is going on. You didn’t need to know anything about CPU microarchitecture to imagine that having ten billion transistors per square centimeter would have massive societal impacts either, as long as you knew what a transistor was and understood its fundamental relation to computation.
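For what it's worth, "reading the line" takes very little machinery. Here is a rough sketch, assuming the trend is exponential so that it is straight in log space: fit a line by least squares to log2 of the values, then extend it. The data points are synthetic and invented purely for illustration (a quantity that doubles every two time units); they are not real transistor or BPC measurements, and the function names are my own.

```python
import math

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def extrapolate_exponential(times, values, target_time):
    """Fit a straight line to log2(values) and extend it to target_time."""
    slope, intercept = fit_line(times, [math.log2(v) for v in values])
    return 2.0 ** (slope * target_time + intercept)

# Synthetic illustration only: a made-up quantity that doubles every 2 time units.
times = [0, 2, 4, 6, 8]
values = [2.0 ** (t / 2) for t in times]

forecast = extrapolate_exponential(times, values, 20)
print(forecast)
```

Nothing in the fit knows what the quantity is or what it will be used for. It only says that if the line keeps going, the number at the far end will be very large, which is the whole of the claim I'm making.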