In their most straightforward form (“foundation models”), language models are a technology which naturally scales to something in the vicinity of human-level (because it’s about emulating human outputs), not one that naturally shoots way past human-level performance.
You address this to some extent later in the post, but I think it’s worth emphasizing how much this holds specifically because the language models in question are trained on human outputs. If you take a transformer with the same architecture but train it on tokenized output streams from a specific model of weather station, it will learn to predict the next token of that station’s output stream, at a level of accuracy that has little to do with how good humans are at that task.
And in fact, for tasks like “produce plausible continuations of weather sensor data, Apache access logs, stack traces, or nucleotide sequences”, the performance of LLMs does not particularly resemble the performance of humans on those tasks.
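To make the weather-station point a bit more concrete, here is a toy sketch. Everything in it is an assumption for illustration: the "station" is a synthetic sine-plus-noise temperature stream, the function names are made up, and a bigram count table stands in for the transformer. The only thing it is meant to show is that the training objective is "predict the next token of this stream", and that the resulting accuracy depends on the regularities in the stream rather than on anything humans do.

```python
import math
import random
from collections import Counter, defaultdict


def synthetic_station_stream(n_readings, seed=0):
    """Hypothetical weather-station output: hourly temperatures as whole-degree tokens."""
    rng = random.Random(seed)
    stream = []
    for hour in range(n_readings):
        daily_cycle = 10.0 * math.sin(2 * math.pi * hour / 24)
        noise = rng.gauss(0.0, 0.5)
        stream.append(str(round(15.0 + daily_cycle + noise)))
    return stream


def train_bigram(tokens):
    """Count which token follows which (a crude stand-in for a learned predictor)."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts


def next_token_accuracy(counts, tokens, fallback):
    """Top-1 next-token accuracy; use `fallback` for contexts never seen in training."""
    correct = 0
    for prev, nxt in zip(tokens, tokens[1:]):
        seen = counts.get(prev)
        guess = seen.most_common(1)[0][0] if seen else fallback
        correct += guess == nxt
    return correct / (len(tokens) - 1)


stream = synthetic_station_stream(24 * 400)
split = 24 * 300
train, held_out = stream[:split], stream[split:]

model = train_bigram(train)
most_common_token = Counter(train).most_common(1)[0][0]

# Baseline: always guess the single most frequent training token.
unigram_acc = sum(t == most_common_token for t in held_out[1:]) / (len(held_out) - 1)
bigram_acc = next_token_accuracy(model, held_out, most_common_token)

print(f"always-guess-most-common accuracy: {unigram_acc:.3f}")
print(f"bigram next-token accuracy:        {bigram_acc:.3f}")
```

Nothing in this loop refers to human performance; a stronger sequence model would simply be fit to the same stream and scored the same way.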
Good post!