I would compare a model being trained to the computation that a single brain does over its lifetime to configure itself (or restrict the comparison to childhood only).
If we are going to compare TCO, I would point out that for many scenarios, ‘a single brain’ is a wild underestimate of the necessary computation because you cannot train just a single brain to achieve the task.
For some tasks like ImageNet classification, sure, pretty much every human can do it and one only needs enough compute for a single brain lifetime; but for many tasks of considerable interest, you need to train many human brains, rejecting and filtering most of them along the way, to (temporarily) get a single human brain achieving the desired performance level. There is no known way around that need for brute force.
For example, how many people can become as good an artist as Parti already is? One in a hundred, maybe? (Let’s not even require them to have the same universality; we’ll let them specialize in any style or medium...) How many people are as flexible and versatile as GPT-3 at writing? (Let’s just consider poetry. Even while unable to rhyme, GPT-3 is better at writing poetry than almost everyone at your local high school or college, so that’s out of hundreds to thousands of humans. Go read student anthologies if you don’t believe me.)
Or more pointedly, you need exactly 1 training run of MuZero to get a world champion-level Go/chess/shogi agent (and then some) the first time, every time, neither further training nor R&D required; how many humans do you need to churn through to get a single Go world champion? Well, you need approximately 1.4 billion Chinese/South-Korean/Japanese people to produce a Ke Jie or Lee Sedol as an upper bound (total population), around a hundred million as an intermediate point (approximate magnitude of the number of people who know how to play), and tens of thousands of people as a lower bound (being as conservative as possible, by only counting a few years of highly-motivated insei children enrolled in full-time Go schools with the explicit goal of becoming Go professionals and even world champion, almost all of whom will fail to do so because they just can’t hack it). And then the world champ loses after 5–10 years, max, because they get old and rusty, and now you throw out all that compute and churn again for a replacement; meanwhile, the MuZero agent remains as pristinely superior as the day it was trained, and costs you 0 FLOPS to ‘train a replacement’. This all holds true for chess and shogi as well, just with smaller numbers. So, whether that’s a >10,000× or >1,400,000,000× multiplication of the cost, it’s non-trivial, and unflattering to human brain efficiency estimates.
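The churn arithmetic above can be put in numbers; a minimal back-of-envelope sketch, where the pool sizes come from the estimates in the text and the 50-year deployment horizon is purely an assumption for illustration:

```python
# Back-of-envelope for the human "churn" multiplier on producing one
# Go world champion, using the pool estimates from the text above.

def churn_multiplier(pool: int, champions: int = 1) -> float:
    """Brains that must be trained per champion produced."""
    return pool / champions

bounds = {
    "upper bound (total population)": 1_400_000_000,
    "intermediate (people who can play)": 100_000_000,
    "lower bound (insei enrollment)": 10_000,
}

for label, pool in bounds.items():
    print(f"{label}: >{churn_multiplier(pool):,.0f}x one brain-lifetime")

# The champion also has to be replaced every 5-10 years, while the
# trained model persists; over an assumed deployment horizon the human
# cost recurs but the model's does not.
years_of_use = 50        # hypothetical horizon, not from the text
champion_tenure = 7.5    # midpoint of the 5-10 year reign above
print(f"human replacements over {years_of_use} years: "
      f"~{years_of_use / champion_tenure:.0f}; model replacements: 0")
```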
So, if you really think we should only be comparing training budgets to single brain-lifetimes, then I would be curious to hear how you think you can train a human Go world champ who can be useful for as long as a DL world champ, using only a single brain-lifetime.
Considering what a country can do in a decade does make sense. But it is still relatively close compared to multi-millennia evolutionary timescales.
At that “country level” we should also account for model hyperparameter tuning and the like. It is not decisive, but it is not as if we precommit to using that single run if it is not maximally competent at what we think the method can provide.
Humans produce Go professionals as a side product, or as one mode of answering the question of life. Even quite strict Go professionals do things like prepare meals, file taxes, and watch television. It might be unethical to set up a scenario that tests the single-task performance of human brains. “Do Go really well and a passable job at stereoscopic 3D vision” is a different task than just “Do Go really well”. Humans being able to do ImageNet classifications without knowing to prepare for that specific task is quite a lot more than just having the capability. In contrast, most models get an environment or data that is very pointedly shaped to be helpful for their target task.
Human filtering is also pretty much calibrated to human ability levels, i.e. a good painter is a good human painter. So the “miss rate” from trying to gather the cream of the crop doesn’t really show that it would be a generally unreliable method.
Considering what a country can do in a decade does make sense. But it is still relatively close compared to multi-millennia evolutionary timescales.
I’m not sure what you mean here. If you want to incorporate all of the evolution before that into that multiplier of ‘1.4 billion’, so it’s thousands of times that, that doesn’t make human brains look any more efficient.
Humans produce Go professionals as a side product, or as one mode of answering the question of life. Even quite strict Go professionals do things like prepare meals, file taxes, and watch television.
All of those are costs and disadvantages to the debit of human Go FLOPS budgets; not credits or advantages.
At that “country level” we should also account for model hyperparameter tuning and the like.
Sure, but that is a fixed cost which is now in the past, and need never be paid again. The MuZero code is written, and the hyperparameters are done. They are amortized over every year that the MuZero-trained model exists, so as humans turn over at the same cost every era, the DL R&D cost approaches zero and becomes irrelevant. (Not that it was ever all that large, since the total compute budget for such research tends to be more like 10–100× the final training run, and can be <1× in scaling research where one pilots tiny models before the final training run: T5 and GPT-3 did that. So, irrelevant compared to the factors we are talking about, like ≫10,000×.)
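For concreteness, the amortization claim can be sketched numerically; the 10–100× R&D multiples come from the paragraph above, while the year counts are arbitrary illustrative choices:

```python
# One-time R&D overhead (in units of one final training run) amortized
# over years of model use; the human pipeline cost, by contrast, recurs
# every generation, so its per-year cost never shrinks.

def amortized_overhead(rd_multiple: float, years: float) -> float:
    """Effective R&D overhead per year of use of the trained model."""
    return rd_multiple / years

for rd_multiple in (10, 100):
    for years in (1, 10, 50):
        print(f"R&D {rd_multiple}x amortized over {years}y: "
              f"{amortized_overhead(rd_multiple, years):.1f}x per year")
```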
“Do Go really well and a passable job at stereoscopic 3D vision” is a different task than just “Do Go really well”.
But not one that anyone has set, or paid for; no one cares in the slightest whether Lee Sedol can see stereoscopic 3D images.
Humans being able to do ImageNet classifications without knowing to prepare for that specific task is quite a lot more than just having the capability.
I think you are greatly overrating human knowledge of the 117 dog breeds in ImageNet, and in any case, zero-shot ImageNet is pretty good these days.
In contrast, most models get an environment or data that is very pointedly shaped to be helpful for their target task.
Again, a machine advantage and a human disadvantage.
Human filtering is also pretty much calibrated to human ability levels, i.e. a good painter is a good human painter. So the “miss rate” from trying to gather the cream of the crop doesn’t really show that it would be a generally unreliable method.
I don’t know what you mean by this. The machines either do or do not pass thresholds that varying numbers of humans fail to pass. Of course you can have floor effects, where a task is so easy that every human and machine can do it and so there is no human penalty multiplier; but there are many tasks of considerable interest where that is obviously not the case, and the human inefficiency is truly exorbitant and left out of your analysis. Chess, Go, shogi, poetry, painting: these are all tasks that exist, and there are more, and will be more.
Are you actually talking about DALL-E 2?
No. DALL-E 2 is not SOTA, so there is no point in citing an old system from almost half a year ago as the example.