It is quite open about what it does, but this is pretty much statistics hacking.
I would compare a model being trained to the computations that a single brain does over its lifetime to configure itself (or restrict only to childhood).
For the brain-evolution analog, I would also include the brain metabolism of the computer scientists developing the next version of the model.
I would not hold the inefficiency of the solar panels against the CPU. Likewise, I don’t see how it is fair to blame the brain for the inefficiency of the gut. If we can blame the gut, then we should compare how much the model causes its electricity supply to increase, which for many is equal to 0.
I would compare a model being trained to the computations that a single brain does over its lifetime to configure itself (or restrict only to childhood).
If we are going to compare TCO, I would point out that for many scenarios, ‘a single brain’ is a wild underestimate of the necessary computation because you cannot train just a single brain to achieve the task.
For some tasks like ImageNet classification, sure, pretty much every human can do it and one only needs enough compute for a single brain lifetime; but for many tasks of considerable interest, you need to train many human brains, rejecting and filtering most of them along the way, to (temporarily) get a single human brain achieving the desired performance level. There is no known way around that need for brute force.
For example, how many people can become as good artists as Parti already is? One in a hundred, maybe? (Let’s not even require them to have the same universality, we’ll let them specialize in any style or medium...) How many people are as flexible and versatile as GPT-3 at writing? (Let’s just consider poetry. Even while unable to rhyme, GPT-3 is better at writing poetry than almost everyone at your local high school or college, so that’s out of hundreds to thousands of humans. Go read student anthologies if you don’t believe me.)
Or more pointedly, you need exactly 1 training run of MuZero to get a world champion-level Go/chess/shogi agent (and then some) the first time, like every time, neither further training nor R&D required; how many humans do you need to churn through to get a single Go world champion? Well, you need approximately 1.4 billion Chinese/South-Korean/Japanese people to produce a Ke Jie or Lee Sedol as an upper bound (total population), around a hundred million as an intermediate point (approximate magnitude of number of people who know how to play), and tens of thousands of people as a lower bound (being as conservative as possible, by only counting a few years of highly-motivated insei children enrollment in full-time Go schools with the explicit goal of becoming Go professionals and even world champ, almost all of whom will fail to do so because they just can’t hack it). And then the world champ loses after 5-10 years, max, because they get old and rusty, and now you throw out all that compute, and churn again for a replacement; meanwhile, the MuZero agent remains as pristinely superior as the day it was trained, and costs you 0 FLOPS to ‘train a replacement’. This all holds true for chess and shogi as well, just with smaller numbers. So, whether that’s a >10,000× or >1,400,000,000× multiplication of the cost, it’s non-trivial, and unflattering to human brain efficiency estimates.
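The churn multiplier above can be made concrete with a back-of-envelope sketch. All of the constants below are my own illustrative assumptions (a common ~1e15 FLOP/s brain estimate, a hypothetical ~1e23 FLOP training run), not figures from the discussion:

```python
# Back-of-envelope churn multiplier for producing one Go world champion.
# All constants are illustrative assumptions, not measured values.

BRAIN_FLOP_PER_S = 1e15          # assumed brain compute rate
SECONDS_PER_YEAR = 3.15e7

def human_champ_flops(candidates: int, years_each: float) -> float:
    """Total brain-compute spent across all candidates churned through."""
    return candidates * years_each * SECONDS_PER_YEAR * BRAIN_FLOP_PER_S

# Lower bound from the text: tens of thousands of insei-style students
# training full-time for a few years, almost all of whom wash out.
lower = human_champ_flops(candidates=10_000, years_each=5)

# Assumed one-off training budget for a MuZero-style run (illustrative).
MODEL_TRAINING_FLOPS = 1e23

print(f"human compute: {lower:.1e} FLOP")
print(f"multiplier vs one training run: {lower / MODEL_TRAINING_FLOPS:.0f}x")
```

Even this deliberately conservative lower bound lands in the >10,000× range, before counting the re-churn every time a champion ages out.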
So, if you really think we should only be comparing training budgets to single brain-lifetimes, then I would be curious to hear how you think you can train a human Go world champ who can be useful for as long as a DL world champ, using only a single brain-lifetime.
Considering what a country can do in a decade does make sense. But it is still relatively short compared to multi-millennia evolutionary timescales.
At that “country level” we should also account for the model’s hyperparameter tuning and such. It is not super decisive, but it is not as if we precommit to using that single run if it is not maximally competent at what we think the method can provide.
Humans produce Go professionals as a side product, or as one mode of answering the question of life. Even quite strict Go professionals do things like prepare meals, file taxes, and watch television. It might be unethical to set up a scenario that tests the single-task performance of human brains. “Do Go really well and do a passable job at stereoscopic 3D vision” is a different task than just “Do Go really well”. Humans being able to do ImageNet classification without knowing to prepare for that specific task is quite a lot more than just having the capability. In contrast, most models get an environment or data that is very pointedly shaped to be helpful for their target task.
Human filtering is also pretty much calibrated to human ability levels, i.e. a good painter is a good human painter. Thus the “miss rate” from trying to gather the cream of the crop doesn’t really show that it would be a generally unreliable method.
Considering what a country can do in a decade does make sense. But it is still relatively short compared to multi-millennia evolutionary timescales.
I’m not sure what you mean here. If you want to incorporate all of the evolution before that into that multiplier of ‘1.4 billion’, so it’s thousands of times that, that doesn’t make human brains look any more efficient.
Humans produce Go professionals as a side product, or as one mode of answering the question of life. Even quite strict Go professionals do things like prepare meals, file taxes, and watch television.
All of those are costs and disadvantages to the debit of human Go FLOPS budgets; not credits or advantages.
At that “country level” we should also account for the model’s hyperparameter tuning and such.
Sure, but that is a fixed cost which is now in the past, and need never be done again. The MuZero code is written, and the hyperparameters are done. They are amortized over every year that the MuZero trained model exists, so as humans turn over at the same cost every era, the DL R&D cost approaches zero and becomes irrelevant. (Not that it was ever all that large, since the total compute budget for such research tends to be more like 10-100x the final training run, and can be <1x in scaling research where one pilots tiny models before the final training run: T5 or GPT-3 did that. So, irrelevant compared to the factors we are talking about like >>10,000x.)
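The amortization argument can be sketched numerically. The cost figures below are illustrative assumptions (an R&D overhead at the top of the 10-100× range quoted above, and a hypothetical 10,000× human pipeline with a 10-year reign):

```python
# Amortized compute per year of champion-level service.
# All figures are illustrative assumptions, in units of one training run.

TRAIN_RUN = 1.0
RND_MULTIPLIER = 100            # assumed R&D overhead: ~10-100x the final run

def model_cost_per_year(years_in_service: float) -> float:
    """Fixed (training + R&D) cost amortized over the model's service life."""
    return TRAIN_RUN * (1 + RND_MULTIPLIER) / years_in_service

def human_cost_per_year(pipeline_cost: float, reign_years: float) -> float:
    """Humans re-pay the whole pipeline every time a champion ages out."""
    return pipeline_cost / reign_years

# The model's per-year cost shrinks the longer it stays in service...
print(model_cost_per_year(1), model_cost_per_year(10), model_cost_per_year(100))
# ...while the human pipeline (say 10,000x a training run, 10-year reign)
# costs the same every generation, forever.
print(human_cost_per_year(pipeline_cost=10_000, reign_years=10))
```

The fixed R&D term divides away with time, while the recurring human pipeline cost never does, which is why the one-off overhead is irrelevant next to the >>10,000× factors at issue.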
“Do Go really well and do a passable job at stereoscopic 3D vision” is a different task than just “Do Go really well”.
But not one that anyone has set, or paid for, or cares even the slightest about whether Lee Sedol can see stereoscopic 3D images.
Humans being able to do ImageNet classification without knowing to prepare for that specific task is quite a lot more than just having the capability.
I think you are greatly overrating human knowledge of the 117 dog breeds in ImageNet, and in any case, zero-shot ImageNet is pretty good these days.
In contrast, most models get an environment or data that is very pointedly shaped to be helpful for their target task.
Again, a machine advantage and a human disadvantage.
Human filtering is also pretty much calibrated to human ability levels, i.e. a good painter is a good human painter. Thus the “miss rate” from trying to gather the cream of the crop doesn’t really show that it would be a generally unreliable method.
I don’t know what you mean by this. The machines either do or do not pass the thresholds that varying numbers of humans fail to pass; of course you can have floor effects where the tasks are so easy that every human and machine can do it, and so there is no human penalty multiplier, but there are many tasks of considerable interest where that is obviously not the case and the human inefficiency is truly exorbitant and left out of your analysis. Chess, Go, Shogi, poetry, painting, these are all tasks that exist, and there are more, and will be more.
It is quite open about what it does, but this is pretty much statistics hacking.
Like I said, there’s plenty of uncertainty in FLOP/s. Maybe it’s helpful if I rephrase this as an invitation for everyone to make their own modifications to the model.
I would compare a model being trained to the computations that a single brain does over its lifetime to configure itself (or restrict only to childhood).
Cotra’s lifetime anchor is 10^29 FLOPs (so 4-5 OOMs above gradient descent). That’s still quite a chasm.
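To make those orders of magnitude concrete: the lifetime length (~30 formative years) and the training-run size (~3e24 FLOP) below are my own illustrative assumptions, used only to show what the cited anchor and gap imply:

```python
import math

# Rough arithmetic behind the 'chasm' (illustrative assumptions throughout).
LIFETIME_ANCHOR_FLOP = 1e29      # the lifetime anchor as cited above
SECONDS_PER_YEAR = 3.15e7
lifetime_s = 30 * SECONDS_PER_YEAR          # assume ~30 formative years

# A 1e29-FLOP lifetime over ~1e9 seconds implies a very high brain rate.
implied_flop_per_s = LIFETIME_ANCHOR_FLOP / lifetime_s
print(f"implied brain rate: {implied_flop_per_s:.1e} FLOP/s")

TRAIN_RUN_FLOP = 3e24            # assumed large-model training run
gap_ooms = math.log10(LIFETIME_ANCHOR_FLOP / TRAIN_RUN_FLOP)
print(f"gap: {gap_ooms:.1f} orders of magnitude")
```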
For the brain-evolution analog, I would also include the brain metabolism of the computer scientists developing the next version of the model.
Do you mean including the CS brain activity in the computed costs of training the model?
I would not hold the inefficiency of the solar panels against the CPU. Likewise, I don’t see how it is fair to blame the brain for the inefficiency of the gut. If we can blame the gut, then we should compare how much the model causes its electricity supply to increase, which for many is equal to 0.
If you’re asking yourself whether or not you want to automate a certain role, then a practical subquestion is how much you have to spend on maintenance/fuel (i.e., electricity or food). In that case, I do think acknowledging the source of the fuel becomes important.
Yes, I think GPT-1 turning into GPT-2 and GPT-3 is the thing analogous to building brains out of new combinations of DNA. An instance of GPT-3 honing its weights and a single brain pruning and forming its connections are comparable. When doing Fermi estimates, getting the ballpark wrong is pretty fatal, as that is the core of the activity. With that much conceptual confusion going on, I don’t care about the numbers. Claiming that others are making mistakes while not surviving a cursory look oneself does not bode well for convincingness. I don’t care to get lured by pretty graphs into thinking my ignorance is more informed than it is.
If I know that the researchers look at the data until they find a correlation with p<0.05, then the fact that they found something is not really significant news. Similarly, if you keep changing your viewpoint until you find an angle where the orderings seem to reverse, it is less convincing that this one viewpoint is the one that matters.
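The “keep looking until p<0.05” point is just the familywise false-positive rate: test enough independent angles under the null and a “significant” finding is nearly guaranteed. A minimal sketch:

```python
# Probability of finding at least one 'significant' result at alpha = 0.05
# when checking k independent angles, all of which are pure noise.
ALPHA = 0.05

def chance_of_false_positive(k: int) -> float:
    """Familywise false-positive rate for k independent null tests."""
    return 1 - (1 - ALPHA) ** k

for k in (1, 10, 100):
    print(k, round(chance_of_false_positive(k), 3))
```

With 100 angles tried, a “hit” occurs more than 99% of the time even when nothing real is there, which is exactly why a result found by search carries so little evidential weight.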
Economically, I would be interested in the ability to convert electricity to sugar and sugar to electricity. But because the end product is not the same, the processes are not nearly economically interchangeable. Go far enough in this direction and you measure everything in dollars. But typically, when we take care to specify that we care about energy efficiency and not, for example, time efficiency, we are going for more dimensions and more considerations rather than fewer.
Setting up terminology so that, if gas prices go up, the energy efficiency of everything that uses gas goes down does not seem useful to me.
Are you actually talking about DALL-E 2?
No. DALL-E 2 is not SOTA, so no point in citing some old system from almost half a year ago as the example.