I struggle to understand your first sentence. Do you cash out “Useful” as “Having the theoretical ability to do a task”? As in: “If an AI benchmarks better than humans at a task, but doesn’t generate revenue, the reason must be that the AI is not actually capable of doing the task”.
In the Paul-verse, how does AI contribute substantially to GDP at AI capability levels between “Average Human” and “Superintelligence”?
It seems (to me) that the reasons are practical ones: inertia, regulation, bureaucracy, conservatism, and so on, not “Lack of AI Capability”. As an example, assume that tomorrow Google has a better version of the same model, one that is 2 standard deviations above the human average on all language benchmarks we can think of. How would that double GDP?
There might not be time for the economy to double in size between “>2 standard deviations improvements on all language tasks” and “Able to substantially recursively self-improve”.
I think the issue here is that the tasks in question don’t fully capture everything we care about in terms of language facility. I think this is largely because even very low probabilities of catastrophic actions can preclude deployment in an economically useful way.
For example, a prime use of a language model would be to replace customer service representatives. However, if there is even a one-in-a-million chance that your model will start cursing out a customer, offer a customer a million dollars to remedy an error, or start spewing racial epithets, the model cannot be usefully deployed in such a fashion. None of the metrics in the paper can guarantee, or even suggest, that level of consistency.
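To put rough numbers on the deployment bar, here is a back-of-the-envelope sketch (both the failure rate and the traffic volume are invented for illustration):

```python
# Back-of-the-envelope: even a tiny per-conversation failure rate adds up at
# scale. Both numbers below are assumptions for illustration, not data.
p_failure = 1e-6                 # chance one conversation goes catastrophically wrong
conversations_per_day = 200_000  # assumed volume for a mid-sized deployment

expected_failures_per_year = p_failure * conversations_per_day * 365
print(f"{expected_failures_per_year:.0f} expected catastrophic failures per year")
# With these assumed numbers: ~73 incidents a year from a "one in a million" model.
```

Even at one in a million, a high-volume deployment would see public failures on a regular basis, which is exactly the kind of consistency guarantee that benchmark metrics say nothing about.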
I wonder what the failure probability is for human customer service employees.
Likely higher than one in a million, but a human employee can be fired after a failure to let the company save face. That’s harder to do with a $50M language model.
Just delete the context window and tweak the prompt.
But that doesn’t solve the problem of angry customers and media coverage the way firing a misbehaving employee would. Though I suppose this is more an issue of friction and aversion to change than an actual capabilities issue.
> I struggle to understand your first sentence. Do you cash out “Useful” as “Having the theoretical ability to do a task”? As in: “If an AI benchmarks better than humans at a task, but doesn’t generate revenue, the reason must be that the AI is not actually capable of doing the task”.
No, I mean that being able to do the task cheaply and at a high quality is simply not that valuable. AI went from being uncompetitive against professional Go players on top-notch hardware to being able to beat them running on a GPU you can buy for less than $100, but the consumer surplus that’s been created by this is very small.
> In the Paul-verse, how does AI contribute substantially to GDP at AI capability levels between “Average Human” and “Superintelligence”?
If AI is already as capable as an average human then you’re really not far off from the singularity, in the sense that gross world product growth will explode within a short time and I don’t know what happens afterwards. My own opinion (may not be shared by Paul) is that you can actually get to the singularity even with AI that’s much worse than humans just because AI is so much easier to produce en masse and to improve at the tasks it can perform.
I’ll have an essay coming out about takeoff speeds on Metaculus in less than ten days (will also be crossposted to LessWrong) so I’ll elaborate more on why I think this way there.
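As a toy sketch of this “mass-produced sub-human AI” intuition (my own construction; every parameter is invented), consider an economy where output can be reinvested into building more AI, and the AI also improves at its tasks each year:

```python
# Toy growth model: output comes from a fixed human workforce plus an AI
# stock. A constant share of output is reinvested into new AI, and AI
# productivity compounds as the systems improve. All parameters are made up.
human_labor = 1.0      # fixed; humans cannot be accumulated
ai_stock = 0.01        # starts tiny
ai_productivity = 0.3  # one unit of AI does 30% of a human's work (assumed)
reinvestment = 0.2     # share of output converted into new AI each year
improvement = 1.05     # AI gets 5% better at its tasks per year (assumed)

for year in range(1, 31):
    output = human_labor + ai_productivity * ai_stock
    ai_stock += reinvestment * output   # output buys more AI
    ai_productivity *= improvement      # and the AI itself improves
    if year % 10 == 0:
        print(f"year {year}: output = {output:.2f}")
```

Because the AI stock accumulates out of output while the human workforce cannot, and the AI is simultaneously getting better, the growth rate of output keeps rising rather than settling at a constant exponential, even though the AI starts out far worse than a human.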
> It seems (to me) that the reasons are practical ones: inertia, regulation, bureaucracy, conservatism, and so on, not “Lack of AI Capability”. As an example, assume that tomorrow Google has a better version of the same model, one that is 2 standard deviations above the human average on all language benchmarks we can think of. How would that double GDP?
Why do you think being above the human average on all language benchmarks is something that should cash out in the form of a big consumer surplus? I think we agree that this is not true for playing Go, recognizing pictures of cats, or generating impressive-looking original art, so what is the difference when it comes to being better at predicting the next word in a sentence or at solving logic puzzles given in verbal format?
> There might not be time for the economy to double in size between “>2 standard deviations improvements on all language tasks” and “Able to substantially recursively self-improve”.
Of course there might not be time, but I’m happy to take you up on a bet (a symbolic one if actual settlement in the event of a singularity is meaningless) at even odds if you think this is more likely than the alternative.
Assume that as a consequence of being in the Paul-verse, regulatory and other practical obstacles are possible to overcome in a very cost-effective way. In this world, how much value do current language models create?
I would answer that in this obstacle-free world, they would create about 10% of global GDP, and this share would be rapidly increasing. This is because a large set of valuable tasks are both simple enough that models could understand them and possible to transform into a prompt-completion task.
The argument is meant as a reductio: Language models don’t create value in our world, so the obstacles must be hard to overcome, so we are not in the Paul-verse.
I claim that most coordination tasks (defined very broadly) in our civilization could be done by language models talking to each other, if we could overcome the enormous obstacle of getting all relevant information into the prompts and transferring the completions to “the real world”.
Regarding the bet: Even odds sounds like easy money to me, so you’re on :). I weakly expect that my winning criteria will never come to pass, as we will be dead.
What exactly do you mean by “create 10% of global GDP”?
And why would you expect the current, quite unreliable language models to have such a drastic effect?
Anyway, I will counterbet that by 2032 most translation will be automated (90%), most programmers will use automated tools daily (70%), most top-level mathematics journals will use proof-checking software as part of their reviewing process (80%), and computer-generated articles will make up a majority of Internet “journalism” (50%).
I only have a vague idea what is meant by language models contributing to GDP.
Current language models are actually quite reliable when you give them easy questions. Practical deployments of language models are sometimes held to very high standards of reliability and lack of bias, possibly for regulatory, social, or other practical reasons. Yet I personally know someone who works in customer service and is somewhat racist and not very reliable.
I am not sure I understand your counterbet. I would guess most translation is already automated, most programmers already use automated tools, and most Internet “journalism” is already computer-generated.
> Assume that as a consequence of being in the Paul-verse, regulatory and other practical obstacles are possible to overcome in a very cost-effective way. In this world, how much value do current language models create?
> I would answer that in this obstacle-free world, they would create about 10% of global GDP, and this share would be rapidly increasing. This is because a large set of valuable tasks are both simple enough that models could understand them and possible to transform into a prompt-completion task.
I don’t agree with that at all. I think in this counterfactual world current language models would create about as much value as they create now, maybe higher by some factor but most likely not by an order of magnitude or more.
> The argument is meant as a reductio: Language models don’t create value in our world, so the obstacles must be hard to overcome, so we are not in the Paul-verse.
I know this is what your argument is. For me the conclusion implied by “language models don’t create value in our world” is “language models are not capable of creating value in our world & we’re not capable of using them to create value”, not that “the practical obstacles are hard to overcome”. Also, this last claim about “practical obstacles” is very vague: if you can’t currently buy a cheap ticket to Mars, is that a problem with “practical obstacles being difficult to overcome” or not?
In some sense there’s likely a billion-dollar company idea that would build on existing language models, so if someone thought of the idea and had the right group of people to implement it, they could be generating a lot of revenue. This would look very different from language models creating 10% of GDP, however.
> I claim that most coordination tasks (defined very broadly) in our civilization could be done by language models talking to each other, if we could overcome the enormous obstacle of getting all relevant information into the prompts and transferring the completions to “the real world”.
I agree with this in principle, but in practice I think current language models are much too bad for this to be on the cards.
> Regarding the bet: Even odds sounds like easy money to me, so you’re on :). I weakly expect that my winning criteria will never come to pass, as we will be dead.
I’ll be happy to claim victory when AGI is here and we’re not all dead.
> I claim that most coordination tasks (defined very broadly) in our civilization could be done by language models talking to each other, if we could overcome the enormous obstacle of getting all relevant information into the prompts and transferring the completions to “the real world”.
> I agree with this in principle, but in practice I think current language models are much too bad for this to be on the cards.
Assume PaLM magically improved to perform 2 standard deviations above the human average. In my model, this would have a very slow effect on GDP. How long do you think it would take before language models did >50% of all coordination tasks?
> Assume PaLM magically improved to perform 2 standard deviations above the human average. In my model, this would have a very slow effect on GDP. How long do you think it would take before language models did >50% of all coordination tasks?
2 standard deviations above the human average with respect to what metric? My whole point is that the metrics people look at in ML papers are not necessarily relevant in the real world, and/or the real-world impact (say, the revenue generated by the models) is a discontinuous function of these metrics.
I would guess that 2 standard deviations above human average on commonly used language modeling benchmarks is still far from enough for even 10% of coordination tasks, though by this point models could well be generating plenty of revenue.
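A minimal sketch of the discontinuity I have in mind (my own toy construction, with invented numbers): suppose a job is only automatable once the model can chain many steps with a high enough end-to-end success rate. Per-step accuracy, which is what benchmarks tend to measure, improves smoothly, while deployability, and hence revenue, jumps:

```python
# Toy threshold model: a 100-step task must succeed end-to-end 99% of the
# time before deployment makes sense. All numbers here are assumptions.
def deployable(per_step_accuracy: float, steps_per_task: int = 100,
               required_task_success: float = 0.99) -> bool:
    # The task succeeds only if every one of its steps succeeds.
    return per_step_accuracy ** steps_per_task >= required_task_success

for acc in (0.999, 0.9999, 0.99999):
    print(acc, deployable(acc))
# Smooth gains in per-step accuracy produce a sudden flip from
# "not deployable" to "deployable".
```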
I think we are close to agreeing with each other on how we expect the future to look. I certainly agree that real-world impact is discontinuous in the metrics, though I would blame practical matters rather than poor metrics.