Cool, thanks. I agree that specifying the problem won’t get solved by itself. In particular, I don’t think that any jobs will become automated by describing the task and giving 10 examples to an insanely powerful language model. I realise that I haven’t been entirely clear on this (and indeed, my intuitions about this are still in flux). Currently, my thinking goes along the following lines:
Fine-tuning on a representative dataset is really, really powerful, and it gets more powerful the narrower the task is. Since most benchmarks are more narrow than the things we want to automate, and it’s easier to game more narrow benchmarks, I don’t trust trends based on narrow, fine-tuned benchmarks that much.
However, in a few-shot setting, there’s not enough data to game the benchmarks in an overly narrow way. Instead, they can fairly be treated as a sample from all possible questions you could ask the model. If the model can answer some SuperGLUE questions that seem reasonably difficult, then my default assumption is that it could also answer other natural language questions that seem similarly difficult.
This isn’t always an accurate way of predicting performance, because of our poor abilities to understand what questions are easy or hard for language models.
However, it seems like it should at least be an unbiased prediction; I’m as likely to think that benchmark question A is harder than non-benchmark question B as I am to think that B is harder than A (for A, B that are in fact similarly hard for a language model).
However, when automating stuff in practice, there are two important problems that speak against using few-shot prompting:
As previously mentioned, tasks-to-be-automated are less narrow than the benchmarks. Prompting with examples seems less useful for less narrow situations, because each example may be much longer and/or you may need more prompts to cover the variation of situations.
Finetuning is in fact really powerful. You can probably automate stuff with finetuning long before you can automate it with few-shot prompting, and there’s no good reason to wait for models that can do the latter.
Thus, I expect that in practice, telling the model what to do will happen via finetuning (perhaps even in an RL-fashion directly from human feedback), and the purpose of the benchmarks is just to provide information about how capable the model is.
I realise this last step is very fuzzy, so to spell out a procedure somewhat more explicitly: When asking whether a task can be automated, I think you can ask something like “For each subtask, does it seem easier or harder than the ~solved benchmark tasks?” (optionally including knowledge about the precise nature of the benchmarks, e.g. that the model can generally figure out what an ambiguous pronoun refers to, or figure out if a stated hypothesis is entailed by a statement). Of course, a number of problems make this pretty difficult:
It assumes some way of dividing tasks into a number of sub-tasks (including the subtask of figuring out what subtask the model should currently be trying to answer).
Insofar as that which we’re trying to automate is “farther away” from the task of predicting internet corpora, we should adjust for how much finetuning we’ll need to make up for that.
We’ll need some sense of how 50 in-prompt-examples showing the exact question-response format compares to 5000 (or more; or less) finetuning samples showing what to do in similar-but-not-exactly-the-same-situation.
Nevertheless, I have a pretty clear sense that if someone told me “We’ll reach near-optimal performance on benchmark X with <100 examples in 2022” I would update differently on ML progress than if they told me the same thing would happen in 2032; and if I learned this about dozens of benchmarks, the update would be non-trivial. This isn’t about “benchmarks” in particular, either. The completion of any task gives some evidence about the probability that a model can complete another task. Benchmarks are just the things that people spend their time recording progress on, so it’s a convenient list of tasks to look at.
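To make the per-subtask question a bit more tangible, here is a rough sketch in Python. Everything in it is hypothetical: the `Subtask` fields, the arbitrary difficulty scale, and especially the idea that "distance from predicting internet corpora" can be folded in as a simple additive penalty. It only shows the shape of the procedure and is not a real forecasting method.

```python
from dataclasses import dataclass


@dataclass
class Subtask:
    name: str
    # Subjective judgement on an arbitrary scale: negative means "seems easier
    # than the ~solved benchmark tasks", positive means "seems harder".
    relative_difficulty: float
    # Assumed additive penalty for being "farther away" from predicting
    # internet corpora (i.e. needing more finetuning). Purely illustrative.
    distance_from_pretraining: float = 0.0


def seems_automatable(subtasks, margin=0.0):
    """Return (verdict, blockers): verdict is True only if no subtask looks
    harder than the solved benchmarks after the adjustment."""
    blockers = [
        s.name
        for s in subtasks
        if s.relative_difficulty + s.distance_from_pretraining > margin
    ]
    return (not blockers, blockers)


tasks = [
    Subtask("resolve an ambiguous pronoun", -0.5),
    Subtask("draft a customer reply", 0.2, distance_from_pretraining=0.3),
]
verdict, blockers = seems_automatable(tasks)
print(verdict, blockers)
```

The hard parts listed above (choosing the subtasks, sizing the finetuning adjustment, comparing in-prompt examples with finetuning samples) all live in the numbers you feed in, which is exactly why the estimate stays fuzzy.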
for us to know the exact thing we want and precisely characterize it is basically the condition for something being subject to automation by traditional software. ML can come into play where the results don’t really matter that much, with things like search/retrieval, ranking problems,
I’m not sure what you’re trying to say here? My naive interpretation is that we only use ML when we can’t be bothered to write a traditional solution, but I don’t think you believe that. (To take a trivial example: ML can recognise birds far better than any software we can write.)
My take is that for us to know the exact thing we want and precisely characterize it is indeed the condition for writing traditional software; but for ML, it’s sufficient that we can recognise the exact thing that we want. There are many problems where we recognise success without having any idea about the actual steps needed to perform the task. Of course, we also need a model with sufficient capacity, and a dataset with sufficiently many examples of this task (or an environment where such a dataset can be produced on the fly, RL-style).
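A toy illustration of the difference, as a sketch rather than anything resembling real bird recognition: below, nobody writes down the rule; we only supply examples that a human has recognised as positive or negative, and the "model" picks whichever cut-off best reproduces those labels. The single numeric feature, the example data and the function name `fit_threshold` are all made up for the illustration.

```python
def fit_threshold(examples):
    """Pick the cut-off on a single feature that best reproduces the labels.

    examples: list of (feature_value, label) pairs, where label is True/False
    and was assigned by someone who can recognise the answer, not specify it.
    """
    candidates = sorted({x for x, _ in examples})

    def accuracy(threshold):
        # Fraction of examples where "feature >= threshold" matches the label.
        return sum((x >= threshold) == y for x, y in examples) / len(examples)

    return max(candidates, key=accuracy)


# Hypothetical labelled data: we never state the rule, only the verdicts.
labelled = [
    (0.1, False), (0.3, False), (0.4, False),
    (0.6, True), (0.8, True), (0.9, True),
]
threshold = fit_threshold(labelled)
print(threshold)  # 0.6
```

The traditional-software route would require us to already know that the rule is "feature >= 0.6"; here it falls out of the labels, given enough examples and a hypothesis class with enough capacity to express it.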
On the rest, I think the best way I can think of explaining this is in terms of alignment and not correctness.
My naive interpretation is that we only use ML when we can’t be bothered to write a traditional solution, but I don’t think you believe that. (To take a trivial example: ML can recognise birds far better than any software we can write.)
The bird example is good. My contention is basically that when it comes to making something like “recognizing birds” economically useful, there is an enormous chasm between 90% performance on a subset of ImageNet and money in the bank. For two reasons, among others:
Alignment. What do we mean by “recognize birds”? Do pictures of birds count? Cartoon birds? Do we need to identify individual organisms e.g. for counting birds? Are some kinds of birds excluded?
Engineering. Now that you have a module which can take in an image and output whether it has a bird in it, how do you produce value?
I’ll admit that this might seem easy to do, and that ML is doing pretty much all the heavy lifting here. But my take on that is it’s because object recognition/classification is a very low-level and automatic, sub-cognitive, thing. Once you start getting into questions of scene understanding, or indeed language understanding, there is an explosion of contingencies beyond silly things like cartoon birds. What humans are really really good at is understanding these (often unexpected) contingencies in the context of their job and business’s needs, and acting appropriately. At what point would you be willing to entrust an ML system to deal with entirely unexpected contingencies in a way that suits your business needs (and indeed, doesn’t tank them)? Even the highest level of robustness on known contingencies may not be enough, because almost certainly, the problem is fundamentally underspecified from the instructions and input data. And so, in order to successfully automate the task, you need to successfully characterize the full space of contingencies you want the worker to deal with, perhaps enforcing it by the architecture of your app or business model. And this is where the design, software engineering, and domain-specific understanding aspects come in. Because no matter how powerful our ML systems are, we only want to use them if they’re aligned (or if, say, we have some kind of bound on how pathologically they may behave, or whatever), and knowing that is in general very hard. More powerful ML does make the construction of such systems easier, but is in some way orthogonal to the alignment problem. I would make this more concrete but I’m tired so I hope the concrete examples I gave earlier in the discussion serve as inspiration enough.
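For what it’s worth, a minimal sketch of what "characterizing the contingencies and enforcing them in the architecture" could look like for the bird case. All names here (`BirdSpec`, `Detection`, `decide`) are hypothetical, the deferral-to-a-human fallback is one assumed design among many, and the `known_context` flag simply assumes away the hard part, which is knowing when an input falls outside what the spec anticipated.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BirdSpec:
    """Explicit answers to 'what do we mean by recognise birds?'."""
    count_depictions: bool = False       # e.g. a photo of a painting of a bird
    count_cartoons: bool = False
    excluded_species: frozenset = frozenset()
    min_confidence: float = 0.9          # below this, do not act automatically


@dataclass(frozen=True)
class Detection:
    """Hypothetical output of an upstream recognition module."""
    species: str
    confidence: float
    is_depiction: bool = False
    is_cartoon: bool = False
    known_context: bool = True           # False: outside anything the spec anticipated


def decide(d: Detection, spec: BirdSpec) -> str:
    """Return 'bird', 'not_bird', or 'ask_human'."""
    # Bound the damage: anything unanticipated or uncertain goes to a person.
    if not d.known_context or d.confidence < spec.min_confidence:
        return "ask_human"
    if d.is_cartoon and not spec.count_cartoons:
        return "not_bird"
    if d.is_depiction and not spec.count_depictions:
        return "not_bird"
    if d.species in spec.excluded_species:
        return "not_bird"
    return "bird"


spec = BirdSpec(excluded_species=frozenset({"pigeon"}))
print(decide(Detection("robin", 0.97), spec))                   # bird
print(decide(Detection("robin", 0.97, is_cartoon=True), spec))  # not_bird
print(decide(Detection("robin", 0.50), spec))                   # ask_human
```

Note that none of the branches get easier to write just because the recognition module gets more accurate; every flag is a business decision someone has to make.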
And, yeah. I should also clarify that my position is in some way contingent on ML already being good enough to eat all kinds of stuff that 10 years ago would be unheard of. I don’t mean to dunk on ML’s economic value. But basically what I think is that a lot of pretty transformative AI is already here. The market has taken up a lot of it, but I’m sure there’s plenty more slack to be made up in terms of productivity gains from today’s (and yesterday’s) ML. This very well might result in a doubling of worker productivity, which we’ve seen many times before and which seems to meet some definition of “producing the majority of the economic value that a human is capable of.” Maybe if I had a better sense of the vision of “transformative AI” I would be able to see more clearly how ML progress relates to it. But again, even then I don’t necessarily see the connection to extrapolation on benchmarks, which are inherently just measuring sticks of their day and kind of separate from the economic questions.
Anyway, thanks for engaging. I’m probably going to duck out of responding further because of holiday and other duties, but I’ve enjoyed this exchange. It’s been a good opportunity to express, refine, & be challenged on my views. I hope you’ve felt that it’s productive as well.
This has definitely been productive for me. I’ve gained useful information, I see some things more clearly, and I’ve noticed some questions I still need to think a lot more about. Thanks for taking the time, and happy holidays!
Re: how to update based on benchmark progress in general, see my response to you above.