I am surprised by the Iris result. I would have expected it to fail badly, similar to how it can’t solve most math word problems without inner-monologue, because it requires too much spread-out computation to transform an obtuse representation like “Input: 94, 47, 84, 31, output = 2” into a learned model and then generate a prediction. That it can handle Iris naively without any reformatting or trickery is surprising. Also interesting that it very much looks like it’s scaling with model-size. (I wonder if this should be considered a zero-shot or a few-shot result?)
The regression is also a lot better than one would expect from what is, remember, ‘a model trained to predict English words based on a pile of random scraped websites’. (I wonder how well humans would do without resorting to explicit graphing?)
I’m less surprised that your synthetic binary problems fail since I thought from the description they’d all fail, but in light of Iris/regression, I’m left wondering why two succeeded and the other didn’t. You could make an argument for naturalness & informative-priors, since Iris is real data and the regression curves are not real but very similar to loads of real data, but the 2D scatterplots for the synthetic binary problems don’t look all that unnatural to me. Is there any difference in formatting you omitted mentioning?
Is there any difference in formatting you omitted mentioning?
There shouldn’t be any difference – neither between Iris and the synthetic binary tasks, nor between the different synthetic binary tasks themselves – unless one snuck in without my noticing.
The only alternative formatting I tried was a preamble. The first time I ran Iris, I put a line before all the input vectors which said something like “This is a sequence of inputs and outputs of an integer function.” I then redid the experiment without that line, with no penalty to accuracy (the results shown are without the preamble), so when I later ran the synthetic binary experiments, I omitted any preamble.
In the regression experiments, I also originally added the line: “This is a sequence of inputs and outputs of a function which takes an integer as an argument and returns an integer.” I didn’t actually test whether regression performed better with that line, but in some examples it didn’t seem to make a difference.
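For concreteness, here is a rough sketch of how a prompt in this style could be assembled, with the preamble as an optional first line. It is not the code from the repository: the function name build_prompt, the one-example-per-line layout, and the shape of the final unanswered query line are all assumptions, and only the “Input: 94, 47, 84, 31, output = 2” line format comes from the discussion above.

```python
def build_prompt(train_rows, train_labels, query_row, preamble=None):
    """Format labelled integer vectors as text, ending with an unanswered query.

    Hypothetical helper: the layout is assumed, not taken from the repository.
    """
    lines = [preamble] if preamble else []
    for row, label in zip(train_rows, train_labels):
        features = ", ".join(str(value) for value in row)
        lines.append(f"Input: {features}, output = {label}")
    # Assumed query format: same prefix, with the label left for the model.
    query = ", ".join(str(value) for value in query_row)
    lines.append(f"Input: {query}, output =")
    return "\n".join(lines)
```

Calling it once with preamble=None and once with the preamble text gives the two variants compared above, differing only in the first line.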
(Technical note: for all the synthetic binary and regression tasks shown in this post, the “input text” (i.e. the way the training feature vectors were formatted) can be found in the linked repository, in experiments_log.json. The top level of the JSON is keyed by experiment name, and each experiment has an “input_text” key where this is stored. The input text for Iris is not stored, but there is some metadata in iris_results/. Running iris_test.py with the parts that send the input via the API commented out does confirm that the format is much the same, though.)
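If you want to compare the formatting yourself, something like the following should pull the stored text out of that file. It is a minimal sketch that assumes only the layout described in the note (experiment names at the top level, each with an “input_text” key); the helper name load_input_texts is made up, and entries without that key are skipped as a precaution.

```python
import json
from pathlib import Path


def load_input_texts(log_path):
    """Return {experiment_name: input_text} from a log laid out as described."""
    with Path(log_path).open(encoding="utf-8") as f:
        log = json.load(f)
    # Skip any entry that lacks "input_text" rather than raising.
    return {
        name: entry["input_text"]
        for name, entry in log.items()
        if isinstance(entry, dict) and "input_text" in entry
    }
```

Printing the first few lines of each value from load_input_texts("experiments_log.json") side by side is probably the quickest way to spot a stray formatting difference between tasks.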