RE GPT-3, etc. doing well on math problems: the key word in my response was “robustly”. I think there is a big qualitative difference between “doing a good job on a certain distribution of math problems” and “doing math (robustly)”. This could be obscured by the fact that people also make mathematical errors sometimes, but I think the type of errors is importantly different from those made by DNNs.
This is a distribution of math problems GPT-3 wasn’t finetuned on. Yet it’s able to few-shot generalize and perform well. This is an amazing level of robustness relative to 2018 deep learning systems. I don’t see why scaling and access to external tools (e.g. to perform long calculations) wouldn’t produce the kind of robustness you have in mind.
I think you’re moving the goal-posts, since before you mentioned “without external calculators”. I think external tools are likely to be critical to doing this, and I’m much more optimistic about that path to doing this kind of robust generalization. I don’t think that necessarily addresses concerns about how the system reasons internally, though, which still seems likely to be critical for alignment.
RE GPT-3, etc. doing well on math problems: the key word in my response was “robustly”. I think there is a big qualitative difference between “doing a good job on a certain distribution of math problems” and “doing math (robustly)”. This could be obscured by the fact that people also make mathematical errors sometimes, but I think the type of errors is importantly different from those made by DNNs.
This is a distribution of math problems GPT-3 wasn’t finetuned on. Yet it’s able to few-shot generalize and perform well. This is an amazing level of robustness relative to 2018 deep learning systems. I don’t see why scaling and access to external tools (e.g. to perform long calculations) wouldn’t produce the kind of robustness you have in mind.
I think you’re moving the goal-posts, since before you mentioned “without external calculators”. I think external tools are likely to be critical to doing this, and I’m much more optimistic about that path to doing this kind of robust generalization. I don’t think that necessarily addresses concerns about how the system reasons internally, though, which still seems likely to be critical for alignment.