Thanks, I’m grateful to you for reading the paper more closely than I did and reporting back interesting findings!
Their best model solved 34.2% of newly made test questions in their dataset, compared to the prior state of the art of 1-5% on existing datasets.
Can you elaborate on this? Was the prior state of the art Codex? It sounds like you are saying Codex etc. solved 1-5% of problems on various coding datasets, while AlphaCode 41B solves 34.2% of problems on their new coding dataset, which would be a huge leap forward if the datasets are comparably difficult (and arguably the new dataset is more difficult?). Is this what you are saying? Where did you see that in the paper? I only skimmed it, but the best I found was Figure 10, which didn’t quite contain the info I wanted.
See the second paragraph from the bottom of the paper. What they’re saying is that their model solves 34.2% of the questions in the dataset they built for the model. Older models couldn’t have been tested on this dataset, as they predate it; instead, they were tested on the APPS benchmark (amongst others). If you want to compare AlphaCode to older models in a fair way, you need to test it on APPS, which they did, though owing to dataset differences they couldn’t use certain parts of their method. That’s what you see in Table 10.
I’ll try to cover this in the post, but that’s probably going to be a couple of hours.