See the second paragraph from the bottom of the paper. What they’re saying is that their model solves 34.2% of questions on the dataset they built for the model. Older models couldn’t have been tested on this dataset, since they predate it; instead, they were tested on the APPS benchmark (among others). So to compare the new model against older ones fairly, you need to run it on APPS as well. They did exactly that, though owing to differences between the datasets they couldn’t use certain parts of their methods. That’s what you see in Table 10.
I’ll try to cover this in the post, but that’s probably going to be a couple of hours.