How is it that bad at Codeforces? I competed a few years ago, and in my time Div 2 A and B were extremely simple, basically just "implement the described algorithm in code". If you submitted them quickly (which I would expect GPT-4 to excel at), it was easy to reach a significantly better rating than the one reported in this paper.
I hope they didn't make a mistake by misunderstanding the Codeforces rating system: after a contest, Codeforces only awards a fraction of the difference between your estimated performance rating and your current rating. It is, however, possible to calculate the exact rating equivalent of a given performance from the provided data if you know the details (which I've forgotten).
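To make the distinction concrete, here is a minimal Python sketch assuming a simplified Elo-style update in which only a fraction k of the gap is awarded after a contest. The function names, the value of k, and the update rule are illustrative assumptions, not the actual Codeforces formula:

```python
# Sketch of a simplified rating update (assumption: only a fraction k
# of the gap between contest performance and current rating is awarded;
# the real Codeforces formula is more involved, and k = 0.5 is made up).

def awarded_delta(current_rating: float, performance: float, k: float = 0.5) -> float:
    """Rating change awarded after one contest under this simplified model."""
    return k * (performance - current_rating)

def implied_performance(current_rating: float, delta: float, k: float = 0.5) -> float:
    """Invert the update: recover the performance rating that the
    awarded delta actually corresponded to."""
    return current_rating + delta / k

# Example: an account at 1400 that gains +50 in one contest actually
# performed at 1500 under these assumptions, not 1450.
print(implied_performance(1400, 50))  # -> 1500.0
```

The worry, under this model, is that reporting the post-contest rating instead of the inverted performance rating would understate how well the model actually did.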
When I searched the paper for the exact methodology (by Ctrl-F'ing "codeforces"), I didn't find anything.
Codeforces is not marked as having a GPT-4 measurement on this chart. Yes, it’s a somewhat confusing chart.
I know. I skimmed the paper, and there is a table above the chart showing every model's results on the tasks (since every model's Codeforces performance is below 5%, they overlap on the chart). I replied to this comment because it seemed thematically the most appropriate (it asks about task performance); sorry if my choice of where to comment was confusing.
From the table:
GPT-3.5's Codeforces rating is "260 (below 5%)"
GPT-4's Codeforces rating is "392 (below 5%)"