Two guesses on what’s going on with your experiences:
You’re asking for code which involves uncommon mathematics/statistics. In this case, progress on scicodebench is probably relevant, and it indeed shows remarkably slow improvement. (There are many reasons for this; one relatively easy thing to try is to break down the task, forcing the model to write down the appropriate formal reasoning before coding anything. LMs are stubborn about not doing CoT for coding, even when it’s obviously appropriate IME. A sketch of this two-stage prompting appears below the list.)
You are underspecifying your tasks (and maybe your questions are more niche than average), or otherwise prompting poorly, in a way which a human could handle but models are worse at. In this case, sitting down with someone doing similar tasks but getting more use out of LMs would likely help.
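To make the “break down the task” suggestion from the first guess concrete, here is a minimal sketch of two-stage prompting: derivation first, code second. The function names and the `call_llm` placeholder are hypothetical, not any particular API.

```python
# Hypothetical sketch of "write the formal reasoning before any code" as a
# two-stage prompt. `call_llm` stands in for whatever chat API you use; it is
# not a real library call, just a placeholder with signature (prompt) -> str.

def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your actual model call here")

def solve_numerical_task(task_description: str) -> str:
    # Stage 1: force the model to state the math explicitly, no code allowed.
    reasoning = call_llm(
        "Before writing any code, write out the formal derivation needed for "
        "this task: the estimator/formula, its assumptions, and the shapes and "
        "units of every quantity. Do not write code yet.\n\n" + task_description
    )
    # Stage 2: only now ask for an implementation, conditioned on that derivation.
    code = call_llm(
        "Implement the following derivation as a function. Keep the variable "
        "names consistent with the derivation, and flag any place where the "
        "code deviates from it.\n\nDerivation:\n" + reasoning
        + "\n\nOriginal task:\n" + task_description
    )
    return code
```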
I would contribute to a bounty for y’all to do this. I would like to know whether the slow progress is prompting-induced or not.
We did end up doing a version of this test. A problem came up in the course of our work which we wanted an LLM to solve (specifically, refactoring some numerical code to be more memory efficient). We brought in Ray, and Ray eventually concluded that the LLM was indeed bad at this, and that our day-to-day problems were apparently of a harder-for-LLMs sort than the ones he typically runs into in his own day-to-day work.
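For readers who haven’t done this kind of refactor, the pattern at stake looks roughly like the following. This is an illustrative NumPy sketch with made-up names, not the actual code from our problem: the peak-memory win comes from processing in chunks instead of materializing a large intermediate.

```python
import numpy as np

# Illustrative only: the functions and the computation are invented, not taken
# from the actual problem. The refactor pattern is "don't materialize the big
# intermediate; accumulate over chunks instead".

def pairwise_sq_dists_naive(x: np.ndarray) -> np.ndarray:
    # Materializes an (n, n, d) intermediate: peak memory blows up for large n.
    diff = x[:, None, :] - x[None, :, :]
    return np.einsum("ijk,ijk->ij", diff, diff)

def pairwise_sq_dists_chunked(x: np.ndarray, chunk: int = 256) -> np.ndarray:
    # Same result, but only a (chunk, n, d) slice exists at any moment.
    n = x.shape[0]
    out = np.empty((n, n), dtype=x.dtype)
    for start in range(0, n, chunk):
        stop = min(start + chunk, n)
        diff = x[start:stop, None, :] - x[None, :, :]
        out[start:stop] = np.einsum("ijk,ijk->ij", diff, diff)
    return out
```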
A thing unclear from the interaction: it had seemed towards the end that “build a profile to figure out where the bottleneck is” was one of the steps towards figuring out the problem, and that the LLM was (or might have been) better at that part. And maybe models couldn’t solve your entire problem wholesale, but there was still potential skill in identifying factorable pieces that were better fits for models.
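As a concrete version of the “build a profile” step (the tooling here is an assumption; the thread doesn’t say what was actually used), a minimal memory-profiling sketch with Python’s standard-library tracemalloc might look like this, where workload() is a placeholder for the code under investigation:

```python
import tracemalloc
import numpy as np

# Placeholder workload; substitute the actual numerical code being investigated.
def workload():
    x = np.random.rand(500, 32)
    diff = x[:, None, :] - x[None, :, :]          # large (n, n, d) intermediate
    return np.einsum("ijk,ijk->ij", diff, diff)

tracemalloc.start()
result = workload()
snapshot = tracemalloc.take_snapshot()

# Report the source lines responsible for the most memory still allocated at
# snapshot time; peak usage is available separately via get_traced_memory().
for stat in snapshot.statistics("lineno")[:5]:
    print(stat)
current, peak = tracemalloc.get_traced_memory()
print(f"peak traced memory: {peak / 1e6:.1f} MB")
```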
Interesting! Two even more interesting versions of the test:
Someone who currently gets use from LLMs for writing more memory-efficient code, though maybe this is kind of question-begging
Someone who currently gets use from LLMs, and also is pretty familiar with trying to improve the memory efficiency of their code (which maybe is Ray, idk)