Two guesses on what’s going on with your experiences:
You’re asking for code which involves uncommon mathematics/statistics. In this case, progress on scicodebench is probably relevant, and it indeed shows remarkably slow improvement. (There are many reasons for this; one relatively easy thing to try is to break down the task, forcing the model to write down the appropriate formal reasoning before coding anything. LMs are stubborn about not doing CoT for coding, even when it’s obviously appropriate IME. A sketch of this two-stage prompting appears below the list.)
You are underspecifying your tasks (and maybe your questions are more niche than average), or otherwise prompting poorly, in a way which a human could handle but models are worse at. In this case, sitting down with someone doing similar tasks but getting more use out of LMs would likely help.
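To make the “break down the task” suggestion from the first guess concrete, here is a minimal sketch of two-stage prompting: derivation first, code second. The function names and the `call_llm` placeholder are hypothetical, not any particular API.

```python
# Hypothetical sketch of "write the formal reasoning before any code" as a
# two-stage prompt. `call_llm` stands in for whatever chat API you use; it is
# not a real library call, just a placeholder with signature (prompt) -> str.

def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your actual model call here")

def solve_numerical_task(task_description: str) -> str:
    # Stage 1: force the model to state the math explicitly, no code allowed.
    reasoning = call_llm(
        "Before writing any code, write out the formal derivation needed for "
        "this task: the estimator/formula, its assumptions, and the shapes and "
        "units of every quantity. Do not write code yet.\n\n" + task_description
    )
    # Stage 2: only now ask for an implementation, conditioned on that derivation.
    code = call_llm(
        "Implement the following derivation as a function. Keep the variable "
        "names consistent with the derivation, and flag any place where the "
        "code deviates from it.\n\nDerivation:\n" + reasoning
        + "\n\nOriginal task:\n" + task_description
    )
    return code
```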
I would contribute to a bounty for y’all to do this. I would like to know whether the slow progress is prompting-induced or not.
We did end up doing a version of this test. A problem came up in the course of our work which we wanted an LLM to solve (specifically, refactoring some numerical code to be more memory efficient). We brought in Ray, and Ray eventually concluded that the LLM was indeed bad at this, and that our day-to-day problems were apparently of a harder-for-LLMs sort than the ones he typically runs into in his own day-to-day work.
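For readers who haven’t done this kind of refactor, the pattern at stake looks roughly like the following. This is an illustrative NumPy sketch with made-up names, not the actual code from our problem: the peak-memory win comes from processing in chunks instead of materializing a large intermediate.

```python
import numpy as np

# Illustrative only: the functions and the computation are invented, not taken
# from the actual problem. The refactor pattern is "don't materialize the big
# intermediate; accumulate over chunks instead".

def pairwise_sq_dists_naive(x: np.ndarray) -> np.ndarray:
    # Materializes an (n, n, d) intermediate: peak memory blows up for large n.
    diff = x[:, None, :] - x[None, :, :]
    return np.einsum("ijk,ijk->ij", diff, diff)

def pairwise_sq_dists_chunked(x: np.ndarray, chunk: int = 256) -> np.ndarray:
    # Same result, but only a (chunk, n, d) slice exists at any moment.
    n = x.shape[0]
    out = np.empty((n, n), dtype=x.dtype)
    for start in range(0, n, chunk):
        stop = min(start + chunk, n)
        diff = x[start:stop, None, :] - x[None, :, :]
        out[start:stop] = np.einsum("ijk,ijk->ij", diff, diff)
    return out
```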
A thing unclear from the interaction: it had seemed towards the end that “build a profile to figure out where the bottleneck is” was one of the steps towards figuring out the problem, and that the LLM was (or might have been) better at that part. And maybe models couldn’t solve your entire problem wholesale, but there was still potential skill in identifying factorable pieces that were better fits for models.
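As a concrete version of the “build a profile” step (the tooling here is an assumption; the thread doesn’t say what was actually used), a minimal memory-profiling sketch with Python’s standard-library tracemalloc might look like this, where workload() is a placeholder for the code under investigation:

```python
import tracemalloc
import numpy as np

# Placeholder workload; substitute the actual numerical code being investigated.
def workload():
    x = np.random.rand(500, 32)
    diff = x[:, None, :] - x[None, :, :]          # large (n, n, d) intermediate
    return np.einsum("ijk,ijk->ij", diff, diff)

tracemalloc.start()
result = workload()
snapshot = tracemalloc.take_snapshot()

# Report the source lines responsible for the most memory still allocated at
# snapshot time; peak usage is available separately via get_traced_memory().
for stat in snapshot.statistics("lineno")[:5]:
    print(stat)
current, peak = tracemalloc.get_traced_memory()
print(f"peak traced memory: {peak / 1e6:.1f} MB")
```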
Interesting! Two even more interesting versions of the test:
Someone who currently gets use from LLMs for writing more memory-efficient code, though maybe this is kind of question-begging
Someone who currently gets use from LLMs, and also is pretty familiar with trying to improve the memory efficiency of their code (which maybe is Ray, idk)