Your first two key challenges seem very similar to the agent problem of active memory switching: carrying important information across context switches.
Also note that instead of "bullshit", the explanation could be that finetuning is unreasonably effective: when you train models on an evaluation, they actually get better at the things evaluated, and that effect dominates over scaling.
So for things with public benchmarks, it might just actually be easier to make models that are genuinely good at them (for instance, by searching for data that helps 1B models learn the task, then adding that data to full-size models as a solution for data quality issues; see the sketch below).
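To be concrete about that search, here is a hedged sketch under my own assumptions: `finetune`, `run_benchmark`, and the `proxy-1b` model name are placeholders for whatever training/eval harness you have, not a real API.

```python
from typing import Callable

def select_data_sources(
    candidate_sources: list[str],
    finetune: Callable[[str, str], str],    # (base_model, data_source) -> checkpoint
    run_benchmark: Callable[[str], float],  # checkpoint -> benchmark score
    base_model: str = "proxy-1b",           # hypothetical small proxy model
    top_k: int = 5,
) -> list[str]:
    """Score each candidate data source by how much finetuning the small
    proxy model on it improves the target benchmark, and keep the best."""
    baseline = run_benchmark(base_model)
    gains = {}
    for source in candidate_sources:
        checkpoint = finetune(base_model, source)
        gains[source] = run_benchmark(checkpoint) - baseline
    # The winning sources are candidates to add to the full-size training mix.
    return sorted(gains, key=gains.get, reverse=True)[:top_k]
```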
Have you tested whether finetuning open models on your problems works? (It was my first thought, so I assume you had it too.)
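If not, here is a minimal sketch of that test, assuming the HuggingFace transformers and datasets libraries; the model name, hyperparameters, and toy example are my placeholders, not anything from your setup.

```python
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

model_name = "Qwen/Qwen2.5-0.5B"  # any small open model is fine for a first test
tokenizer = AutoTokenizer.from_pretrained(model_name)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # some causal LMs ship without one
model = AutoModelForCausalLM.from_pretrained(model_name)

# Replace with a few hundred worked instances of *your* problem.
examples = [{"text": "Q: <problem instance>\nA: <worked answer>"}]
train_ds = (Dataset.from_list(examples)
            .map(lambda b: tokenizer(b["text"], truncation=True, max_length=512),
                 remove_columns=["text"]))

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="ft-test", num_train_epochs=3,
                           per_device_train_batch_size=4, learning_rate=2e-5),
    train_dataset=train_ds,
    # mlm=False gives the causal-LM collator: labels are the input ids.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
# Afterwards, re-run your held-out evaluation and compare against the base model.
```

If even a sub-1B model moves noticeably on your eval after this, that is some evidence the tasks are learnable from modest amounts of data, which is the crux of the argument above.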