I wonder if you wouldn’t get better results by just grabbing the top system prompts for similar tasks and bolting them onto the models, maybe with a bit of manual prompt engineering, so that e.g. Gemini always sees a very explicit instruction to assume user error when something goes wrong.
So this would forgo our ability to assess how well they do autonomously and make the scenario more similar to having custom or dynamic prompting per task. That is also interesting, but aims at a different objective, I think.
The other issue is that open-ended real-world tasks are a bit unfair to give to production LLMs, on the basis that anything these LLMs could do on their own would’ve already been done by an enterprising human using the very same LLMs, but with the intent to succeed[1] rather than the intent to evaluate the model’s performance.
I’m not sure that just seeing how far they get is “unfair”, but I agree they are indeed much less likely to succeed!
So this would forgo our ability to assess how well they do autonomously and make the scenario more similar to having custom or dynamic prompting per task.
I considered this, but, in the wild, we’d expect to see LLMs using a baseline level of generic prompt engineering for the interfaces they have access to. I wouldn’t suggest per-task custom prompts, but looking at the SOTA for general scaffolding might get more true-to-life results.
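For concreteness, here is a minimal sketch of what a “baseline level of generic prompt engineering per interface” might look like, as opposed to per-task custom prompts. Everything here is illustrative: the model names, the prompt text, and the `build_messages` helper are assumptions for the sake of the example, not anything from the actual evaluation setup.

```python
# Illustrative sketch: one generic scaffold prompt per model/interface,
# prepended to every task, rather than a prompt tailored to each task.

BASELINE_SCAFFOLD = {
    # Hypothetical prompt text, chosen to echo the example in the thread.
    "gemini": "If something goes wrong, first assume user or input error "
              "and re-check before concluding the task is impossible.",
    "default": "Think step by step and verify tool output before acting on it.",
}

def build_messages(model: str, task: str) -> list[dict]:
    """Prepend the model's generic baseline prompt to the task prompt."""
    system = BASELINE_SCAFFOLD.get(model, BASELINE_SCAFFOLD["default"])
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": task},
    ]

# The same scaffold applies to every task given to that model:
msgs = build_messages("gemini", "Investigate why the deployment failed.")
```

The point of the design is that the system prompt varies only with the interface, not with the task, so autonomy on the task itself is still being measured.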
Thanks!