“List sorting does not play well with few-shot” mostly doesn’t replicate with davinci-002.
When using length-10 lists (davinci-002 crushes length-5 lists no matter the prompt), I get:
- 32-shot, no fancy prompt: ~25%
- 0-shot, fancy python prompt: ~60%
- 0-shot, no fancy prompt: ~60%
So few-shot hurts, but the fancy prompt does not seem to help. Code here.
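For concreteness, the three conditions look roughly like this (a minimal sketch, not the linked code: the exact prompt wording, the `query_model` helper, and the exact-match scoring are placeholders):

```python
import random

def make_list(length=10, low=0, high=100):
    # Random integer list of the kind being sorted (length 10 in the runs above).
    return [random.randint(low, high) for _ in range(length)]

def zero_shot_prompt(xs):
    # Plain instruction, no examples.
    return f"Sort the following list in increasing order.\nList: {xs}\nSorted:"

def fancy_python_prompt(xs):
    # "Fancy" prompt: dresses the task up as Python, but adds no information
    # a human would need to solve it.
    return f">>> sorted({xs})\n"

def few_shot_prompt(xs, k=32, length=10):
    # k solved examples followed by the query, in the same plain format as zero-shot.
    demos = []
    for _ in range(k):
        ex = make_list(length)
        demos.append(f"List: {ex}\nSorted: {sorted(ex)}")
    return "\n\n".join(demos) + f"\n\nList: {xs}\nSorted:"

def accuracy(prompt_fn, query_model, n=200, length=10):
    # Exact-match accuracy: the completion must contain the fully sorted list.
    # `query_model` stands in for whatever API call returns the completion string.
    hits = 0
    for _ in range(n):
        xs = make_list(length)
        completion = query_model(prompt_fn(xs))
        hits += str(sorted(xs)) in completion
    return hits / n
```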
I’d be interested if anyone knows of another case where a fancy prompt increases performance more than few-shot prompting does (by “fancy prompt” I mean a prompt that does not contain information a human would use to solve the task). I’m looking for counterexamples to the following conjecture: “fine-tuning on k examples beats fancy prompting, even when fancy prompting beats k-shot prompting” (for a reasonable value of k, e.g. the number of examples it would take a human to understand what is going on).
I tried again with Mixtral 8x7B (base) and did not get meaningful gaps between 0-shot, few-shot, and fancy prompts: no matter the list size, with n=100, all accuracies are within error bars of each other, with the expected trend of more shots being better if you squint. Maybe past a certain model size, models are well elicited by default on these sorts of problems (with Mixtral you need lists of length ~40 to push accuracy below 100%).
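The error bars here and below are just the usual 2-sigma binomial intervals; a minimal sketch of the computation (the example accuracies are the ones quoted above):

```python
import math

def two_sigma_interval(acc, n):
    # 2-sigma (~95%) interval for an accuracy `acc` measured over n samples,
    # using the normal approximation to the binomial proportion.
    se = math.sqrt(acc * (1 - acc) / n)
    return acc - 2 * se, acc + 2 * se

# With n=200, the bars on the davinci-002 numbers above are roughly +/- 0.06-0.07,
# so ~25% vs ~60% is a real gap, not noise.
print(two_sigma_interval(0.25, 200))  # ~(0.19, 0.31)
print(two_sigma_interval(0.60, 200))  # ~(0.53, 0.67)
```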
More detailed results on davinci-002 with 2-sigma error bars (n=200):