"List sorting does not play well with few-shot" mostly doesn't replicate with davinci-002.
When using length-10 lists (it crushes length-5 no matter the prompt), I get:
32-shot, no fancy prompt: ~25%
0-shot, fancy python prompt: ~60%
0-shot, no fancy prompt: ~60%
So few-shot hurts, but the fancy prompt does not seem to help. Code here.
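For concreteness, here is a minimal sketch (not the linked code) of how the two prompt styles and the scoring could look. The prompt wording, list range, and exact-match criterion are my assumptions, not the original setup:

```python
import random

def make_list(n=10, lo=0, hi=99):
    # Random integer list of the given length (assumed task distribution).
    return [random.randint(lo, hi) for _ in range(n)]

def fmt(xs):
    return "[" + ", ".join(str(x) for x in xs) + "]"

def zero_shot_prompt(xs, fancy=False):
    # "Fancy python prompt": dress the task up as a Python REPL call,
    # which adds no information a human would need to sort the list.
    if fancy:
        return f">>> sorted({fmt(xs)})\n"
    return f"Sort the following list in increasing order.\nList: {fmt(xs)}\nSorted:"

def few_shot_prompt(xs, k=32, n=10):
    # k solved examples followed by the query, with no task description.
    shots = []
    for _ in range(k):
        ex = make_list(n)
        shots.append(f"List: {fmt(ex)}\nSorted: {fmt(sorted(ex))}")
    shots.append(f"List: {fmt(xs)}\nSorted:")
    return "\n\n".join(shots)

def correct(completion, xs):
    # Exact match: the completion must start with the fully sorted list.
    return completion.strip().startswith(fmt(sorted(xs)))
```

Each prompt would then be sent to the model (e.g. a completions call to davinci-002 with temperature 0 and a few dozen max tokens) and scored with `correct`; accuracy is the fraction of test lists answered exactly.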
I’d be interested if anyone knows of another case where a fancy prompt increases performance more than few-shot prompting does, where a “fancy prompt” means a prompt that does not contain information a human would use to solve the task. I’m asking because I’m looking for counterexamples to the following conjecture: “fine-tuning on k examples beats fancy prompting, even when fancy prompting beats k-shot prompting” (for a reasonable value of k, e.g. the number of examples it would take a human to understand what is going on).