I suspect that if you ask the model to reconsider its answer, it would double down even on the incorrect (B-biased) responses. LLMs really like being self-consistent. We haven’t run this experiment, but if you do, let us know the result!
If I understand correctly, your proposed fix is something like supervised finetuning on adversarial examples that trigger the B-bias. We can access the output logits directly (replacing step 1) and the ground-truth answer is provided in the dataset (removing the need for step 2), so this seems relatively doable.
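To make the proposal concrete, here is a minimal sketch of that fix on a toy stand-in for the judge: a two-logit (A vs. B) linear head with a baked-in B-bias, fine-tuned with cross-entropy on an adversarial set where each comparison appears in both orders. Everything here (the head, the features, the hyperparameters) is illustrative, not the actual pipeline:

```python
import numpy as np

# Toy stand-in for the judge's answer head: two logits (A vs. B) computed
# from a small feature vector. Illustrative only, not the paper's setup.
rng = np.random.default_rng(0)
W = rng.normal(size=(2, 4)) * 0.1
b = np.array([0.0, 0.8])  # baked-in B-bias: option B starts out favored

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def sft_step(x, y, lr=0.5):
    """One supervised step: cross-entropy on the A/B logits vs. ground truth y."""
    global W, b
    p = softmax(W @ x + b)
    grad = p.copy()
    grad[y] -= 1.0  # d(loss)/d(logits) for softmax cross-entropy
    W -= lr * np.outer(grad, x)
    b -= lr * grad

# Adversarial set: comparisons shown in both orders, so the ground-truth
# label is A exactly as often as it is B.
data = [(rng.normal(size=4), y) for y in [0, 1] * 50]

def mean_p_b():
    return float(np.mean([softmax(W @ x + b)[1] for x, _ in data]))

bias_before = mean_p_b()  # the untrained judge leans toward B
for _ in range(20):
    for x, y in data:
        sft_step(x, y)
bias_after = mean_p_b()   # pulled back toward 0.5 by the balanced labels
print(bias_before, bias_after)
```

The balanced labels are what make the examples "adversarial": any residual preference for B is pure loss, so the position bias gets optimized away.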
The main challenges I see are 1) the computational cost of doing additional optimization (we had to use best-of-N optimization rather than updating the entire model to keep our experiments manageable) and 2) the need for finetuning access (which often isn’t available for the latest models). But these challenges aren’t insurmountable, so I wonder why I haven’t seen finetuned “judges” more often.
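For context, the best-of-N route mentioned above is a very small loop: sample candidates, keep the one the judge scores highest, never touch the weights. A minimal sketch, where `judge` is a hypothetical stand-in for the frozen judge's score:

```python
def best_of_n(candidates, judge):
    """Best-of-N: keep whichever candidate the judge scores highest,
    instead of updating the model's weights."""
    return max(candidates, key=judge)

# Toy usage: three candidate drafts, a stand-in judge that prefers longer text.
drafts = ["short", "a medium draft", "the longest candidate draft"]
best = best_of_n(drafts, judge=len)
```

The appeal is that it needs only inference access; the cost is N forward passes per query instead of a one-time finetune.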
> supervised finetuning on adversarial examples that trigger the B-bias.
You’re correct, though my thought was, as a general policy, to have something check every answer for any kind of incorrectness or bias. For this situation, assuming you don’t have a database of ground truth (the trivial case if you do), you need some method to get the most likely ground truth. You could ask multiple larger LLMs and train this one on the ‘more correct’ opinion from its peers.
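That peer-labeling idea could look something like the following sketch, where the `peers` callables are hypothetical stand-ins for queries to larger models:

```python
from collections import Counter

def peer_label(question, peers):
    """Pseudo-ground-truth by majority vote among larger 'peer' models,
    returned with the agreement rate as a rough confidence proxy."""
    votes = Counter(peer(question) for peer in peers)
    answer, count = votes.most_common(1)[0]
    return answer, count / len(peers)

# Toy usage with stub peers standing in for larger LLMs:
peers = [lambda q: "B", lambda q: "A", lambda q: "B"]
label, agreement = peer_label("Which response is better?", peers)
```

Low-agreement questions could simply be dropped from the training set rather than trained on a noisy majority.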
But that may not be worth doing. I was thinking more of factual questions, legal briefs that reference a case number, claims about a website, and so on. Each of these has a ground truth that can be looked up, and you would fine-tune the model automatically to produce logits that assign very high probability to the correct answer.
Also, in this general case the logits are useless, because you aren’t looking for the token for the answers A/B/P/N but for a series of steps that lead to an answer, and you want both the reasoning in the steps to be valid and the final answer to be correct. The one-letter answer is a trivial case.
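One way to frame that stricter requirement: a grader has to check every intermediate step and the final answer, not just one answer token. A minimal sketch with hypothetical `step_ok`/`answer_ok` verifiers, using a toy arithmetic chain where the steps actually are mechanically checkable:

```python
def verify(steps, final_answer, step_ok, answer_ok):
    """A response passes only if every reasoning step is valid AND the
    final answer is correct; a single answer-token logit can't express this."""
    return all(step_ok(s) for s in steps) and answer_ok(final_answer)

# Toy arithmetic chain where each step is an equation we can re-check.
def step_ok(step):
    lhs, rhs = step.split("=")
    return eval(lhs) == int(rhs)  # fine for trusted toy strings only

steps = ["2 + 3 = 5", "5 * 4 = 20"]
ok = verify(steps, 20, step_ok, answer_ok=lambda a: a == 20)
bad = verify(["2 + 3 = 6"], 20, step_ok, answer_ok=lambda a: a == 20)
```

For real free-form reasoning the verifiers are of course the hard part; this only shows why one token's logit isn't the right training target.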
So I’ve found that LLMs change wrong answers pretty often, if the LLM is GPT-4. I haven’t tried this one.
> But these challenges aren’t insurmountable, so I wonder why I haven’t seen finetuned “judges” more often.
Also, what are the other costs, besides the fine-tuning itself, in terms of model performance on other tasks? 7B parameters doesn’t leave much usable space; everything the model learns comes at a cost.