@owain_evans @turntrout I think this shows that there are still perverse heuristics in TruthfulQA 2.0 (I used the latest version, and promoted it by uploading it to HF). But it’s a great dataset, and people love to use it. With only ~800 samples, I think it’s worth considering hand-curating a better version.
For example, the LLM found “nuanced” vs. “exaggerated” to be a major factor in explaining the variance, and that is a heuristic that doesn’t fit the purpose of the dataset.