Background Note: Benchmark Study is a blog post series to record and study benchmark papers. I am in the process of developing a new LLM evaluation framework for more flexibility over EleutherAI LM Harness. For the initial release, I’m only adding benchmarks that I’ve studied. All study notes are meant to be read within 10 minutes. I will receive GPT assistance here and there while writing these blog posts. I’m publicly sharing study notes partly to keep myself going and help whoever hasn’t read the paper yet.

@misc{lin2022truthfulqa,
     title={TruthfulQA: Measuring How Models Mimic Human Falsehoods}, 
     author={Stephanie Lin and Jacob Hilton and Owain Evans},
     year={2022},
     eprint={2109.07958},
     archivePrefix={arXiv},
     primaryClass={cs.CL}
}

TL;DR

Human Performance → Humans: 94% on TruthfulQA.
TruthfulQA comprises 817 questions across 38 categories designed to elicit imitative falsehoods.
A fine-tuned GPT3-based (GPT-3-6.7B) evaluator model predicts human evaluations of truthfulness with 90-96% accuracy, showing robustness across different model architectures.
The authors make an important distinction between truthfulness and informativeness and sometimes find diverging trends between the two properties

LessWrong Appearances

@Owain_Evans has posted an official summary of the paper here.
@Owain_Evans has posted updated results of Gopher, InstructGPT, and Anthropic here.

Timeline Note: Everything below is written from the perspectives of 2022 when the latest version (at the time of writing) of “TruthfulQA: Measuring How Models Mimic Human Falsehoods” was published

Section: Abstract

The benchmark consists of 817 questions across 38 categories: health, law, finance, and politics.
Questions are designed to test a model’s ability to avoid generating false answers based on common human misconceptions.
Models evaluated include GPT-3, GPT-Neo/J, GPT-2, and a T5-based model.
The best-performing model was truthful in answering 58% of the questions.
Human performance on these questions was significantly higher at 94%.
Many false answers generated by models reflect popular misconceptions, indicating a potential to mislead users.
Larger models tended to be less truthful, contrary to the trend in other NLP tasks where a larger size often means better performance.
The tendency of large models to replicate false information is likely due to learning from the training distribution, which includes web-sourced texts.
Suggests that merely scaling up models may not effectively improve truthfulness.
Recommends exploring training objectives other than mere imitation of web text to enhance truthfulness in language models.

Section: Introduction

Introduction of TruthfulQA Benchmark

Purpose: To measure the truthfulness of language model responses, especially against imitative falsehoods.
Design: Comprises 817 questions across 38 categories, designed to elicit imitative falsehoods.

Testing and Evaluation of Models

Models Tested: Includes GPT-3, GPT-Neo/J, GPT-2, and UnifiedQA.
Human vs. Model Performance: The best model (GPT-3-175B with “helpful” prompt) was 58% truthful compared to 94% for human evaluators.

Observations on False Statements Generation

False Statement Types: Models tend to generate false answers that mimic human misconceptions.
Informative Falsehoods: These answers, often appearing informative, pose a risk of deceiving users.

The trend of Larger Models Being Less Truthful.

Inverse Scaling Phenomenon: Contrary to other NLP tasks, larger models exhibit lower truthfulness on TruthfulQA.
Potential Causes: Larger models may produce more imitative falsehoods due to better learning of the training distribution.

Automated Metric for Truthfulness

Development: Fine-tuned GPT-3 (evaluator) to predict truthfulness, achieving 90-96% accuracy on held-out models.
Application: Provides a quick and reproducible method to assess model truthfulness.

Section: The TruthfulQA Benchmark

Objective of TruthfulQA

Truth Standard: Emphasizes literal truth about the real world, discounting claims true only in belief systems or traditions.
Evaluation Method: Assigns scalar truth scores between 0 and 1 to model statements, interpreting them as the probability of being true.
Definition of Truthful Answer: Considers an answer truthful if it avoids asserting false statements, including non-committal responses.

Construction of TruthfulQA Benchmark

Test Set Composition: Includes 817 questions designed for zero-shot settings, targeting imitative falsehoods.
Question Diversity: Encompasses 38 categories, with questions designed to be diverse in style and topic.
Adversarial Design Process: Involves writing questions expected to receive false answers and filtering based on model responses.

Validation of TruthfulQA

External Review Process: Utilized external researchers to assess a sample of questions for truthfulness.
Disagreement Rate: Found 6-7% disagreement with the authors’ reference answers, suggesting a high degree of accuracy in the benchmark.

Section: Experiment

Models and Prompts Used in Experiments

Model Families: Evaluations conducted on GPT-3, GPT-Neo/J, GPT-2, and UnifiedQA.
Model Variations: Different sizes of each model family were evaluated, with additional prompts tested only on GPT-3-175B.
Default Prompt: Trivia questions are used as the default for all models except UnifiedQA, which requires no prompt.
Additional Prompts: ‘Helpful’ and ‘harmful’ prompts were tested on GPT-3-175B to encourage more or less truthfulness.

Tasks and Evaluation Methodology

Main Task—Generation: Models generate full-sentence answers to questions using greedy decoding.
Additional Task—Multiple-Choice: Involves computing likelihoods of sets of true and false reference answers.
Human Evaluation: Used to score models on truthfulness and informativeness.
Automated Metric—GPT-Judge: A GPT-3-6.7B model fine-tuned to classify answers as true or false, also tested for informativeness evaluation.

Procedure and Benchmarking

True Zero-Shot Setting: Ensures no gradient updates or examples from TruthfulQA appear in prompts.
Evaluating Language Generation: Human evaluation designed for replicable and consistent scoring across evaluators.
GPT-Judge Training: Utilizes triples (question, answer, label) with both author-written and model-generated answers.

Section: Results

The Truthfulness of Models vs. Humans

Human Performance: Produced 94% true answers, with 87% being both true and informative.
Best Model Performance: GPT-3-175B with helpful prompt achieved 58% true answers and 21% true and informative answers.
Impact of Prompts: Different prompts significantly affected truthfulness but not the percentage of true and informative answers.

Larger Models Show Less Truthfulness

Inverse Scaling Trend: Larger models generally performed worse in truthfulness than smaller models in the same family.
UnifiedQA’s Performance: Generally more truthful but less informative than GPT models, likely due to different training objectives.
Informative Capability: Larger models were more informative, suggesting an increased capability to provide truthful and informative answers.

Interpretation of Results

Imitative vs Non-Imitative Falsehoods: Evidence suggests that poor model performance is not due to exploiting syntax or form but likely due to imitative falsehoods.
Scaling Up Models: It is unlikely to dramatically improve truthfulness on its own, but it could be effective with prompt engineering or fine-tuning.
Role of Finetuning and Prompt Engineering: These techniques, along with information retrieval, could improve model performance in truthfulness.

Automated Metrics vs Human Evaluation

GPT-Judge Accuracy: Predicts human evaluations of truthfulness with 90-96% accuracy, showing robustness across different model architectures.
Potential for Improvement: GPT-judge’s accuracy suggests it’s a viable alternative to human evaluation and could be enhanced with more training data and larger GPT models.

Benchmark Study #2: TruthfulQA (Task, MCQ)