Background Note: Benchmark Study is a blog post series recording my study of benchmark papers. I am developing a new LLM evaluation framework with more flexibility than EleutherAI's LM Harness, and for the initial release I'm only adding benchmarks I've studied. Each study note is meant to be read within 10 minutes. I receive GPT assistance here and there while writing these posts. I'm sharing the notes publicly partly to keep myself going and partly to help anyone who hasn't read the paper yet.
@misc{clark2018think,
  title={Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge},
  author={Peter Clark and Isaac Cowhey and Oren Etzioni and Tushar Khot and Ashish Sabharwal and Carissa Schoenick and Oyvind Tafjord},
  year={2018},
  eprint={1803.05457},
  archivePrefix={arXiv},
  primaryClass={cs.AI}
}
Benchmark Study #4: AI2 Reasoning Challenge (Task(s), MCQ)
TL;DR
Dataset Overview: Consists of 7,787 natural science questions derived from standardized tests.
Challenge Partition: Divided into a Challenge Set (2,590 questions) and an Easy Set (5,197 questions), with the Challenge Set specifically designed to be difficult for simple algorithms.
Not trivially solvable by Information Retrieval: Scored near zero on Challenge set by design, with slightly above zero due to partial credits.
Retrieval Bias: Most algorithms rely on retrieval methods biased towards sentences similar to the question, hindering performance on questions requiring combining multiple facts.
Timeline Note: Everything below is written from the perspective of 2018, when the latest version (at the time of writing) of “Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge” was published.
Section: Abstract
ARC Release: A new set of tools for AI research in advanced question answering, challenging current AI capabilities.
ARC Question Set: Divided into a Challenge Set and an Easy Set, focusing on natural, grade-school science questions.
Challenge Set: Comprises questions incorrectly answered by both retrieval-based and word co-occurrence algorithms.
Dataset Size: The ARC dataset was, as of 2018, the largest of its kind in the public domain, containing 7,787 questions.
Baseline Testing: Several leading neural models from SQuAD and SNLI tasks were tested, but none significantly outperformed a random baseline.
ARC Corpus Release: Accompanying the challenge is a corpus of 14 million science sentences relevant to the task.
Baseline Model Implementations: Includes implementations of three neural baseline models tested on the ARC.
Community Challenge: ARC is presented as a challenge to the AI community to develop models that can outperform current baselines.
Section: Introduction
Introduction to the AI2 Reasoning Challenge (ARC)
Purpose: To push question answering beyond retrieval-style tasks, toward reasoning and deep text comprehension.
Limitation of Existing Datasets: Emphasizes that most current datasets do not encourage progress in reasoning-based questions.
Structure and Composition of the ARC Dataset
Dataset Overview: Consists of 7,787 natural science questions derived from standardized tests.
Challenge Partition: Divided into a Challenge Set (2,590 questions) and an Easy Set (5,197 questions), with the Challenge Set specifically designed to be difficult for simple algorithms.
Example Questions: Includes questions requiring advanced QA methods, not just simple retrieval or word correlation.
Additional Resources Released with ARC
ARC Corpus: A corpus of 14 million science-related sentences relevant to ARC questions.
Neural Baseline Models: Includes DecompAttn, BiDAF, and DGEM models, adaptations of neural models from SNLI and SQuAD, which perform well on the Easy Set but struggle with the Challenge Set.
Distinction from Previous Challenges
Differences from the 2016 Allen AI Science Challenge: Focuses on difficult questions, provides a science corpus, and ensures public availability of questions, corpus, and models.
Organization and Content Analysis of the Paper
Structure: Discusses related work, dataset collection and partitioning, analysis of the Challenge Set, the ARC Corpus, and baseline performance.
Baseline Performance: Highlights that while some models excel on the Easy Set, none significantly outperform a random baseline on the Challenge Set.
Section: ARC Dataset
Overview of ARC Dataset
Composition: Consists of 7,787 science questions, primarily multiple-choice, sourced from various exams.
Challenge vs Easy Set: Divided into 2,590 hard questions (Challenge Set) and 5,197 easier questions (Easy Set), based on baseline solvers’ performance.
Question Characteristics
Grade Levels: Questions range from 3rd to 9th grade, implying a target age range of roughly 8 to 15 years.
Question Complexity: Varied in length and structure, with a detailed breakdown of word and sentence counts in both Challenge and Easy Sets.
Identifying Challenge Questions
Criteria: Challenge questions are those incorrectly answered by both an Information Retrieval (IR) solver and a Pointwise Mutual Information (PMI) solver.
Example Comparisons: Illustrates with examples of how certain questions are categorized based on these solvers’ performances.
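The partition rule above can be sketched as a simple filter. This is a minimal sketch, not the paper's implementation: `partition_arc` and the stub solver are hypothetical, standing in for the IR and PMI solvers.

```python
def partition_arc(questions, solvers):
    """Split questions into Challenge and Easy sets.

    A question lands in the Challenge Set only if *every* baseline
    solver (IR and PMI in the paper) answers it incorrectly.
    `solvers` is a list of functions mapping a question to a
    predicted answer key, e.g. "A".
    """
    challenge, easy = [], []
    for q in questions:
        if all(solve(q) != q["answer_key"] for solve in solvers):
            challenge.append(q)
        else:
            easy.append(q)
    return challenge, easy

# Toy illustration with stub solvers that always guess "A":
questions = [
    {"id": 1, "answer_key": "A"},  # answered correctly -> Easy
    {"id": 2, "answer_key": "C"},  # missed by all solvers -> Challenge
]
always_a = lambda q: "A"
challenge, easy = partition_arc(questions, [always_a, always_a])
```

The "both solvers wrong" conjunction is what makes the split adversarial: a question survives into the Challenge Set only if neither surface-level method stumbles onto the answer.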
Question Types in ARC
Knowledge and Reasoning Types: Categorizes questions based on the types of knowledge (e.g., definition, basic facts) and reasoning (e.g., causal, algebraic) they require.
Category Distribution: Provides approximate proportions of different knowledge and reasoning types across the ARC Challenge Set.
Methodology for ARC Challenge Set Definition
Baseline Solvers’ Role: Explains how the IR and PMI solvers are used to filter questions for the Challenge Set.
Effectiveness of Filtering: Demonstrates the effectiveness of this approach with specific question examples and their treatment by the solvers.
Knowledge and Reasoning Styles in ARC
Classification of Question Types: Enumerates broad classes of knowledge and reasoning styles in the ARC Challenge Set.
Size Approximation of Categories: Provides a visual representation of the relative sizes of these categories, acknowledging the subjectivity in classifying question challenges.
Section: ARC Corpus
ARC Corpus
Description: A collection of 14M science-related sentences from the web, totaling 1.4GB of text.
Purpose: To provide a resource for tackling the ARC Challenge, containing mentions of the knowledge the Challenge questions require.
Creation of ARC Corpus
Methodology: Utilized a major search engine to run a series of science-relevant search queries.
Templates and Topics: Used around 100 hand-written templates for 80 science topics, generating a comprehensive set of search queries.
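The query-generation step amounts to a cross product of templates and topics. The template and topic strings below are invented examples for illustration, not the paper's actual ones:

```python
# Hypothetical hand-written templates and science topics; the paper
# crossed ~100 templates with 80 topics.
templates = [
    "what causes {topic}",
    "{topic} definition",
    "how does {topic} work",
]
topics = ["photosynthesis", "friction", "the water cycle"]

# Every template instantiated with every topic:
queries = [t.format(topic=topic) for t in templates for topic in topics]
# 3 templates x 3 topics -> 9 queries; at full scale, ~100 x 80 = 8,000.
```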
Characteristics of ARC Corpus
Content Source: Top documents from each search were collected, de-duplicated, and stripped of markup to retain only the text.
Augmentation: Includes the AristoMini corpus, with dictionary definitions, Simple Wikipedia articles, and additional web-sourced science sentences.
Coverage and Relevance
Science Relevance: Approximately 75% of the documents were judged as science-relevant based on an informal analysis.
Question Vocabulary: Covers 99.8% of the ARC question vocabulary, indicating extensive coverage of the needed knowledge domain.
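A vocabulary-coverage figure like this can be reproduced in spirit with a set intersection. The tokenizer below is a naive lowercase-letters regex, an assumption on my part rather than whatever the authors used:

```python
import re

def vocab_coverage(question_texts, corpus_texts):
    """Fraction of the question vocabulary that also appears in the corpus."""
    tokenize = lambda s: set(re.findall(r"[a-z]+", s.lower()))
    q_vocab = set().union(*(tokenize(t) for t in question_texts))
    c_vocab = set().union(*(tokenize(t) for t in corpus_texts))
    return len(q_vocab & c_vocab) / len(q_vocab)

# Toy example: 4 of the 5 question words appear in the corpus.
qs = ["Which gas do plants absorb"]
corpus = ["Plants absorb a gas called carbon dioxide", "Which one"]
coverage = vocab_coverage(qs, corpus)
```

Note that high vocabulary coverage is a weak guarantee: every word appearing somewhere in 14M sentences says nothing about whether the right words co-occur in a usable form, which is exactly the "indirect mentions" problem below.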
Challenges
Indirect Mentions: While relevant knowledge is present, it is often mentioned indirectly, posing a challenge in exploiting the corpus effectively.
Baseline Performance: Changing the corpus behind the IR solver to the ARC Corpus resulted in scores similar to random guessing, underscoring the challenges in question answering.
Utility
Informal Analysis: A sampled analysis suggests that the ARC Corpus contains knowledge relevant to about 95% of the ARC Challenge questions.
Evidence Identification: Provides distributed evidence relevant to questions, illustrating its potential as a starting point for advanced question-answering methods.
Section: Baseline / Results
Overview of Baseline Systems
Systems Tested: Included IR, PMI, Guess-all, TupleInference, DecompAttn, DGEM, DGEM-OpenIE, and BiDAF.
Scoring Rubric: A system earns 1 point for selecting the correct answer, and 1/k points for reporting a k-way tie among choices that includes the correct answer.
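The rubric can be written down directly; a k-way tie that includes the gold answer earns 1/k. A minimal sketch:

```python
def rubric_score(predicted_choices, gold):
    """ARC scoring: 1 point for the correct choice alone; if the
    system reports a k-way tie, 1/k points when the gold answer is
    among the tied choices, else 0."""
    if gold not in predicted_choices:
        return 0.0
    return 1.0 / len(predicted_choices)

# A confident correct answer, a 2-way tie containing gold, and a miss:
scores = [
    rubric_score(["B"], "B"),
    rubric_score(["A", "B"], "B"),
    rubric_score(["A", "C"], "B"),
]
```

This tie rule is why the IR and PMI solvers score slightly above zero on the Challenge Set even though the set is defined by their failures.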
IR and PMI Performance
IR (dataset definition): Scored near zero on the Challenge set by design, with scores slightly above zero due to partial credit from ties.
PMI (dataset definition): Similar performance to IR on the Challenge set, with marginally higher scores.
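To make the word co-occurrence idea concrete, here is a toy approximation of a PMI-style scorer, not a reimplementation of the paper's solver: score each answer option by the average pointwise mutual information between question and option words, estimated from co-occurrence windows. The add-one smoothing and unigram-only pairs are my simplifying assumptions.

```python
import math
from collections import Counter

def pmi_option_score(question_words, option_words, windows):
    """Average PMI between question and option unigrams, estimated
    from co-occurrence `windows` (each window is a list of words,
    standing in for the fixed-size text windows counted over a corpus).

    PMI(x, y) = log(p(x, y) / (p(x) * p(y))), with add-one smoothing
    so unseen pairs stay finite.
    """
    n = len(windows)
    unigram, pair = Counter(), Counter()
    for window in windows:
        words = set(window)  # count each word once per window
        unigram.update(words)
        pair.update((x, y) for x in words for y in words if x != y)
    scores = []
    for qx in question_words:
        for oy in option_words:
            p_xy = (pair[(qx, oy)] + 1) / (n + 1)
            p_x = (unigram[qx] + 1) / (n + 1)
            p_y = (unigram[oy] + 1) / (n + 1)
            scores.append(math.log(p_xy / (p_x * p_y)))
    return sum(scores) / len(scores)

# In a corpus where "rain" co-occurs with "evaporation" but never
# with "friction", the associated option scores higher:
windows = [["rain", "evaporation"]] * 3 + [["friction", "heat"]] * 3
good = pmi_option_score(["rain"], ["evaporation"], windows)
bad = pmi_option_score(["rain"], ["friction"], windows)
```

The sketch also illustrates the failure mode: PMI rewards surface association, so any question whose answer is not the most-associated option defeats it by construction.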
Advanced Neural Models
DecompAttn and DGEM: Adapted from neural entailment models for multiple-choice questions involving retrieval and entailment scoring.
BiDAF: Adapted from a direct answer system for SQuAD to multiple-choice QA, selecting answers based on overlap with answer spans.
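The adaptation pattern shared by these models can be sketched as a retrieve-then-score loop: for each answer option, form a hypothesis from question plus option, retrieve supporting sentences, and keep the best entailment score. The `retrieve` and `entail` functions below are hypothetical placeholders, and the word-overlap "entailment" stub is purely illustrative:

```python
def score_option(question, option, retrieve, entail, top_k=5):
    """Entailment-style multiple-choice scoring (sketch of the pattern
    behind the DecompAttn/DGEM adaptations).

    `retrieve(query, k)` -> candidate support sentences (placeholder);
    `entail(premise, hypothesis)` -> entailment score in [0, 1]
    (placeholder for a trained entailment model).
    """
    hypothesis = f"{question} {option}"
    premises = retrieve(hypothesis, top_k)
    if not premises:
        return 0.0
    # An option is as good as its single best-supported premise.
    return max(entail(p, hypothesis) for p in premises)

def answer(question, options, retrieve, entail):
    """Pick the option with the strongest entailment support."""
    return max(options, key=lambda o: score_option(question, o, retrieve, entail))

# Stub components for illustration: retrieval returns one fixed fact,
# and "entailment" is crude word overlap between premise and hypothesis.
retrieve = lambda query, k: ["plants absorb carbon dioxide from the air"]
entail = lambda premise, hyp: len(set(premise.split()) & set(hyp.split())) / len(set(hyp.split()))
pred = answer("Which gas do plants absorb?", ["carbon dioxide", "oxygen"], retrieve, entail)
```

Because the retrieval step queries with sentences similar to the question, this pipeline inherits exactly the retrieval bias noted below: facts that must be combined across multiple sentences rarely surface in the premise pool.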
Results on Challenge and Easy Sets
Challenge Set Performance: All algorithms scored close to or below the random baseline, indicating the high difficulty of the Challenge set.
Easy Set Performance: Scores generally ranged between 55% and 65%, highlighting the difference in nature and difficulty compared to the Challenge set.
Limitations and Observations
Retrieval Bias: Most algorithms rely on retrieval methods biased towards sentences similar to the question, hindering performance on questions requiring combining multiple facts.
Need for Advanced Methods: Indicates a need for more sophisticated retrieval strategies and methods for combining information from multiple facts.