Research Post: Tasks That Language Models Don’t Learn

Link post

Abstract

We argue that there are certain properties of language that our current large language models (LLMs) don’t learn. We present an empirical investigation of visual-auditory properties of language through a series of tasks, termed H-Test. This benchmark highlights a fundamental gap between human linguistic comprehension, which naturally integrates sensory experiences, and the sensory-deprived processing capabilities of LLMs. In support of our hypothesis, (1) deliberate reasoning (Chain-of-Thought), (2) few-shot examples, and (3) a stronger LLM from the same model family (LLaMA 2 13B → LLaMA 2 70B) do not trivially bring improvements in H-Test performance.

We therefore draw a particular connection to the philosophical case of Mary, who learns about the world in a sensory-deprived environment. Our experiments show that some of the strongest proprietary LLMs stay near the random-chance baseline accuracy of 50%, highlighting the limitations of knowledge acquired in the absence of sensory experience.

Key Findings on H-Test

1. Insignificant Intra-Family Improvements.

  • A stronger model in the same model family often does not bring meaningful improvement in H-Test performance.

2. Number of Examples Has Minimal Impact.

  • The number of few-shot examples given neither increases nor decreases performance significantly, strongly hinting that the LLM is simply not learning from H-Test few-shot examples (see the evaluation sketch after this list).

3. Deliberate Reasoning (CoT) Often Decreases Performance.

  • If LLMs benefited from such logical, step-by-step semantic reasoning on H-Test, it would imply that H-Test is fundamentally solvable by developing stronger language-only models. In practice, however, CoT generally decreases performance (see the evaluation sketch after this list).

4. Training with more orthography-specific language data does not improve H-Test.

  • We produced 1,000 training instances per task in H-Test and fine-tuned gpt-3.5-turbo-0613 ten times, once per task. After training for three epochs on each task, we evaluated the resulting models on H-Test at k = 50 and observed no significant performance improvement (see the fine-tuning sketch after this list).

5. Multi-modality does not automatically improve H-Test performance.

  • At the time of writing, LLaVA V1.6 34B is the strongest open-source multi-modal model available. Despite the added visual modality, we observe that simply incorporating visual data into training does not straightforwardly improve H-Test performance.

6 (Important). But H-Test is solvable; we just don’t know how.

  • In our paper, we report a seemingly unexplainable jump in H-Test performance from GPT-3.5 to GPT-4. This result is important as it shows that H-Test is indeed solvable (by a GPT-4-level system), but not through conventionally discussed language-only modeling techniques.
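
To make the evaluation setup behind findings 2 and 3 concrete, here is a minimal sketch of how a k-shot prompt with an optional chain-of-thought instruction could be assembled and scored for a binary-choice task. The palindrome items, the `build_prompt` and `accuracy` helpers, and the prompt wording are illustrative assumptions, not the actual H-Test harness or data.

```python
import random

# Hypothetical binary-choice items: the model must decide which of two candidate
# words satisfies a visual property (here, "is a palindrome"). These examples are
# illustrative placeholders, not actual H-Test data.
FEWSHOT_POOL = [
    ("Which word is a palindrome? (A) level (B) stone", "A"),
    ("Which word is a palindrome? (A) planet (B) racecar", "B"),
    ("Which word is a palindrome? (A) kayak (B) bridge", "A"),
]

def build_prompt(question: str, k: int = 2, use_cot: bool = False) -> str:
    """Assemble a k-shot prompt, optionally asking for step-by-step reasoning."""
    shots = random.sample(FEWSHOT_POOL, k=min(k, len(FEWSHOT_POOL)))
    lines = [f"Q: {q}\nA: {a}" for q, a in shots]
    instruction = (
        "Think step by step, then answer with A or B."
        if use_cot
        else "Answer with A or B only."
    )
    lines.append(f"Q: {question}\n{instruction}\nA:")
    return "\n\n".join(lines)

def accuracy(predictions: list[str], labels: list[str]) -> float:
    """Binary-choice accuracy; random guessing sits at the 0.5 baseline."""
    correct = sum(p.strip().upper().startswith(l) for p, l in zip(predictions, labels))
    return correct / len(labels)

if __name__ == "__main__":
    print(build_prompt("Which word is a palindrome? (A) rotor (B) window", k=2, use_cot=True))
```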
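
The per-task fine-tuning in finding 4 could, in principle, be launched with the OpenAI fine-tuning API roughly as sketched below (openai Python package v1.x). The training-file name and its contents are hypothetical placeholders; the paper's actual data-generation and training pipeline is not shown here.

```python
# Rough sketch of launching one per-task fine-tune with the OpenAI Python client (v1.x).
# The training file name and its chat-formatted contents are assumptions; the actual
# H-Test training data and pipeline may differ.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Upload ~1,000 chat-formatted training instances for a single H-Test task.
training_file = client.files.create(
    file=open("htest_task_palindrome_train.jsonl", "rb"),
    purpose="fine-tune",
)

# Kick off a three-epoch fine-tune on gpt-3.5-turbo-0613 for this task.
job = client.fine_tuning.jobs.create(
    model="gpt-3.5-turbo-0613",
    training_file=training_file.id,
    hyperparameters={"n_epochs": 3},
)
print(job.id)  # poll client.fine_tuning.jobs.retrieve(job.id) until the job finishes
```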

Acknowledgments and Links

I was inspired to do this research by the example cases given in @Owain_Evans's track application for the Constellation Fellowship. This research is not financially supported by a university lab; Walnut Research is an independent research organization that runs on personal funding from me and a few of my friends.

ArXiv: https://arxiv.org/abs/2402.11349

GitHub: https://github.com/brucewlee/H-Test