Dumping more of the paper’s contents in the hope that it encourages people to look at the paper in more detail:
“As GPT-4’s development continued after our experiments, one should expect different responses from the final version of GPT-4. In particular, all quantitative results should be viewed as estimates of the model’s potential, rather than definitive numbers. We repeat this caveat throughout the paper to clarify that the experience on the deployed model may differ. Moreover, we emphasize that the version we tested was text-only for inputs, but for simplicity we refer to it as GPT-4 too.”
arXiv:2303.12712v1 [cs.CL] 22 Mar 2023
1 Introduction
1.1 Our approach to studying GPT-4’s intelligence
1.2 Organization of our demonstration
2 Multimodal and interdisciplinary composition
2.1 Integrative ability
2.2 Vision
2.2.1 Image generation beyond memorization
2.2.2 Image generation following detailed instructions (à la Dall-E)
2.2.3 Possible application in sketch generation
2.3 Music
3 Coding
3.1 From instructions to code
3.1.1 Coding challenges
3.1.2 Real world scenarios
3.2 Understanding existing code
4 Mathematical abilities
4.1 A mathematical conversation with GPT-4
4.1.1 A first generalization of the original question
4.1.2 A second variant of the original question
4.1.3 Analysis of the limitations highlighted by conversation
4.2 Performance on mathematical problem datasets
4.3 Mathematical modeling in various domains
4.4 Higher level mathematics
5 Interaction with the world
5.1 Tool use
5.1.1 Using multiple tools to solve more complex tasks
5.1.2 Discussion
5.2 Embodied Interaction
5.2.1 Warmup: navigating a map
5.2.2 Text-based games
5.2.3 Real world problems
5.2.4 Discussion
6 Interaction with humans
6.1 Understanding Humans: Theory of Mind
6.1.1 Testing specific aspects of theory of mind
6.1.2 Testing theory of mind in realistic scenarios
6.1.3 Discussion
6.2 Talking to Humans: Explainability
7 Discriminative Capabilities
7.1 PII Detection
7.2 Misconceptions and Fact-Checking
7.2.1 Why Are Current Metrics Insufficient?
7.2.2 GPT-4 as a Judge
8 Limitations of autoregressive architecture highlighted by GPT-4
8.1 Warm-up with two basic examples
8.2 Lack of planning in arithmetic/reasoning problems
8.3 Lack of planning in text generation
9 Societal influences
9.1 Challenges of erroneous generations
9.2 Misinformation and manipulation
9.3 Bias
9.4 Human expertise, jobs, and economics
9.5 Constellation of influences and considerations
10 Directions and Conclusions
10.1 Definitions of intelligence, AI, and AGI
10.2 On the path to more general artificial intelligence
10.3 What is actually happening?
A GPT-4 has common sense grounding
B Appendix for multimodal and interdisciplinary composition
B.1 Further details on integrative ability results
B.2 Further details on vision results
B.3 Graphic novel design example
C Appendix for the Coding section
C.1 Measuring human performance on LeetCode
C.2 Example of GPT-4 visualizing IMDb data
C.3 More examples on visualization
C.4 Example for 2D HTML game development
C.5 Example for graphical user interface programming
C.6 Example for reverse engineering
C.7 Testing GPT-4’s ability to execute (pseudo) code
D Additional examples for mathematical reasoning
D.1 Limitations
D.2 Further examples
D.3 Generating math problems with GPT-4
D.4 Mitigating calculation errors via external code execution
E Additional Interpretability Examples
E.1 Explanation Agent Mismatches
F Additional examples for interaction with the world
F.1 Interact with tools
F.2 Examples for interaction with environments
I’m honestly stunned by this. If it was indeed trained solely on text, how does it end up with such a good idea of how Euclidean space works? That’s either stupidly impressive, or a possible hint that the set of natural abstractions is even smaller and a bigger attractor in algorithm space than I thought. The labyrinth seems explicable, but the graphics?
Could a born blind human do this?
With enough training, sure. There are such things as born blind human painters.
Thanks, I did not know this. A quick search for his images seems to show that they get colour and perspective right at least as well as this does, provided this is fully real and there’s nobody else in his process choosing colours and such. Tentatively marking this down as a win for natural abstraction.
There’s a fuckton of descriptions of images in text, I guess.
And it’s consumed trillions of tokens.
It’s not just blind. It essentially has no senses whatsoever. It seems to have extrapolated “sense” from text data.