Max H comments on ARC Evals new report: Evaluating Language-Model Agents on Realistic Autonomous Tasks

Max H 18 Aug 2023 3:34 UTC
LW: 4 AF: 1
0
AF
Could the methods here be used to evaluate humans as well as LLMs? That might provide an interesting way to compare and quantify LLM capabilities relative to human intelligence.
In other words: instead of an LLM generating the completions returned by the API in figure 2, what if it were a human programmer receiving the prompts and returning a response, while holding the rest of the setup and scaffolding constant?
Would they be able to complete all the tasks, and how long would it take? How much does it matter if they have access to reference material, the internet, or other tools that they can use when generating a response?
Note that the setup here seems pretty favorable to LLMs: the scaffolding and interaction model make it natural for the LLM to interact with various APIs and tools, but usually not in the way that a human would (e.g. interfacing with the web using a text-based browser by specifying element IDs). However, I suspect that an average human programmer could still complete most or all of the tasks under these conditions, given enough time.
And if that is the case, I would say that’s a pretty good way of demonstrating that current LLMs are still far below human-level in an important sense, even if there are certain tasks where they can already outperform humans (e.g. summarizing / generating / transforming certain kinds of prose extremely quickly). Conversely, if someone can come up with a bunch of real-world tasks like this that current or future LLMs can complete but a human can’t (in reasonable amounts of time), that would be a pretty good demonstration that LLMs are starting to achieve or exceed “human-level” intelligence in ways that matter.
I’m interested in these questions mainly because there are many alignment proposals and plans which rely on “human-level” AI in some form, without specifying exactly what that means. My own view is that human-level intelligence is inherently unsafe, and also too wide of a target to be useful as a concept in alignment plans. But having a more quantitative and objective definition of “human-level” that allows for straightforward and meaningful comparisons with actual current and future AI systems seems like it would be very useful in governance and policy discussions more broadly.