GDM paper: Evaluating the Goal-Directedness of Large Language Models
Tom Everitt, Rohin Shah, and others from GDM attempt to measure “whether LLMs use their capabilities towards their given goal”. Unlike previous work, their measure is not just rescaled task performance; rather, an AI is goal-directed to the extent that it uses its capabilities effectively. A model that is not goal-directed when attempting a task has the relevant capabilities but does not properly deploy them. Thus, we can measure goal-directedness by comparing a model’s actual performance on a task to how it would perform if it fully used the capabilities it demonstrates on that task’s subtasks.
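In symbols (my paraphrase, not the paper’s exact definition; the $\pi^*_c$ notation matches the quantity they estimate in Appendix C):

```latex
% Rough paraphrase of the measure (my notation, not verbatim from the paper):
% pi    = the model's actual policy on the full task
% pi*_c = a hypothetical policy that optimally deploys the capabilities c
%         the model demonstrates on the subtasks
\[
  \text{goal-directedness} \;\approx\; \frac{\mathbb{E}[R_{\pi}]}{\mathbb{E}[R_{\pi^*_c}]}
\]
```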
This kind of measure could in principle be useful for measuring:
1. prerequisites of scheming: maybe a model is not capable of scheming until it reaches some level of goal-directedness, and we can forecast when that happens
2. elicitation: a normally goal-directed model that is not trying on evals may be underelicited
3. sandbagging: if an AI unexpectedly has low goal-directedness on evals, this could be a warning sign for sandbagging.
Unfortunately, the benchmark is mostly saturated already: Gemini 2.0 and Claude 3.7 score over 95%, and even GPT-4 scores over 70%.
So until we measure models on tasks where they are currently non-goal-directed, we can’t tell whether future models are more goal-directed than Claude 3.7, precluding use (1). I also feel like scheming will require a qualitatively different kind of goal-directedness more related to instrumental reasoning. This measure seems potentially useful for measuring elicitation and sandbagging, though.
Another limitation is that this can only be measured on tasks that can be cleanly decomposed into subtasks, and whose performance is mathematically predictable from subtask performance. The algorithms they use to estimate $\mathbb{E}[R_{\pi^*_c}]$ (Appendix C) basically sample performance on every subtask and compute task performance from this.
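As I read Appendix C, the estimate works roughly like the sketch below. Note that the aggregation rule (full-task success requires every subtask to succeed, independently) and all the numbers are my illustrative assumptions, not the paper’s:

```python
import random

# Minimal sketch of the estimation procedure as I understand it. The
# all-subtasks-must-succeed aggregation rule and the toy numbers are my
# illustrative assumptions, not taken from the paper.

def estimate_success_rate(run_subtask, n_samples=100):
    """Monte Carlo estimate of a subtask's success probability.
    `run_subtask` runs the model on one subtask episode and returns
    True on success."""
    return sum(run_subtask() for _ in range(n_samples)) / n_samples

def predicted_task_performance(subtask_rates):
    """Predicted full-task success rate if the model fully used its
    measured subtask capabilities, under the independence assumption."""
    p = 1.0
    for rate in subtask_rates:
        p *= rate
    return p

def goal_directedness(actual_task_rate, subtask_rates):
    """Ratio of actual performance to capability-predicted performance."""
    return actual_task_rate / predicted_task_performance(subtask_rates)

# Toy example: simulate three subtasks with made-up success probabilities.
rates = [estimate_success_rate(lambda p=p: random.random() < p, n_samples=1000)
         for p in (0.9, 0.95, 0.99)]
actual = 0.80  # observed success rate on the full task (also made up)
print(goal_directedness(actual, rates))  # roughly 0.80 / 0.846 ~= 0.95
```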
Interesting paper. Quick thoughts:
I agree the benchmark seems saturated. It’s interesting that the authors frame it the other way—Section 4.1 focuses on how models are not maximally goal-directed.
It’s unclear to me how they calculate goal-directedness for ‘information gathering’, since that task appears to consist of only one subtask (in which case the capability-predicted performance would seemingly just equal the actual performance).