I agree the benchmark seems saturated. It’s interesting that the authors frame it the other way—Section 4.1 focuses on how models are not maximally goal-directed.
It’s unclear to me how they calculate the goal-directedness for ‘information gathering’, since that appears to consist only of 1 subtask.
Interesting paper. Quick thoughts:
I agree the benchmark seems saturated. It’s interesting that the authors frame it the other way—Section 4.1 focuses on how models are not maximally goal-directed.
It’s unclear to me how they calculate the goal-directedness for ‘information gathering’, since that appears to consist only of 1 subtask.