Does this cash out into concrete predictions about tasks you expect LLMs to make little progress on in the future?
A very literal eval suggested by your post: take two maps or images with a similar style but different global structure, cut them into small square pieces, and ask a model to partition the combined pieces back into two coherent wholes. I expect LLMs to be really bad at this task right now, but they’re very bad at vision in general, so “true understanding” isn’t really the bottleneck IMO.
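For concreteness, here’s roughly how I’d imagine generating the image version; the tile size, file names, and the use of Pillow are all just placeholder choices on my part, not anything from your post:

```python
# Minimal sketch of constructing the tile-partition eval, assuming two
# same-sized source images (the paths below are hypothetical) and Pillow.
# The model would be shown the shuffled, unlabeled tiles and asked to
# partition them back into the two original images.
import random
from PIL import Image

def make_tile_pool(path_a, path_b, tile_size=64, seed=0):
    tiles = []
    for label, path in (("A", path_a), ("B", path_b)):
        img = Image.open(path)
        w, h = img.size
        for y in range(0, h - h % tile_size, tile_size):
            for x in range(0, w - w % tile_size, tile_size):
                tile = img.crop((x, y, x + tile_size, y + tile_size))
                tiles.append((label, tile))  # keep ground-truth label for scoring
    random.Random(seed).shuffle(tiles)
    return tiles  # present only the tiles (not the labels) to the model

# pool = make_tile_pool("map_a.png", "map_b.png")  # hypothetical inputs
```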
But one could do a similar test on text-based data; e.g., ask a model to reconstruct two math proofs that share variable names from an unordered list of the individual sentences of both proofs. Is this the kind of thing you expect models to make unusually little progress on relative to other tasks of similar time horizon? (I might be down to bet on something like this, though I think it’ll be tricky to operationalize it crisply enough.)
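For the text version, here’s a rough sketch of the simplest operationalization I can think of; representing each proof as a list of sentences and scoring by label accuracy (up to swapping the two anonymous labels) are my assumptions, and the weak point is clearly how to split real proofs into sentences cleanly:

```python
# Sketch of the proof-reconstruction eval: shuffle the sentences of two
# proofs together, ask the model which proof each sentence belongs to,
# and score the resulting partition.
import random

def build_task(proof_a_sentences, proof_b_sentences, seed=0):
    items = [(s, "A") for s in proof_a_sentences] + [(s, "B") for s in proof_b_sentences]
    random.Random(seed).shuffle(items)
    shuffled = [s for s, _ in items]      # shown to the model
    answer_key = [lab for _, lab in items]  # held out for scoring
    return shuffled, answer_key

def score(predicted_labels, answer_key):
    # The two proofs are unnamed, so accept either assignment of "A"/"B"
    # (assumes the model's predictions use only those two labels).
    direct = sum(p == a for p, a in zip(predicted_labels, answer_key))
    flipped = len(answer_key) - direct
    return max(direct, flipped) / len(answer_key)
```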