It’s totally possible to do ecological evaluation with large LMs. (Indeed, lots of people are doing it.) For example, you can:
Take an RL environment with some text in it, and make an agent that uses the LM as its “text understanding module.”
If the LM has a capability, and that capability is helpful for the task, the agent will learn to elicit it from the LM as needed (a minimal sketch of this setup follows the list). See e.g. this paper.
Just do supervised learning on a capability you want to probe (also sketched below).
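To make the RL version concrete, here is a minimal sketch under stated assumptions: a one-step text bandit whose agent's only access to the text is through a frozen LM used as a pooled-embedding module, trained with REINFORCE. The model choice (GPT-2), the toy instruction-following task, and the linear policy head are my illustrative assumptions, not anything from the paper linked above.

```python
# Sketch: RL agent using a frozen LM as its "text understanding module".
# The task, model, and policy head are illustrative assumptions.
import random
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModel.from_pretrained("gpt2")
lm.eval()  # the LM stays frozen; it is only a text-understanding module

# Toy one-step environment: the instruction names the rewarded action.
EPISODES = [("press the red button", 0), ("press the blue button", 1)]

def embed(text):
    # Mean-pool the final hidden states as a crude text representation.
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        hidden = lm(**inputs).last_hidden_state  # (1, seq_len, dim)
    return hidden.mean(dim=1).squeeze(0)

policy = torch.nn.Linear(lm.config.hidden_size, 2)  # 2 actions
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-2)

# REINFORCE: if the LM's representation exposes the relevant capability,
# the policy gradient can "pick up the money on the table" by using it.
for step in range(500):
    text, correct_action = random.choice(EPISODES)
    dist = torch.distributions.Categorical(logits=policy(embed(text)))
    action = dist.sample()
    reward = 1.0 if action.item() == correct_action else 0.0
    loss = -dist.log_prob(action) * reward
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```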
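And a similarly hedged sketch of the supervised variant: train a small linear probe on frozen LM representations for a capability of interest. Again, the model, the toy "past tense" task, and the probe architecture are assumptions for illustration only.

```python
# Sketch: supervised probing of a frozen LM for a capability.
# The model, task, and probe are illustrative assumptions.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModel.from_pretrained("gpt2")
lm.eval()  # the LM stays frozen; only the probe is trained

# Toy "capability" task: does the sentence describe a past event?
texts = ["She walked home.", "She walks home.",
         "They ate dinner.", "They eat dinner."]
labels = torch.tensor([1, 0, 1, 0])

def embed(text):
    # Mean-pool the final hidden states as a crude sentence representation.
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        hidden = lm(**inputs).last_hidden_state  # (1, seq_len, dim)
    return hidden.mean(dim=1).squeeze(0)

features = torch.stack([embed(t) for t in texts])

probe = torch.nn.Linear(features.shape[1], 2)
optimizer = torch.optim.Adam(probe.parameters(), lr=1e-3)
loss_fn = torch.nn.CrossEntropyLoss()

for step in range(200):
    optimizer.zero_grad()
    loss = loss_fn(probe(features), labels)
    loss.backward()
    optimizer.step()
```

In both sketches the logic matches the lower-bound framing below: success is evidence the capability is accessible, while failure could just mean the probe or policy gradient failed to find it.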
I don’t understand why you think this would actually give you an ecological evaluation. It seems very plausible to me for an LM to have a capability and yet for the RL/SL/etc. technique that you’re using to not be able to effectively access that capability. Do you think that’s not very plausible? Or do you just think that this is a less-biased underestimate when compared with other evaluation techniques?
It will still only provide a lower bound, yes, but only in the trivial sense that presence is easier to demonstrate than absence.
All experiments that try to assess a capability suffer from this type of directional error, even the prototypical case of “giving someone a free-response math test”:
They know the material, yet they fail the test: easy to imagine (say, if they are preoccupied by some unexpected life event)
They don’t know the material, yet they ace the test: requires an astronomically unlikely coincidence
The distinction I mean to draw is not that there is no directional error, but that the RL/SL tasks have the right incentive structure: there is an optimization procedure that is “leaving money on the table” if the capability is present yet goes unused.