It’s totally possible to do ecological evaluation with large LMs. (Indeed, lots of people are doing it.) For example, you can:
Take an RL environment with some text in it, and make an agent that uses the LM as its “text understanding module.”
If the LM has a capability, and that capability is helpful for the task, the agent will learn to elicit it from the LM as needed (a minimal sketch of this setup follows the list). See e.g. this paper.
Just do supervised learning on a capability you want to probe (also sketched below).
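To make the RL version concrete, here is a minimal sketch under stated assumptions: a one-step text bandit whose agent's only access to the text is through a frozen LM used as a pooled-embedding module, trained with REINFORCE. The model choice (GPT-2), the toy instruction-following task, and the linear policy head are my illustrative assumptions, not anything from the paper linked above.

```python
# Sketch: RL agent using a frozen LM as its "text understanding module".
# The task, model, and policy head are illustrative assumptions.
import random
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModel.from_pretrained("gpt2")
lm.eval()  # the LM stays frozen; it is only a text-understanding module

# Toy one-step environment: the instruction names the rewarded action.
EPISODES = [("press the red button", 0), ("press the blue button", 1)]

def embed(text):
    # Mean-pool the final hidden states as a crude text representation.
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        hidden = lm(**inputs).last_hidden_state  # (1, seq_len, dim)
    return hidden.mean(dim=1).squeeze(0)

policy = torch.nn.Linear(lm.config.hidden_size, 2)  # 2 actions
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-2)

# REINFORCE: if the LM's representation exposes the relevant capability,
# the policy gradient can "pick up the money on the table" by using it.
for step in range(500):
    text, correct_action = random.choice(EPISODES)
    dist = torch.distributions.Categorical(logits=policy(embed(text)))
    action = dist.sample()
    reward = 1.0 if action.item() == correct_action else 0.0
    loss = -dist.log_prob(action) * reward
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```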
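And a similarly hedged sketch of the supervised variant: train a small linear probe on frozen LM representations for a capability of interest. Again, the model, the toy "past tense" task, and the probe architecture are assumptions for illustration only.

```python
# Sketch: supervised probing of a frozen LM for a capability.
# The model, task, and probe are illustrative assumptions.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModel.from_pretrained("gpt2")
lm.eval()  # the LM stays frozen; only the probe is trained

# Toy "capability" task: does the sentence describe a past event?
texts = ["She walked home.", "She walks home.",
         "They ate dinner.", "They eat dinner."]
labels = torch.tensor([1, 0, 1, 0])

def embed(text):
    # Mean-pool the final hidden states as a crude sentence representation.
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        hidden = lm(**inputs).last_hidden_state  # (1, seq_len, dim)
    return hidden.mean(dim=1).squeeze(0)

features = torch.stack([embed(t) for t in texts])

probe = torch.nn.Linear(features.shape[1], 2)
optimizer = torch.optim.Adam(probe.parameters(), lr=1e-3)
loss_fn = torch.nn.CrossEntropyLoss()

for step in range(200):
    optimizer.zero_grad()
    loss = loss_fn(probe(features), labels)
    loss.backward()
    optimizer.step()
```

In both sketches the logic matches the lower-bound framing below: success is evidence the capability is accessible, while failure could just mean the probe or policy gradient failed to find it.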
I don’t understand why you think this would actually give you an ecological evaluation. It seems very plausible to me for an LM to have a capability and yet for the RL/SL/etc. technique that you’re using to not be able to effectively access that capability. Do you think that’s not very plausible? Or do you just think that this is a less-biased underestimate when compared with other evaluation techniques?
It will still only provide a lower bound, yes, but only in the trivial sense that presence is easier to demonstrate than absence.
All experiments that try to assess a capability suffer from this type of directional error, even the prototypical case of “giving someone a free-response math test”:
They know the material, yet they fail the test: easy to imagine (say, if they are preoccupied by some unexpected life event)
They don’t know the material, yet they ace the test: requires an astronomically unlikely coincidence
The distinction I mean to draw is not that there is no directional error, but that the RL/SL tasks have the right incentive structure: there is an optimization procedure that is “leaving money on the table” if the capability is present yet goes unused.