A TypeScript app is within-distribution! AI research in the existing body of research is within-distribution, and companies are paying millions to build RL environments to make them *specifically* good at some of those things!
From this, I infer that “in distribution” in this context basically means “sufficiently similar to a task which the LLM has explicitly encountered/been trained on”.
I find myself wondering: if we had some magical way of quantifying the percent similarity between two tasks, how surprised would you be if one of today’s LLMs completed a task that was 99% similar to one it had explicitly been trained on? How about 80% similar? Or 50%? These are basically nonsense questions, since I’ve just picked out some magical metric whose specifications neither of us knows. But what I’m trying to get at, qualitatively, is what counts as “sufficiently similar”. How does your expectation of LLM capability vary as a function of similarity to tasks that the model has already encountered or been trained on (and also as a function of what the task is about)? How do you model this expectation varying with LLM size, training time, context window size, and so on? Based on how the above post struck me, you basically treat “in/out of distribution” as a binary characteristic of a task, or at most a very coarse gradient, which seems needlessly low-fidelity.