Number of distinct tasks (with verification etc.). The term “environment” is often used for this, but the way I’m using it may be confusing: I’d count each SWE-bench task as a distinct environment, even though they all share the same basic setup and differ only in objective and test cases. That said, for an environment/task to be decently useful, it needs to be sufficiently distinct from other tasks, so you could hit diminishing returns. For example, even though it’s trivial for OpenAI to generate grade-school math problems, the marginal problem is worthless for RL.