I have followed “X plays Pokemon” and the AI Village because those task spaces are about trying to get things done over a longer time horizon.
But it seems like “minimal scaffolding/harness and no human help” is not the right way to think about model capabilities in the ways that count. We should expect a better-than-SOTA harness and, for work-assistance systems, lots of human help. We shouldn’t care what an LLM alone can do if it won’t be alone in deployment.
Should we make a “skill” file for the AI to play Pokemon?
Hmmm… on one hand, this feels like cheating, depending on how much detail we provide. In the extreme, we could give the AI an entire sequence of moves to execute in order to complete the game. That would definitely be cheating. The advice should be more generic. But how generic is generic enough? Is it okay to leave reminders such as “if there is a skill you need to overcome an obstacle, and if getting that skill requires you to do something, maybe prioritize doing that thing”, or is that already too specific?
(Intuitively, perhaps advice is generic enough if it can be used to solve multiple different games? Unless it is just a union of very specific advice for all the games in the test set, of course.)
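To make the specificity spectrum concrete, here is a hypothetical sketch in Python; the tier names and every example string are invented for illustration and are not taken from any real benchmark or skill-file format:

```python
# Hypothetical hint tiers for a game-playing agent, ordered from
# clearly-fair generic heuristics toward clearly-cheating specifics.
# Purely illustrative; not any real harness's skill-file schema.
HINT_TIERS = {
    # Tier 1: heuristics that transfer across many games.
    "generic": [
        "If an obstacle blocks progress, look for an ability or item that removes it.",
        "Keep a list of goals you could not finish yet and revisit them later.",
    ],
    # Tier 2: advice about the genre, but not this particular game.
    "genre": [
        "In monster-collecting RPGs, badges and key items often gate later areas.",
    ],
    # Tier 3: facts about this specific game -- arguably already cheating.
    "specific": [
        "Get HM01 Cut on the S.S. Anne before attempting the Vermilion Gym.",
    ],
    # Tier 4 (a full button-by-button walkthrough) is omitted: by the
    # argument above, that would unambiguously be cheating.
}
```

The open question is where between tier 1 and tier 3 the line sits, and whether the “works across multiple games” test actually draws it.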
On the other hand, the situation in deployment would be that we want the AI to solve the problem and we do whatever is necessary to help it. I mean, if someone told you “make Claude solve Pokemon in 2 days or I will kill you” and didn’t specify any conditions, you would cheat as hard as you could, like uploading complete walkthroughs, etc. So perhaps solving a problem that we humans have already solved is not suitable for a realistic challenge.
I understand the concern, but when we test human skills (LSATs, job interviews, driver’s exams), we do it with very little help, even though being a lawyer, or holding the average job, is a situation where you will have plenty of teammates and should use as much assistance as possible.
I see roughly where you’re going with this, but I’m not sure the analogy goes through. Not including a good agentic harness may be more like testing them with parts of their brain inactive.
I think this deserves a lot more thought. I may write a post about it.