Yeah, just like how we are teaching LLMs to do math and coding by doing reinforcement learning on those tasks, it seems like we could just do a ton of RL on assorted videogames (and other agentic tasks, like booking a restaurant reservation online) to create reasoning-style models that are better at making and sticking to a plan.
In addition to the literal reinforcement learning and gradient descent used for training AI models, there is also the more metaphorical gradient descent process that happens when hundreds of researchers all start tinkering with different scaffolding ideas, training concepts, etc., in the hopes of optimizing a new benchmark. Now that “speedrun Pokemon Red” has been identified as a plausible benchmark for agency, I expect lots of engineering talent is already thinking about ways to improve performance. With so much effort going towards solving the problem, I wouldn’t be surprised to see the Pokemon “benchmark” get “saturated” pretty soon (via performances that exceed most normal humans and start to approach speedrunner efficiency), even though right now Claude 3.7 is hopelessly underperforming normal humans.