Love the game setups. The less augmented the merrier. Unlike benchmarks and CTFs, they really take the learned skills for a walk. “X plays Pokémon” made the other leaderboards obsolete for me.
I was preparing to do a “Claude plays Universal Paperclips” stream of my own and ran into some of the same problems too.
Cookie Clicker, where you accrue currency primarily by sitting and waiting, and then spend your currency on upgrades that get you more of it. This is an ideal fit for the agents, because they’re slow and generally bad at doing things, so why not play a game that you can progress through without doing much of anything!
It is nowhere near “ideal”! Despite the name, idlers require tons of micromanagement to perform well, and progress regularly grinds to a halt.
Progress Knight in particular is one of the faster-paced ones. Even with auto-promote and auto-learn enabled, the agent has to switch between tasks rapidly, and the game severely punishes sitting idle. Try to play optimally and you have to glue yourself to the screen or watch the setup collapse.
In effect, the “idlers” burn through tokens like there’s no tomorrow if you want a performance that’s more interesting than watching paint dry.