peterr comments on Daniel Kokotajlo’s Shortform

peterr 8 Dec 2025 22:51 UTC
4 points
1
Looking at step counts instead of time is interesting. Claude Opus 4.5 is currently at ~44,500 steps in Silph Co., where it has been stuck for several days, so that figure should now be about 50% higher. The others look roughly right for Opus. It beat Mt. Moon in around 5 hours and was stuck at the Rocket Hideout for days.

I think the Gemini 3 Pro vs 2.5 Pro matchup in Pokemon Crystal was interesting. Gemini 3 cleared the game in ~424.5 hours last night, while 2.5 only had 4/16 badges at 435 hours.
- Daniel Kokotajlo 8 Dec 2025 23:17 UTC
  3 points
  0
  Did 3 Pro and 2.5 Pro have the same harness? I assume so… 2.5 was released at the end of March, 3 was released in late November. So an 8-month gap. So Pokemon ‘speed’ improved by 4x in 8 months. Meanwhile for Claude, 3.7 Sonnet was released in February, so 9 months to get to today’s Opus 4.5. Seems harder to make the comparison for Claude because 3.7 Sonnet seems like it was basically stuck and would essentially never finish. If we focus on the part before it got stuck though, e.g. defeating Surge, it seems like Opus 4.5 is a bit more than 4x faster? But then Opus 4.5 got stuck later in Rocket Hideout… but eventually made it through… yeah, hard to compare. But it’s interesting that they both are doing something like a 4x improvement in speed over a similar period of time.
  - peterr 8 Dec 2025 23:38 UTC
    4 points
    0
    Yes, they both started with the same harness, but there’s room for each model to customize its own setup, so I’m not sure how much they might have diverged over time. I have the 4x speedup as probably an upper bound, but I was only counting since the final 2.5 stable release in June, which might be too short. Gemini 2.5 has 6 badges now, up from yesterday, so it’s probably too early to assume 4x is certain. But if it was 4x every 8 months, then it should be able to match average human playtime by early 2027.
    
    From the Gemini_Plays_Pokemon Twitch page:
    
    “v2 centers on a smaller, flexible toolset (Notepad, Map Markers, code execution, on‑the‑fly custom agents) so Gemini can build exactly what it needs when it needs it.”
    
    “The AI has access to a set of built-in tools to interact with the game and its own internal state:
    notepad_edit: Modifies an internal notepad, allowing the AI to write down strategies, discoveries, and long-term plans.
    run_code: Executes Python code in a secure, sandboxed environment for complex calculations or logic that is difficult to perform with reasoning alone.
    define_map_marker / delete_map_marker: Adds or removes markers on the internal map to remember important locations like defeated trainers, item locations, or puzzle elements.
    stun_npc: Temporarily freezes an NPC in place, which is useful for interacting with them.
    select_battle_option: Provides a structured way to choose actions during a battle, such as selecting a move or using an item.
    Custom Tools & Agents
    The most powerful feature of the system is its ability to self-improve by creating its own tools and specialized agents. You can view the live Notepad and custom tools/agents tracker on GitHub.
    Custom Tools (define_tool / delete_tool): If the AI identifies a repetitive or complex data-processing task, it can write and save its own Python scripts as new tools. For example, instead of relying on a pre-built pathfinder, it can write its own pathfinding tool from scratch to navigate complex areas like the spinner mazes in the Team Rocket Hideout.
    Custom Agents (define_agent / delete_agent): For complex reasoning tasks, the AI can define new, specialized instances of itself without any distracting context. These agents are given a unique system prompt and purpose, allowing them to excel at specific challenges like solving puzzles or developing high-level battle strategies.”
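
For anyone who wants to poke at the "4x every 8 months" extrapolation from this thread, here is a rough sketch of the arithmetic. The 424.5-hour figure is from the comments above. The human playtime is NOT given anywhere in the thread, so `ASSUMED_HUMAN_HOURS` is a placeholder to swap out, and the function name is made up for illustration. It also assumes the speedup compounds smoothly, which the thread itself treats as uncertain.

```python
import math


def months_until_target(current_hours, target_hours, speedup=4.0, period_months=8.0):
    """Months until completion time drops to target_hours, assuming
    completion time shrinks by a factor of `speedup` every `period_months`."""
    if current_hours <= 0 or target_hours <= 0:
        raise ValueError("hours must be positive")
    if speedup <= 1 or period_months <= 0:
        raise ValueError("speedup must exceed 1 and period must be positive")
    if current_hours <= target_hours:
        return 0.0  # already at or below the target
    # Number of compounding periods needed: speedup ** periods = current / target
    periods = math.log(current_hours / target_hours) / math.log(speedup)
    return periods * period_months


GEMINI_3_HOURS = 424.5      # Gemini 3 Pro's Crystal clear time, from the thread
ASSUMED_HUMAN_HOURS = 40.0  # ASSUMPTION: placeholder, not a figure from the thread

months = months_until_target(GEMINI_3_HOURS, ASSUMED_HUMAN_HOURS)
print(f"{months:.1f} months until the assumed human playtime is matched")
```

With the 40-hour placeholder this comes out between 13 and 14 months after the 424.5-hour run, which is in line with the "early 2027" guess. A different human baseline or a slower speedup moves that date noticeably.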