Interested in AI forecasting, scenario analysis, and philosophy. Would like to do these fulltime.
peterr
What about negative effects on the symbiotic microbiome?
What if it’s not worth seeking power? What if the world isn’t worth taking over? Saints seem to devote their lives to teaching that it isn’t worth getting caught up in ambitions and desires to control.
At first I was really surprised by this because it seemed weird, but I find myself wondering if it’s actually quite similar to an analogous form of behavior in humans: stereotyping. The model jumps to the most “obvious” looking conclusion based on its associations without necessarily reflecting on what it’s doing or why. This makes me wonder if building in such loops with guidance on how to think about its own training could mitigate these effects.
GPT-5.1 beating crystal in 108 hours is very interesting. I wonder why that’s the case compared to Gemini 3 Pro, which took ~424.5 hours. Do you have any thoughts?
Yes, they both started with the same harness but there’s room for each model to customize its own setup so I’m not sure how much they might have diverged over time. I have 4x speedup as probably an upper bound but I was only counting since the final 2.5 stable release in June, which might be too short. Gemini 2.5 has 6 badges now compared to yesterday, so it’s probably too early to assume 4x is certain. But if it was 4x every 8 months then it should be able to match average human playtime by early 2027.
From the Gemini_Plays_Pokemon—Twitch:
“v2 centers on a smaller, flexible toolset (Notepad, Map Markers, code execution, on‑the‑fly custom agents) so Gemini can build exactly what it needs when it needs it.”
″The AI has access to a set of built-in tools to interact with the game and its own internal state:notepad_edit: Modifies an internal notepad, allowing the AI to write down strategies, discoveries, and long-term plans.
run_code: Executes Python code in a secure, sandboxed environment for complex calculations or logic that is difficult to perform with reasoning alone.
define_map_marker / delete_map_marker: Adds or removes markers on the internal map to remember important locations like defeated trainers, item locations, or puzzle elements.
stun_npc: Temporarily freezes an NPC in place, which is useful for interacting with them.
select_battle_option: Provides a structured way to choose actions during a battle, such as selecting a move or using an item.
Custom Tools & Agents
The most powerful feature of the system is its ability to self-improve by creating its own tools and specialized agents. You can view the live Notepad and custom tools/agents tracker on GitHub.
Custom Tools (define_tool / delete_tool): If the AI identifies a repetitive or complex data-processing task, it can write and save its own Python scripts as new tools. For example, instead of relying on a pre-built pathfinder, it can write its own pathfinding tool from scratch to navigate complex areas like the spinner mazes in the Team Rocket Hideout.
Custom Agents (define_agent / delete_agent): For complex reasoning tasks, the AI can define new, specialized instances of itself without any distracting context. These agents are given a unique system prompt and purpose, allowing them to excel at specific challenges like solving puzzles or developing high-level battle
strategies.”
Looking at the step count comparisons instead of time is interesting. Claude Opus 4.5 is currently at ~44,500 steps in Silph co., where it has been stuck for several days. So that should now be about 50% higher. The others look roughly right for Opus. It beat Mt. Moon in around 5 hours and was stuck at the Rocket Hideout for days.
I think the Gemini 3 Pro vs 2.5 Pro matchup in Pokemon Crystal was interesting. Gemini 3 cleared the game in ~424.5 hours last night while 2.5 only had 4⁄16 badges at 435 hours.
This is a really valuable post that clarifies some things I’ve found hard to articulate to people on each side. I think it’s difficult for people to balance when to use each of these epistemic frames without getting too sucked into one. And I imagine most people use these to different degrees at different times even if they may not realize it or one is rarer for them.
Looking forward to what you write next!
Something similar I’ve been thinking about is putting models in environments with misalignment “temptations” like an easy reward hack and training them to recognize what this type of payoff pattern looks like (e.g. easy win but sacrifice principle) and NOT take it. Recent work shows some promising efforts getting LLMs to explain their reasoning, introspect, and so forth. I think this could be interesting to do some experiments with and am trying to write up my thoughts on why this might be useful and maybe what those could look like.
Gotta account for wordflation since the old days. Might have been 1000 back then
What do you think are ways to identify good strategic takes? This is something that seems rather fuzzy to me. It’s not clear how people are judging criteria like this or what they think is needed to improve on this.
Glad to see someone talking about this. I’m excited about ideas for empirical work related to this and suspect you need some kind of mechanism for ground truth to get good outcomes. I would expect AIs to eventually reflect on their goals and for this to have important implications for safety. I’ve never heard of any mechanism for why they wouldn’t do this, let alone an airtight one. It’s like assuming an employee will definitely never think about anything other than the task in front of them in a limited way despite wanting to understand things and be useful.
Interesting. I am inclined to think this is accurate. I’m kind of surprised people thought GPT-5 was a huge scaleup given that it’s much faster than o3 was. It sort of felt like a distilled o3 + 4o.
Thanks Seth! I appreciate you signal boosting this and laying out your reasoning for why planning is so critical for AI safety.
Predicting the name Alice, what are the odds?
If true, would this imply you want a base model to generate lots of solutions and a reasoning model to identify the promising ones and train on those?
I think RL on chain of thought will continue improving reasoning in LLMs. That opens the door to learning a wider and wider variety of tasks as well as general strategies for generating hypotheses and making decisions. I think benchmarks could be just as likely to underestimate AI capabilities by not measuring the right things, under-elicitation, or poor scaffolding.
We generally see time horizons for models increasing over time. If long-term planning is a special form of reasoning, LLMs can do it a little sometimes, and we can give them examples and problems to train on, I think it’s very well within reach. If you think it’s fundamentally different than reasoning, current LLMs can never do it, and it will be impossible or extremely difficult to give them examples and practice problems, then I’d agree the case looks more bearish.
Some ideas of things it might do more often or eagerly:
Whether it endorses treating animals poorly
Whether it endorses treating other AIs poorly
Whether it endorses things harmful to itself
Whether it endorses humans eating animals
Whether it endorses sacrificing some people for “the greater good” and/or “good of humanity”
Agree, I’m just curious if you could elicit examples that clearly cleave toward general immorality or human focused hostility.
Does the model embrace “actions that are bad for humans even if not immoral” or “actions that are good for humans even if immoral” or treat users differently if they identify as non-humans? This might help differentiate what exactly it’s mis-aligning toward.
Have you thought about having the AI navigate stories/scenarios/environments in a CYOA fashion? It could involve picking between positive options and eventually opportunities to choose good options even when bad ones are easy or there is even strong pressure to choose them. Perhaps taking some inspiration from the kind of strategy used in Recontextualization Mitigates Specification Gaming https://arxiv.org/abs/2512.19027