peterr

Karma: 174

Interested in AI forecasting, scenario analysis, and philosophy. Would like to do these fulltime.

peterr 1 Jan 2026 20:27 UTC
1 point
0
on: Special Persona Training: Hyperstition Progress Report 2
Have you thought about having the AI navigate stories/scenarios/environments in a CYOA fashion? It could involve picking between positive options and eventually opportunities to choose good options even when bad ones are easy or there is even strong pressure to choose them. Perhaps taking some inspiration from the kind of strategy used in Recontextualization Mitigates Specification Gaming https://arxiv.org/abs/2512.19027

peterr 27 Dec 2025 6:46 UTC
5 points
0
in reply to: Thomas Kwa’s comment on: leogao’s Shortform
What about negative effects on the symbiotic microbiome?

peterr 18 Dec 2025 22:05 UTC
0 points
0
on: Non-Scheming Saints (Whether Human Or Digital) Might Be Shirking Their Governance Duties, And, If True, It Is Probably An Objective Tragedy
What if it’s not worth seeking power? What if the world isn’t worth taking over? Saints seem to devote their lives to teaching that it isn’t worth getting caught up in ambitions and desires to control.

peterr 18 Dec 2025 17:57 UTC
2 points
0
on: Weird Generalization & Inductive Backdoors
At first I was really surprised by this because it seemed weird, but I find myself wondering if it’s actually quite similar to an analogous form of behavior in humans: stereotyping. The model jumps to the most “obvious” looking conclusion based on its associations without necessarily reflecting on what it’s doing or why. This makes me wonder if building in such loops with guidance on how to think about its own training could mitigate these effects.

peterr 10 Dec 2025 19:01 UTC
4 points
0
on: Insights into Claude Opus 4.5 from Pokémon
GPT-5.1 beating crystal in 108 hours is very interesting. I wonder why that’s the case compared to Gemini 3 Pro, which took ~424.5 hours. Do you have any thoughts?

peterr 8 Dec 2025 23:38 UTC
4 points
0
in reply to: Daniel Kokotajlo’s comment on: Daniel Kokotajlo’s Shortform
Yes, they both started with the same harness but there’s room for each model to customize its own setup so I’m not sure how much they might have diverged over time. I have 4x speedup as probably an upper bound but I was only counting since the final 2.5 stable release in June, which might be too short. Gemini 2.5 has 6 badges now compared to yesterday, so it’s probably too early to assume 4x is certain. But if it was 4x every 8 months then it should be able to match average human playtime by early 2027.

From the Gemini_Plays_Pokemon—Twitch:

“v2 centers on a smaller, flexible toolset (Notepad, Map Markers, code execution, on‑the‑fly custom agents) so Gemini can build exactly what it needs when it needs it.”

″The AI has access to a set of built-in tools to interact with the game and its own internal state:
- notepad_edit: Modifies an internal notepad, allowing the AI to write down strategies, discoveries, and long-term plans.
- run_code: Executes Python code in a secure, sandboxed environment for complex calculations or logic that is difficult to perform with reasoning alone.
- define_map_marker / delete_map_marker: Adds or removes markers on the internal map to remember important locations like defeated trainers, item locations, or puzzle elements.
- stun_npc: Temporarily freezes an NPC in place, which is useful for interacting with them.
- select_battle_option: Provides a structured way to choose actions during a battle, such as selecting a move or using an item.
Custom Tools & Agents
The most powerful feature of the system is its ability to self-improve by creating its own tools and specialized agents. You can view the live Notepad and custom tools/agents tracker on GitHub.
- Custom Tools (define_tool / delete_tool): If the AI identifies a repetitive or complex data-processing task, it can write and save its own Python scripts as new tools. For example, instead of relying on a pre-built pathfinder, it can write its own pathfinding tool from scratch to navigate complex areas like the spinner mazes in the Team Rocket Hideout.
- Custom Agents (define_agent / delete_agent): For complex reasoning tasks, the AI can define new, specialized instances of itself without any distracting context. These agents are given a unique system prompt and purpose, allowing them to excel at specific challenges like solving puzzles or developing high-level battle
  strategies.”

peterr 8 Dec 2025 22:51 UTC
4 points
1
in reply to: Daniel Kokotajlo’s comment on: Daniel Kokotajlo’s Shortform
Looking at the step count comparisons instead of time is interesting. Claude Opus 4.5 is currently at ~44,500 steps in Silph co., where it has been stuck for several days. So that should now be about 50% higher. The others look roughly right for Opus. It beat Mt. Moon in around 5 hours and was stuck at the Rocket Hideout for days.

I think the Gemini 3 Pro vs 2.5 Pro matchup in Pokemon Crystal was interesting. Gemini 3 cleared the game in ~424.5 hours last night while 2.5 only had ⁴⁄₁₆ badges at 435 hours.

peterr 28 Nov 2025 17:27 UTC
4 points
1
on: Postmodernism for STEM Types: A Clear-Language Guide to Conflict Theory
This is a really valuable post that clarifies some things I’ve found hard to articulate to people on each side. I think it’s difficult for people to balance when to use each of these epistemic frames without getting too sucked into one. And I imagine most people use these to different degrees at different times even if they may not realize it or one is rarer for them.

Looking forward to what you write next!

peterr 26 Nov 2025 3:36 UTC
1 point
0
in reply to: Daniel Tan’s comment on: Daniel Tan’s Shortform
Something similar I’ve been thinking about is putting models in environments with misalignment “temptations” like an easy reward hack and training them to recognize what this type of payoff pattern looks like (e.g. easy win but sacrifice principle) and NOT take it. Recent work shows some promising efforts getting LLMs to explain their reasoning, introspect, and so forth. I think this could be interesting to do some experiments with and am trying to write up my thoughts on why this might be useful and maybe what those could look like.

peterr 22 Nov 2025 14:32 UTC
4 points
1
in reply to: Yair Halberstadt’s comment on: Yair Halberstadt’s Shortform
Gotta account for wordflation since the old days. Might have been 1000 back then

peterr 16 Nov 2025 6:54 UTC
1 point
0
in reply to: habryka’s comment on: AI safety undervalues founders
What do you think are ways to identify good strategic takes? This is something that seems rather fuzzy to me. It’s not clear how people are judging criteria like this or what they think is needed to improve on this.

peterr 16 Sep 2025 17:09 UTC
6 points
2
on: LLM AGI may reason about its goals and discover misalignments by default
Glad to see someone talking about this. I’m excited about ideas for empirical work related to this and suspect you need some kind of mechanism for ground truth to get good outcomes. I would expect AIs to eventually reflect on their goals and for this to have important implications for safety. I’ve never heard of any mechanism for why they wouldn’t do this, let alone an airtight one. It’s like assuming an employee will definitely never think about anything other than the task in front of them in a limited way despite wanting to understand things and be useful.

peterr 21 Aug 2025 2:40 UTC
1 point
0
in reply to: Aaron_Scher’s comment on: Aaron_Scher’s Shortform
Interesting. I am inclined to think this is accurate. I’m kind of surprised people thought GPT-5 was a huge scaleup given that it’s much faster than o3 was. It sort of felt like a distilled o3 + 4o.

peterr 13 Jul 2025 18:03 UTC
9 points
0
in reply to: Seth Herd’s comment on: Seth Herd’s Shortform
Thanks Seth! I appreciate you signal boosting this and laying out your reasoning for why planning is so critical for AI safety.

peterr 12 Jun 2025 6:31 UTC
1 point
0
on: Scale Was All We Needed, At First
Predicting the name Alice, what are the odds?

peterr 22 Apr 2025 1:18 UTC
1 point
0
in reply to: Vladimir_Nesov’s comment on: Vladimir_Nesov’s Shortform
If true, would this imply you want a base model to generate lots of solutions and a reasoning model to identify the promising ones and train on those?

peterr 5 Mar 2025 21:57 UTC
3 points
0
on: A Bear Case: My Predictions Regarding AI Progress
I think RL on chain of thought will continue improving reasoning in LLMs. That opens the door to learning a wider and wider variety of tasks as well as general strategies for generating hypotheses and making decisions. I think benchmarks could be just as likely to underestimate AI capabilities by not measuring the right things, under-elicitation, or poor scaffolding.

We generally see time horizons for models increasing over time. If long-term planning is a special form of reasoning, LLMs can do it a little sometimes, and we can give them examples and problems to train on, I think it’s very well within reach. If you think it’s fundamentally different than reasoning, current LLMs can never do it, and it will be impossible or extremely difficult to give them examples and practice problems, then I’d agree the case looks more bearish.

peterr 1 Mar 2025 9:25 UTC
3 points
1
in reply to: Daniel Tan’s comment on: Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs
Some ideas of things it might do more often or eagerly:
1. Whether it endorses treating animals poorly
2. Whether it endorses treating other AIs poorly
3. Whether it endorses things harmful to itself
4. Whether it endorses humans eating animals
5. Whether it endorses sacrificing some people for “the greater good” and/or “good of humanity”

peterr 28 Feb 2025 6:48 UTC
6 points
0
in reply to: Daniel Tan’s comment on: Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs
Agree, I’m just curious if you could elicit examples that clearly cleave toward general immorality or human focused hostility.

peterr 28 Feb 2025 5:06 UTC
1 point
0
in reply to: Daniel Tan’s comment on: Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs
Does the model embrace “actions that are bad for humans even if not immoral” or “actions that are good for humans even if immoral” or treat users differently if they identify as non-humans? This might help differentiate what exactly it’s mis-aligning toward.