(Derailing: what I am saying here is not central to the argument you are making.)
just end up with someone making a bunch of vaguely safety-adjacent RL environments that get sold to big labs
While I think building safety-adjacent RL envs is worse than most kinds of technical safety work for people who have very high context in AGI safety, I think it’s net positive.
I think you reduce P(doom) by doing prosaic AI safety well (you train AIs to behave nicely, you don’t squash away malign-looking CoT, you try not to have envs that create too much increased situational awareness, you do some black-box and maybe white-box auditing to probe for malign tendencies, you monitor for bad behavior in deployment, you try not to give too many affordances to AIs when it’s not too costly), especially if takeoffs are relatively slow, because it gives you more opportunities to catch early instances of scheming-related misalignment and more time to use mostly-aligned AIs to do safety research. And training AIs to behave more nicely than current AIs (less lying, less randomly taking initiative in ways that break security invariants, etc.) is important because:
it reduces AIs’ plausible deniability if they are malign (e.g. if AIs wrote insecure code for scheming-related reasons, I wouldn’t know it even if I caught them writing insecure code, because current AIs do this all the time and are almost surely not doing it for scheming-related reasons)
it lets you have more trigger-happy monitors (e.g. it’s currently hard to have monitors that catch “lies that matter” because current AIs lie all the time about things like whether they succeeded at their task)
(unsure) I suspect some safety-adjacent RL environments have non-trivial transfer to reducing real misalignment (before AIs are wildly superhuman)
(unclear sign) less lying/research-sycophancy probably disproportionately helps with fuzzy tasks, which alignment research is often more about (e.g. I don’t think sycophancy in research settings is that bad when you are doing capabilities research, but I suspect issues like this could make AIs unusable for safety research? Unclear)
I think the main negative effect is making AGI companies look more competent and less insanely risky than they actually are, and avoiding some warning shots. I don’t know how I feel about this. Not helping AGI companies pick the low-hanging fruit that actually makes the situation a bit better, just so that they look more incompetent, does not seem like an amazing strategy to me if, like me, you believe there is a >50% chance that well-executed prosaic stuff is enough to get to a point where AIs more competent than us are aligned enough to do the safety work to align more powerful AIs. I suspect AGI companies will be PR-maxing and will build the RL environments that make them look good regardless, such that the safety-adjacent RL envs that OP subsidizes won’t help much with PR, so I don’t think the PR effects will be very big. And if better safety RL envs would have prevented your warning shots, AI companies will be able to just say “oops, we’ll use more safety-adjacent RL envs next time, look at this science showing it would have solved it”, and I think it will look like a great argument. I think you will get fewer but more information-rich warning shots if you actually do the safety-adjacent RL envs. (And for the science, you can always do the thing where you train without the safety-adjacent RL envs and show that you might have gotten scary results; I know people working on such projects.)
And because it’s a baseline level of sanity that you need for prosaic hopes, this work might end up being done by people with higher AGI safety context if it isn’t done by people with less context. (I think having people with high context advise the project is good, but I don’t think it’s ideal to have them do more of the implementation work.)
I think it’s a pretty high-variance activity! It’s not that I can’t imagine any kind of RL environment that might make things better, but most of them will just be used to make AIs “more helpful” and serve as generic training data to ascend the capabilities frontier.
Like, yes, there are some more interesting monitor-shaped RL environments, and I would actually be interested in digging into the details of how good or bad some of them would be, but the things I am expecting here are more like “oh, we made a Wikipedia navigation environment, which reduces hallucinations in AI, which is totally helpful for safety, I promise”, when really, I think that is just a straightforward capabilities push.
As part of my startup exploration, I would like to discuss this as well. It would be helpful to clarify my thinking on whether there’s a shape of such a business that could be meaningfully positive. I’ve started reaching out to people who work in the labs to get better context on this. I think it would be good to dig deeper into Evan’s comment on the topic.
I’m going to start a Google Doc, but I would love to talk in person with folks in the Bay about this to ideate and refine it faster.