I wonder how well this holds up in other domains. I don’t think there is any realistic policy path forward for what I’m about to say, but I shall say it anyway: it would be nice if, in our current attempts at building AGI, we filtered out all data about programming/hacking/coding to reduce escape risks. An ASI would still outwit us and escape in this scenario, but perhaps it would widen our window of opportunity for stopping a dangerous not-yet-superintelligent AGI.
I’m doubtful of a policy path forward here because the coding capabilities of these systems are among the top economic incentives for building them in the first place.
Also, on the subject of filtering, I’ve wondered for a while now whether it would be a good idea to filter out all training data about AI alignment, along with stories about AI causing mass destruction. Obviously, an ASI could get far theorizing about these fields and possibilities without that data, but maybe its absence would stall a not-yet-superintelligent AGI.
I’m also interested in these questions! In particular, I want to explore how effectively we can filter out offensive cyber knowledge without compromising non-security software engineering skills. Since so many of our proposed control protocols are cyber-based, limiting models’ cyber knowledge seems like it would help with AI control. This is plausibly difficult, but I thought the same about biology. Regardless, I think it would be such a big win if we pulled it off that it’s worth studying.
I’m actively studying the degree to which AI safety discourse, including discussions of control and evaluation protocols, is present in popular pretraining datasets. The hope is that we could then train models with this knowledge filtered out and see how their behavior changes. This work is still in its early stages, though. A more ambitious version would be to run empirical experiments on Self-Fulfilling Misalignment.
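To give a sense of what the first step could look like, here is a minimal sketch of a keyword-based prevalence scan over a public pretraining corpus. The phrase list, the choice of C4 as the corpus, and the `estimate_prevalence` helper are illustrative assumptions on my part, not a description of the actual study; a real measurement would need a much broader taxonomy and probably a trained classifier rather than regexes.

```python
import re
from itertools import islice

from datasets import load_dataset  # pip install datasets

# Hypothetical phrase list; a real study would use a much richer taxonomy
# of AI safety / control / evaluation discourse.
AI_SAFETY_PATTERNS = [
    r"\bAI alignment\b",
    r"\bAI control\b",
    r"\binstrumental convergence\b",
    r"\breward hacking\b",
    r"\bdeceptive alignment\b",
    r"\bAI takeover\b",
]
PATTERN = re.compile("|".join(AI_SAFETY_PATTERNS), flags=re.IGNORECASE)


def estimate_prevalence(dataset_name: str, config: str, text_field: str, n_docs: int) -> float:
    """Estimate the fraction of sampled documents that match the keyword list."""
    # Stream the corpus so we never have to download the whole dataset.
    stream = load_dataset(dataset_name, config, split="train", streaming=True)
    hits = 0
    for doc in islice(stream, n_docs):
        if PATTERN.search(doc[text_field]):
            hits += 1
    return hits / n_docs


if __name__ == "__main__":
    # C4 is used purely as a stand-in for "a popular pretraining dataset".
    frac = estimate_prevalence("allenai/c4", "en", "text", n_docs=10_000)
    print(f"~{frac:.4%} of sampled documents match the AI-safety keyword list")
```

The same matcher could then be inverted into a document-level filter to produce the "knowledge removed" training set for the comparison run.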
I think my main point is that there is a lot of low-hanging fruit here! It’s plausible that there are more conceptually simple interventions like Deep Ignorance that, in aggregate, buy us some safety. I think it’s worth the community putting a lot more effort into pretraining research.