This analogy falters a bit if you consider the research proposals that use advanced AI to police itself (a.k.a., tigers controlling tigers). I hope we can scale robust versions of that.
I’ve worked a bit on these kinds of proposals and I’m fairly confident that they fundamentally don’t scale indefinitely.
The limiting factor is how well a model can distinguish a genuine chance to misbehave from the honeypots you're using to catch it out, and it turns out models can do this pretty well.
(There are mitigations, but the mitigations introduce further problems that aren't obviously easier to deal with.)
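To make the "limiting factor" point concrete, here's a toy back-of-the-envelope sketch (my own illustration, not something from this thread, and all parameters `h`, `d`, `f` are made-up assumptions): if the model only misbehaves on opportunities it judges not to be honeypots, the per-defection probability of getting caught collapses as its honeypot-discrimination improves.

```python
# Toy model of honeypot-based catching (illustrative only; parameters are assumptions).
# Setup: a deployed model occasionally gets a chance to misbehave. A fraction `h`
# of those chances are honeypots planted by overseers. Before misbehaving, the
# model runs its own "is this a honeypot?" check, with true-positive rate `d`
# (correctly flags honeypots) and false-positive rate `f` (wrongly flags genuine
# opportunities). It only misbehaves on chances it judges to be genuine.

def p_caught_per_defection(h: float, d: float, f: float) -> float:
    """Probability that a single act of misbehaviour lands on a honeypot."""
    defect_on_honeypot = h * (1.0 - d)        # honeypot slips past the model's check
    defect_on_real = (1.0 - h) * (1.0 - f)    # genuine opportunity passes the check
    total_defections = defect_on_honeypot + defect_on_real
    return defect_on_honeypot / total_defections if total_defections else 0.0

if __name__ == "__main__":
    h, f = 0.5, 0.05  # even if half of all apparent opportunities are honeypots...
    for d in (0.0, 0.5, 0.9, 0.99):
        print(f"discrimination d={d:.2f}: "
              f"P(caught per defection) = {p_caught_per_defection(h, d, f):.3f}")
    # As d -> 1 the catch probability goes to 0: the scheme's leverage is bounded
    # by how well the model can tell real opportunities from your honeypots.
```

With these (made-up) numbers the catch probability per defection falls from roughly 0.5 at d=0 to about 0.01 at d=0.99, which is the sense in which discrimination ability, not honeypot volume, is the bottleneck.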