Anthony Bailey comments on AIs should also refuse to work on capabilities research

Anthony Bailey 30 Oct 2025 1:19 UTC
9 points
8
I have a question on a topic adjacent enough that I reckon it is worth asking here, of those likely to read the thread.

It seems that warning shots are more likely to be unsuccessful because of the winner’s curse: the first models to take a shot will be those that have most badly overestimated their chances, and this in turn correlates with weaker intellectual capabilities.

Has there been any illuminating discussion on this and its downstream consequences? E.g. how shots and aftermath are likely in practice to be perceived in general, by the better-informed, and—in the context of this post—by competing AIs? What dynamics result?
- Davidmanheim 30 Oct 2025 11:42 UTC
  2 points
  0
  This is a good question, albeit only vaguely adjacent.
  
  My answer would be that the winner’s curse only applies if firms aren’t actively minimizing the extent to which they overbid. In the current scenario, firms are trying (moderately) hard to prevent disaster, just not reliably enough to succeed indefinitely. However, once they fail, we could easily be far past the overhang point for the AI succeeding.
  
  To start, assume (implausibly) that the AI itself is not strategic enough to consider its chances of succeeding, and that AI capabilities nonetheless keep increasing. The firms can detect and prevent it from trying with some probability, but their ability to monitor and stop it is decreasing. The better AI firms are at stopping the models, and the more slowly that ability declines relative to model capability, the more likely it is that when they do fail, the AI will succeed. And if the AIs are strategic, they will be much less likely to try when they are likely to either fail or be detected, so they will wait even longer.
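
The argument in the reply above can be made concrete with a toy calculation. This is only a sketch under assumptions that are not in the thread: the firm's chance of stopping an attempt decays geometrically (`d0 * decay ** t`), the AI's chance of succeeding once unstopped grows linearly (`growth * t`, capped at 1), and a strategic AI is modelled by a hypothetical `min_odds` threshold below which it does not try. The function name and all parameters are illustrative.

```python
def success_at_first_failure(d0, decay, growth, min_odds=0.0, horizon=200):
    """Probability that the AI succeeds on the first attempt the firm fails to stop.

    Assumed toy model (not from the thread):
      detect(t)  = d0 * decay ** t        firm's chance of stopping an attempt
      succeed(t) = min(1, growth * t)     AI's chance of success if not stopped
    A strategic AI only attempts when (1 - detect) * succeed >= min_odds.
    """
    still_contained = 1.0  # probability that every attempt so far was stopped
    success_mass = 0.0     # probability of "first unstopped attempt succeeds"
    for t in range(horizon):
        detect = d0 * decay ** t
        succeed = min(1.0, growth * t)
        if (1.0 - detect) * succeed < min_odds:
            continue  # strategic AI waits; a naive AI (min_odds=0) never skips
        success_mass += still_contained * (1.0 - detect) * succeed
        still_contained *= detect
    escaped = 1.0 - still_contained  # probability the firm failed within the horizon
    return success_mass / escaped if escaped > 0 else 0.0


if __name__ == "__main__":
    print("weak firm:      ", success_at_first_failure(0.50, 0.90, 0.02))
    print("strong firm:    ", success_at_first_failure(0.99, 0.90, 0.02))
    print("slower decline: ", success_at_first_failure(0.90, 0.99, 0.02))
    print("strategic AI:   ", success_at_first_failure(0.90, 0.90, 0.02, min_odds=0.5))
```

Under these assumed functional forms, a higher `d0` or a `decay` closer to 1 pushes the first containment failure later, when `succeed` is larger, and a positive `min_odds` pushes it later still. Different functional forms could change the numbers, though the direction of the effect follows from capability rising over time.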