faul_sname comments on Tim Hua’s Shortform

faul_sname 23 Dec 2025 21:12 UTC
2 points
0
But this was misdirection, we are arguing about how surprised we should be when a competent agent doesn’t learn a very simple lesson after making the mistake several times. Optimality is misdirection, the thing you’re defending is extreme sub-optimality and the thing I’m arguing for is human-level ability-to-correct-mistakes.
I agree that this is the thing we’re arguing about. I do think there’s a reasonable chance that the first AIs which are capable of scary things^[1] will have much worse sample efficiency than humans, and as such be much worse than humans at learning from their mistakes. Maybe 30%? Intervening on the propensity of AI agents to do dangerous things because they are overconfident in their model of why the dangerous thing is safe seems very high leverage in such worlds.
I think focusing on the “first AI smart enough” leads to a lot of low-EV research. If you solve a problem with the first AI smart enough, this doesn’t help much because a) there are presumably other AIs of similar capability, or soon will be, with somewhat different capability profiles and b) it won’t be long before there are more capable AIs and c) it’s hard to predict future capability profiles.
a. Ideally the techniques for reducing the propensity of AI agents to take risks due to overconfidence would be public, such that any frontier org would use them. The organizations deploying the AI don’t want that failure mode, the people asking the AIs to do things don’t want the failure mode, even the AIs themselves (to the extent that they can be modeled as having coherent preferences^[2]) don’t want the failure mode. Someone might still do something dumb, but I expect making the tools to avoid that dumb mistake available and easy to use will reduce the chances of that particular dumb failure mode.
b. Unless civilization collapses due to a human or an AI making a catastrophic mistake before then
c. Sure, but I think it makes sense to invest nontrivial resources in the case of “what if the future is basically how you would expect if present trends continued with no surprises”. The exact unsurprising path you project in such a fashion isn’t very likely to pan out, but the plans you make and the tools and organizations you build might be able to be adapted when those surprises do occur.
Basically this entire thread was me disagreeing with

> Trying to address minor capability problems in hypothetical stupid AIs is irrelevant to x-risk.
because I think “stupid” scary AIs are in fact fairly likely, and it would be undignified for us to all die to a “stupid” scary AI accidentally ending the world.
1. ^
  Concrete examples of the sorts of things I’m thinking of:
  Build a more capable successor
  Do significant biological engineering
  Manage a globally-significant infrastructure project (e.g. “tile the Sahara with solar panels”)
2. ^
  I think this extent is higher with current LLMs than commonly appreciated, though this is way out of scope for this conversation.