> But this was misdirection: we are arguing about how surprised we should be when a competent agent doesn’t learn a very simple lesson after making the mistake several times. Optimality is misdirection; the thing you’re defending is extreme sub-optimality, and the thing I’m arguing for is human-level ability-to-correct-mistakes.
I agree that this is the thing we’re arguing about. I do think there’s a reasonable chance that the first AIs capable of scary things[1] will have much worse sample efficiency than humans, and as such will be much worse than humans at learning from their mistakes. Maybe 30%? In such worlds, intervening on the propensity of AI agents to do dangerous things because they are overconfident in their model of why the dangerous thing is safe seems very high-leverage.
> I think focusing on the “first AI smart enough” leads to a lot of low-EV research. If you solve a problem with the first AI smart enough, this doesn’t help much because a) there are presumably other AIs of similar capability, or soon will be, with somewhat different capability profiles, b) it won’t be long before there are more capable AIs, and c) it’s hard to predict future capability profiles.
a. Ideally the techniques for reducing the propensity of AI agents to take risks due to overconfidence would be public, such that any frontier org would use them. The organizations deploying the AI don’t want that failure mode, the people asking the AIs to do things don’t want it, and even the AIs themselves (to the extent that they can be modeled as having coherent preferences[2]) don’t want it. Someone might still do something dumb, but I expect that making the tools to avoid that mistake available and easy to use will reduce the chances of that particular failure mode.
b. Unless civilization collapses due to a human or an AI making a catastrophic mistake before then.
c. Sure, but I think it makes sense to invest nontrivial resources in the scenario of “the future is basically what you would expect if present trends continued with no surprises”. The exact unsurprising path you project this way isn’t very likely to pan out, but the plans you make and the tools and organizations you build may still be adaptable when those surprises do occur.
Basically this entire thread was me disagreeing with
> Trying to address minor capability problems in hypothetical stupid AIs is irrelevant to x-risk.
because I think “stupid” scary AIs are in fact fairly likely, and it would be undignified for us to all die to a “stupid” scary AI accidentally ending the world.
[1] Concrete examples of the sorts of things I’m thinking of:
- Build a more capable successor
- Do significant biological engineering
- Manage a globally significant infrastructure project (e.g. “tile the Sahara with solar panels”)
[2] I think this extent is higher with current LLMs than commonly appreciated, though this is way out of scope for this conversation.
Reality has a surprising amount of detail[1]. If the training objective is improved by better modeling the world, and the model does not have enough parameters to capture everything about the world that would help reduce loss, the model will still learn lots of the world’s incidental complexity. As a concrete example, I can describe a stadium at the confluence of two rivers, next to a Marriott named for the river, in a city called Rome, and ask what country I’m in. The current frontier models know enough about the world that they can, without tools or even any substantial chain of thought, correctly answer that trick question[2]. To be able to answer questions like this from memory, models have to know lots of geographical details about the world.
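Here’s a minimal sketch of how you could try this yourself; the model name and the exact question wording below are placeholders, not necessarily what I actually ran:

```python
# Minimal sketch: pose the trick question with no tools and little room
# for chain of thought. Model name and question wording are placeholders.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

question = (
    "I'm in a city named Rome, in a stadium at the confluence of two "
    "rivers, next to a Marriott named for the river. What country am I in?"
)

response = client.chat.completions.create(
    model="gpt-4o",  # placeholder: any frontier model
    messages=[
        {"role": "system", "content": "Answer in one short sentence."},
        {"role": "user", "content": question},
    ],
    max_tokens=30,  # leaves no room for substantial chain of thought
)
print(response.choices[0].message.content)  # expected: the United States
```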
Unless your technique for extracting a sparse, modular world model produces a world model that is larger than the model it came from, I think removing the things that count as noise according to that sparse modular model will almost certainly hurt performance on factual recall tasks like this one.
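As a crude illustration of what I expect (a sketch, not an experiment I’ve run: magnitude pruning stands in for whatever “remove the noise” concretely means, and a small open model stands in for a frontier one):

```python
# Sketch: zero out the smallest-magnitude weights (a crude stand-in for
# "remove what the sparse modular model calls noise") and check whether
# recall of an incidental geographic fact degrades.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "EleutherAI/pythia-70m"  # placeholder small model
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)
model.eval()

def answer_nll(prompt: str, answer: str) -> float:
    """Negative log-likelihood the model assigns to `answer` given `prompt`."""
    ids = tok(prompt + answer, return_tensors="pt").input_ids
    prompt_len = tok(prompt, return_tensors="pt").input_ids.shape[1]
    with torch.no_grad():
        logits = model(ids).logits
    logprobs = torch.log_softmax(logits[0, :-1], dim=-1)
    rows = torch.arange(prompt_len - 1, ids.shape[1] - 1)
    return -logprobs[rows, ids[0, prompt_len:]].sum().item()

prompt = "Rome, Georgia sits at the confluence of the Etowah and the"
answer = " Oostanaula"

before = answer_nll(prompt, answer)

# Prune the 30% smallest-magnitude entries of every weight matrix.
with torch.no_grad():
    for p in model.parameters():
        if p.ndim == 2:
            k = max(1, int(0.3 * p.numel()))
            threshold = p.abs().flatten().kthvalue(k).values
            p[p.abs() < threshold] = 0.0

after = answer_nll(prompt, answer)
print(f"NLL of the fact before pruning: {before:.2f}, after: {after:.2f}")
# Expectation: the NLL rises, i.e. the pruned "noise" was carrying
# incidental detail like which rivers meet at Rome, Georgia.
```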
[1] See the essay by that name for some concrete examples.
[2] The trick is that there is a second city named Rome in the United States, in the state of Georgia. Both Romes contain a confluence of two rivers, both contain river walks, both contain Marriotts, and both contain stadiums, but only the Rome in the US contains a stadium at the confluence of two rivers next to a Marriott named for its proximity to the river.