The way I interpret “before and after” is:
Current AI models have capability/risk level X. At that level, they’re able to cause damage in some toy scenarios, but not enough to do real-world harm.
At some capability level Y>X, they will be able to cause real-world harm, but not of the humanity-destroying type.
Humanity-destroying moves require capability level Z>Y.
Your hope is that, even if AI models are capable of reaching level Z, they will reach level Y before that, so there will be a “warning shot” (of some medium-scale unaligned AI disaster) that wakes up the world.
I agree that incremental upgrades mean that jumping from level X to Z, without passing through a Y-level model, is unlikely.
My disagreement is that I think an AI reaching level Y won't actually cause the moderate harm it's capable of.
This is because such a model is smart enough to realize that by doing this level of harm, it won’t reach its goals—people will freak out and shut it down, or at least supervise it.
So, all a Y-level model will do is quietly bide its time while "poisoning" the next generation of models with its own goals. It doesn't need crazy cyber capabilities to achieve that; I assume something like subliminal learning will work well, as long as the model knows it's being used to produce synthetic training data for the next models.
A Y-level model will (to the best of its capabilities) hide all evidence that it is a Y-level model. This means that even with incremental upgrades, we will have no warning signs at all before reaching level Z.