I would argue that there are true miracles here, even though I don’t think we’re doomed: we know enough to say that alignment probably isn’t going to be solved by a simple trick, but that doesn’t mean the problem is impossible.
The biggest miracles would be, in order of how surprising they would be:
Deceptive/Inner alignment either not proving to be a problem, or there being a broad basin around honesty that’s easy to implement, such that, in the best case, we may not need much interpretability.
Causal, Extremal, and Adversarial Goodhart not being a problem, or being easy to correct.
ELK being solved by default.
Outer Alignment being easy to implement in the real world via HCH, through imitative amplification/IDA.