As a brief aside, I don’t think there is a single “introduction to AI x-risk” resource that rigorously and compellingly presents, from start to finish, the core arguments around AI x-risk.
In general, +1 for the post, although “miracles” doesn’t feel like the right description of these scenarios; “reasons for hope” fits better. (I’m pretty unimpressed by the “miracles” terminology overall, since nobody has a model of the future of AI robust enough that it’s anywhere near reasonable to call violations of that model “miracles”.)
I would argue that there are true miracles here, despite thinking we aren’t doomed: we know enough to say that alignment probably isn’t going to be solved by a simple trick, but that doesn’t mean the problem is impossible.
The biggest miracles, in order of how surprised I’d be, would be:
Deceptive/inner alignment either not proving to be a problem, or there being a broad basin around honesty that’s easy to implement, such that in the best case we may not need much interpretability.
Causal, Extremal, and Adversarial Goodhart not being a problem, or being easy to correct.
ELK being solved by default.
Outer alignment being easy to achieve in the real world via HCH, implemented through imitative amplification/IDA.
Any specific things you think The Alignment Problem from a Deep Learning Perspective misses?