Here’s my attempted phrasing, which might avoid some of the common confusions:
Suppose we have a model M with utility function ϕ, where M is not capable of taking over the world. Assume that thanks to a bunch of alignment work, ϕ is within δ (by some metric) of humanity’s collective utility function. Then in the process of maximizing ϕ, M ends up doing a bunch of vaguely helpful stuff.
Then someone releases model M′ with utility function ϕ′, where M′ is capable of taking over the world. Suppose that our alignment techniques generalize perfectly. That is, ϕ′ is also within δ of humanity’s collective utility function. Then in the process of maximizing ϕ′, M′ gets rid of humans and rearranges their molecules to satisfy ϕ′ better.
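To make the “same δ, very different outcomes” intuition concrete, here is a minimal toy sketch. Everything in it is my own construction, not part of the scenario above: it assumes δ is measured as the mean absolute difference between the two utility functions over “ordinary” states, and it uses made-up one-dimensional utility functions. A weak optimizer that can only reach ordinary states does fine with the proxy; a strong optimizer that can reach extreme states finds exactly the places where a small on-distribution δ constrains nothing.

```python
# Toy sketch (assumed setup): states are real numbers, and δ is measured as the
# mean absolute difference between true and proxy utility over ordinary states.
import numpy as np

rng = np.random.default_rng(0)

def true_utility(x):
    # Stand-in for humanity's collective utility: likes moderate states, hates extremes.
    return np.exp(-x**2)

def proxy_utility(x):
    # Learned proxy ϕ: nearly identical on ordinary states, but keeps rising
    # for extreme x, where it was never compared against the true utility.
    return np.exp(-x**2) + 0.01 * x**2

# δ, evaluated where alignment work actually checked the proxy: ordinary states.
ordinary_states = rng.normal(0.0, 1.0, size=100_000)
delta = np.mean(np.abs(true_utility(ordinary_states) - proxy_utility(ordinary_states)))
print(f"delta on ordinary states: {delta:.3f}")  # ~0.01, i.e. small

def optimize(utility, search_radius, n_samples=100_000):
    # Crude optimizer: sample candidate states within its reach, pick the best by `utility`.
    candidates = rng.uniform(-search_radius, search_radius, size=n_samples)
    return candidates[np.argmax(utility(candidates))]

# M: weak optimizer, can only reach ordinary states.
x_weak = optimize(proxy_utility, search_radius=2.0)
# M': strong optimizer, can reach extreme states (it "can take over the world").
x_strong = optimize(proxy_utility, search_radius=100.0)

print(f"weak optimizer:   true utility of chosen state = {true_utility(x_weak):.3f}")   # ~1.0
print(f"strong optimizer: true utility of chosen state = {true_utility(x_strong):.3f}") # ~0.0
```

The design choice doing the work here is the phrase “by some metric”: if δ is evaluated on the distribution of states that weak models (and humans) actually visit, it can be tiny while saying nothing about the extreme states only M′ can reach, which is where the capability difference between M and M′ bites.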
Does this phrasing seem accurate and helpful?