My apologies for challenging the premise, but I don’t understand how anyone could hope to be “convinced” that humanity isn’t doomed by AGI unless they have a provably safe design and high confidence they can implement it ahead of any rivals.
Put aside all of the assumptions you think the pessimists are making and simply ask whether humanity knows how to make a mind that will share our values. If it does, please tell us how. If it doesn’t, then accept that any AGI we make is, by default, alien—and building an AGI is like opening a random portal to invite an alien mind to come play with us.
What is your prior for alien intelligence playing nice with humanity—or for humanity being able to defeat it? I don’t think it’s wrong to say we’re not automatically doomed. But let’s suppose we open a portal and it turns out ok: We share tea and cookies with the alien, or we blow its brains out. Whatever. What’s to stop humanity from rolling the dice on another random portal? And another? Unless we just happen to stumble on a friendly alien that will also prevent all new portals, we should expect to eventually summon something we can’t handle.
Feel free to place wagers on whether humanity can figure out alignment before getting a bad roll. You might decide you like your odds! But don’t confuse a wager with a solution.
This is important work.
One suggested tweak: I notice this document starts leaning on the term “loss” in section 4.2 but doesn’t tell the reader what it means in this context until 4.3.
Something similar happens with the concept of “weights”, first used in section 1.3, but only sort-of-explained later, in 4.2.
Speaking of weights, I notice myself genuinely confused in section 5.2, and I’m not sure if it’s a problem with the wording or with my current mental model (which is only semi-technical). A quoted forecast reads:
Wouldn’t the model doing the sharing have, by definition, different weights than the recipient? (How is a model’s “knowledge” stored if not in the weights?) My best guess: the shareable “knowledge” would take the form of difference vectors over the models’ common base weights—which should work as long as there hasn’t been too much divergence since the fork. Is that right? And if so, is there some reason this is a forecast capability rather than a current one?
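To make my guess concrete, here is a minimal sketch of what I imagine “knowledge as difference vectors over a common base” would look like, assuming PyTorch-style state dicts; the function names and setup are my own illustration, not anything from the document or the quoted forecast.

```python
# Illustrative sketch only: sharing what model A learned with model B,
# where both were forked from the same base model and so share an
# architecture and parameter names.
import torch


def extract_delta(finetuned_state: dict, base_state: dict) -> dict:
    """Per-parameter difference between a fine-tuned model and its base."""
    return {name: finetuned_state[name] - base_state[name] for name in base_state}


def apply_delta(recipient_state: dict, delta: dict, scale: float = 1.0) -> dict:
    """Add the (optionally scaled) delta onto a sibling forked from the same base."""
    return {name: recipient_state[name] + scale * delta[name] for name in recipient_state}


# Hypothetical usage, assuming `base`, `model_a`, and `model_b` are
# torch.nn.Module instances forked from the same base weights:
#   delta = extract_delta(model_a.state_dict(), base.state_dict())
#   model_b.load_state_dict(apply_delta(model_b.state_dict(), delta))
```

If something like this is what the forecast has in mind, my confusion stands: the arithmetic only obviously works while the two models remain close to the shared base, which is exactly the condition I’d expect to break down over time.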