This might be the most valuable article on alignment yet written, IMO. I don’t have enough upvotes. I realize this sounds like hyperbole, so let me explain why I think this.
This is so valuable because of the effort you’ve put into a gears-level model of the AGI at the relevant point. The relevant point is the first time the system has enough intelligence and self-awareness to understand and therefore “lock in” its goals (and around the same point, the intelligence to escape human control if it decides to).
Of course this work builds on a lot of other important work in the past. It might be the most valuable so far because it’s now possible (with sufficient effort) to make gears-level models of the crucial first AGI systems that are close enough to allow correct detailed conclusions about what goals they’d wind up locking in.
If this gears-level model winds up being wrong in important ways, I think the work is still well worthwhile; it’s creating and sharing a model of AGI, and practicing working through that model to determine what goals it would settle on.
I actually think the question of which of those goals would win out can’t be answered given the premise. I think we need more detail about the architecture and training to have much of a guess about what goals would wind up dominating (although both strictly following the developers’ intent and closely capturing “the spec” seem quite unlikely in the scenario you’ve presented).
So I think this model doesn’t yet contain enough gears to allow predicting its behavior (in terms of which goals win out and become locked in or reflectively stable).
Nonetheless, I think this is the work that is most lacking in the field right now: getting down to specifics about the type of systems most likely to become our first takeover-capable AGIs.
My work is attempting to do the same thing. “Seven sources of goals in LLM agents” lays out the same problem you present here, while “System 2 Alignment” works toward answering it.
I’ll leave more object-level discussion in a separate comment.
Thanks! I’m so glad to hear you like it so much. If you are looking for things to do to help, besides commenting of course, I’d like to improve the post by adding in links to relevant literature + finding researchers to be “hypothesis champions,” i.e. to officially endorse a hypothesis as plausible or likely. In my ideal vision, we’d get the hypothesis champions to say more about what they think and why, and then we’d rewrite the hypothesis section to more accurately represent their view, and then we’d credit them + link to their work. When I find time I’ll do some brainstorming + reach out to people; you are welcome to do so as well.