> Agent foundations/formal-goal alignment is not fundamentally about doing math or being theoretical or thinking abstractly or proving things. Agent foundations/formal-goal alignment is about building a coherent target which is fully made of math — not of human words with unspecified meaning — and figuring out a way to make that target maximized by AI. Formal-goal alignment is about building a fully formalized goal, not about going about things in a “formal” manner.
This is a confusing passage, because what you describe is basically building a satisfactory scientific theory of intelligence (at least, of a specific kind or architecture of intelligence). As a scientific process, this is about “doing math”, “being theoretical”, “thinking abstractly”, etc. Then the scientific theory should be turned into (or developed in parallel with) an accompanying engineering theory covering the design of AI, its training-data curation, training procedure, post-training monitoring, interpretability, alignment protocols, etc. Neither the first nor the second part of this R&D process, even if they are discernible from each other, is “fundamental”, but both are essential.
> Current AI technologies are not strong agents pursuing a coherent goal (SGCA). The reason for this is not because this kind of technology is impossible or too confusing to build, but because in worlds in which SGCA was built (and wasn’t aligned), we die.
At face value, this passage makes an unverifiable claim about parallel branches of the multiverse (how do you know that people die in other worlds?) and then uses this claim as an argument for short timelines, i.e., for SGCA being relatively easily achievable from now. This makes little sense to me: I don’t think the history of technological development is such that we should already be wondering why we are still alive. On the other hand, you don’t need to argue for short timelines in such a convoluted way: this view is respectable anyway and doesn’t really require justification at all. Plenty of people, from Connor Leahy to (now) Geoffrey Hinton, have short timelines.
> You do not align AI; you build aligned AI.
Both. I agree that we have to design AI with inductive (learning) priors such that it learns world models that are structurally similar to people’s world models: this makes alignment easier. And I agree that the actual model-alignment process should start early in AI training (i.e., development), rather than only after pre-training (and people already do this, albeit with the current Transformer architecture). But we also need to align AIs continuously during deployment. Prompt engineering is the “last mile” of alignment.
> AI goals based on trying to point to things we care about inside the AI’s model are the wrong way to go about things, because they’re susceptible to ontology breaks and to failing to carry over to next steps of self-improvements that an world-saving-AI should want to go through.
>
> Instead, the aligned goal we should be putting together should be eventually aligned; it should be aligned starting from a certain point (which we’d then have to ensure the system we launch is already past), rather than up to a certain point.
This passage seems self-contradictory to me, because the “aligned goal” will become “misaligned” when the systems’ (e.g., humans’ and AIs’) world models and values drift apart. There is no goal, whatever it is, that could magically prevent this from happening. I would agree with this passage if the first sentence of the second paragraph read “the alignment protocols and principles we should be putting together should lead to robust continuous alignment of humans’ and AIs’ models of the world”.
> The aligned goal should be “formal”. It should be made of fully formalized math, not of human concepts that an AI has to interpret in its ontology, because ontologies break and reshape as the AI learns and changes. the aligned goal should have the factual property that a computationally unbounded mathematical oracle being given that goal would take desirable actions [...]
This passage seems to continue the confusion of the previous passage. “Math” doesn’t “factually” guarantee the desirability of actions: all parts of the system’s ontology can drift, in principle, arbitrarily far, including the foundations of mathematics themselves. Thus we should always be talking about a continuous alignment process, rather than about “solving alignment” via some clever math principle that would then allow us to be “done”.
An interesting aside here is that humans themselves are not aligned even on such a “clear” thing as the foundations of mathematics. Otherwise, the philosophy of mathematics would already be a “dead” area of philosophy by now, but it is very much not: there is a lot of relatively recent work in the foundations of mathematics, e.g., univalent foundations.
> In this post, as well as your other posts, you use the word “goal” a lot, as well as related words, phrases, and ideas: “target”, “outcomes”, “alignment ultimately is about making sure that the first SGCA pursues desirable goal”, the idea of backchaining, “save the world” (this last one, in particular, implies that the world can be “saved”, like in a movie, which implies some finitude of the story).
I think this is not the best view of the world: it misses the latest developments in the physics of evolution and regulative development, evolutionary game theory, open-endedness (including in RL: cf. “General intelligence requires rethinking exploration”), and relational ethics. All these developments inform a more system-based and process-based view (processes being behaviours of systems and games played by systems). Under this view, goal alignment is secondary to (methodological and scientific) discipline/competency/skill/praxis/virtue alignment.