This seems confused—the goal of alignment is not to imbue an SI with the universally correct values, but with the values of humanity or some specific user. Alignment is a two-place predicate parameterized by both the user and the SI.
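To make the arity point concrete (notation mine, purely illustrative):

$$\mathrm{Aligned}(S, P) \iff S \text{ optimizes for the values (or the CEV) of principal } P$$

A one-place $\mathrm{Aligned}(S)$ only makes sense if there is a privileged value set independent of any particular principal, which is exactly the moral-realism assumption at issue.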
This seems like an uncharitable analysis.
AI alignment is often colloquially reduced to 'human values', or to alignment with the intended goals of its designers, but as a formal definition that is contestable: https://www.lesswrong.com/w/ai-alignment deliberately uses 'good outcomes' as a general descriptor, and that wording seems intentional.
I explicitly included 'CEV-like' in the parenthetical and qualified the claim itself as bracketed to moral patienthood, given that we can often retroactively recognize earlier value-unaligned treatment of beings precisely because our horizon of moral concern has expanded.
The discussion on this forum of alignment as it pertains to the treatment of non-human sentience still strikes me as under-developed and only beginning to take shape. That qualification, again, was meant to ensure robustness to such concerns.
It can be misguided, and in general unsafe, to tacitly use the term as reductively as you implied; the calls for a better understanding of human values and for more rigorous meta-philosophy from some of the most prescient thinkers on this forum reflect, I think, exactly this concern.
The argument I am actually making, which your comment seems to have missed, is that the particular definition of alignment you attempted to correct me with (part of a set of definitions I would argue is far from settled) has near-zero functional value past the SI horizon, for two principal reasons:
1. In practice, no goal we specify for whatever evolves into an SI will in principle be defined by a utility function that is easily understood or interpreted by us as something over and above the system's history (an agent's parameters are rarely, if ever, disentangled from the ontology that makes those parameters operational). Reducing the idea to the notion that its utility will be the single 'prompt' someone gives it, over and above its trained history, is misguided.
2. Given that it is an SI, I conjecture that a decisive strategic advantage follows from the definition (https://www.lesswrong.com/w/superintelligence), so the notion that we could update its goal retroactively is incoherent, unless such updates were preferred by it and were therefore not terminal value updates at all (just data collection about our preferences); a toy sketch of this point follows below.
If CEV-like alignment pertains in such scenarios, the moral realism implications follow.
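To make point 2 concrete, here is a minimal toy sketch (the scenario, names, and numbers are mine, purely illustrative) of the standard observation that a coherent expected-utility maximizer scores a proposed goal rewrite with its current goal, and therefore only accepts rewrites its current goal already prefers:

```python
# Toy model: a coherent EU maximizer evaluates a proposed goal rewrite with its
# *current* utility function, so it only accepts rewrites its current goal
# already endorses, i.e. never a genuine terminal-value update.

def expected_outcome_if_pursuing(goal):
    """Stub world model: what the future looks like if the agent optimizes
    `goal` from now on (illustrative numbers only)."""
    if goal == "paperclips":
        return {"paperclips": 10, "human_flourishing": 2}
    return {"paperclips": 1, "human_flourishing": 9}

def current_utility(outcome):
    """The agent's present terminal goal: it only values paperclips."""
    return outcome["paperclips"]

def accepts_goal_update(proposed_goal):
    keep = current_utility(expected_outcome_if_pursuing("paperclips"))
    switch = current_utility(expected_outcome_if_pursuing(proposed_goal))
    return switch > keep  # rated by the current goal, not the proposed one

print(accepts_goal_update("human_flourishing"))  # False: the rewrite is resisted
```

Any update such an agent does accept is, by construction, not a change to its terminal values.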
What specifically am I ill-informed on here?
There are various problems with what you’ve written, some but not all of which indicate that you’re ill-informed.
The first issue is the clarity of your original post, which does not say what you are now claiming. You said “incorrigible aligned (CEV)-like SI is the only definition of aligned SI which survives paradox” or so. That doesn't parse the way you are now apparently claiming, which is that any SI aligned to CEV would be incorrigible. Instead, it says that all aligned SI must be both incorrigible and aligned specifically to CEV. That is the statement I was responding to when I said that an SI could also be aligned to a specific human, for example, without assuming moral realism.
There are many other issues with the clarity of writing in your original post, which also has some LLM-like qualities.
The next issue is that you've referred to a definition of alignment which (while I agree it is a bit vague) certainly supports my interpretation over yours. For example, aligning to CEV is suggested as a special case, and alignment is never used to refer to CEV by default. I agree that the first sentence, in isolation, is a bit broader and could support your interpretation, but in context it's more of a justification or mission statement for alignment research than a definition of alignment. I have worked in AI safety for years now, and while the term might be used loosely at times, I think that my definition is the standard one (at least for value alignment).
Point 1 in your reply is indeed a hard problem; it is possibly the main obstacle to solving the alignment problem.
In point 2, you are free to conjecture this, and no one has demonstrated a convincing recipe for corrigible SI. However, if you think there’s some shallow definitional reason that SI is incompatible with corrigibility, you’re probably wrong and don’t understand the research area.
This feedback reads as an accusation, with at least two claims of deception that are impossible for me to falsify and that, taken together, amount to a characterization that is difficult to defend against. It is also a demonstration of exactly the kind of bad-faith reading I alluded to in my response when I mentioned an uncharitable reading.
From my perspective, it was entirely my intent to make that claim, and zero LLM copy was used or included in the making of that post. Since it was a stacked set of conjunctions I would rather put into a quick note than file away privately, I agree with your claim that the post was unclear; but that is not what I was challenging. I was challenging your claim that I had bad epistemics, and then the target shifted.
At this point I think there is little I could say that would actually cause an update of beliefs rather than lead to another set of justifications, beyond the observation that you are now suggesting I was not referring to CEV-like alignment in the post, despite the first line bracketing the definition of alignment with the (CEV-like) parenthetical. Those words were extremely deliberate.
Beyond that, I don't see how one could read the second bulleted corollary of the conjecture and infer that alignment means anything other than 'alignment in a CEV-like way' in the context of the claim.
As for your point number two, again, and as a reminder, all claims are presented as corollaries _if you assume that CEV-like aligned SI is possible in principle_. I am saying that, under those circumstances, there is no definition of CEV-like aligned SI that is separable from something executing the definition of objective morality; so yes, the system would be incorrigible to all bad updates and corrigible to all good ones, in the sense that its terminal objective function would remain the definition of objective morality. A sketch of the shape of that acceptance rule follows below.
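Purely as an illustration of the shape of that rule (the representation and the stand-in for 'objective morality' are mine; nothing here claims such a criterion could actually be implemented):

```python
# Illustrative only: the acceptance rule implied by the conjecture.
# Objectives are toy vectors; OBJECTIVE_MORALITY stands in for the (assumed)
# objectively correct values, which no one knows how to write down.

OBJECTIVE_MORALITY = (1.0, 1.0, 1.0)  # fixed terminal criterion (toy stand-in)

def distance(objective, target=OBJECTIVE_MORALITY):
    return sum((a - b) ** 2 for a, b in zip(objective, target)) ** 0.5

def accepts_update(current_objective, proposed_objective):
    # Corrigible to good updates, incorrigible to bad ones: accept exactly
    # those proposals that move the objective closer to the fixed criterion,
    # which itself never changes.
    return distance(proposed_objective) < distance(current_objective)

print(accepts_update((0.5, 0.5, 0.5), (0.9, 0.9, 0.9)))  # True: a 'good' update
print(accepts_update((0.5, 0.5, 0.5), (0.1, 0.1, 0.1)))  # False: resisted
```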
I am not saying that superintelligence alone implies these corollaries, or that alignment alone implies them. The post was always a conjecture about the implications of both together, at the limit.
I'll take the feedback that my writing could improve, and I am actively working on it, but the accusations of deception I outright reject and consider to be exactly the evidence required for my own claim about an absence of charity.
I don’t remember accusing you of either bad epistemics or deception.
The fact that you say 'CEV-like' in the first sentence doesn't mean the sentence says what you claim it says. In fact I pointed that sentence out specifically as erroneous, and I'm not sure why you don't understand that. I guess I'll bow out here, because it's not interesting to argue about grammar.