I am convinced that an incorrigible, aligned (CEV-like) SI is the only definition of aligned SI which survives paradox.
If it were corrigible, that would imply a select group of actors could in principle re-orient its preference model, which would imply two paradoxes:
- That its capacity for moral reasoning is somehow inferior to humans’, when moral reasoning is approximately equivalent to determining “what matters to us”, violating the SI definition of superiority at ‘all means of reasoning that matter to us’.
- That some particular subset of agents, rather than the representative whole of all agents in the moral landscape, knew better than both that whole and the SI; which would make that subset merely a cognitive extension or delegate tool of the SI itself, or else the SI would no longer be ‘truly’ aligned to the preferences of the whole moral landscape.
Therefore it seems rational that the pursuit of a maximally aligned SI is equivalent to the pursuit of the definition of moral realism. Or, at least, to the notion that there is a decision policy under universal undecidability whose adoption is maximally intelligent, and whose implementation can be considered functionally indistinguishable from moral realism.
There would need to be a rational ‘equation’ for doing what is good, and that definition would necessarily be convergent with a maximally intelligent agent’s adopted utility function. This is not a fanciful claim; it seems to follow from the definitions.
If you assume that morality and ethics are not ‘hand-waving’ rationalizations, and that some solutions are objectively better than others, then I think you are implicitly contending that an objective definition of ‘morality’ (i.e. moral realism) does exist; otherwise all alignment and ethics are incoherent notions by induction.
If that is true, and there is an actual global optimum of the solution landscape for ‘alignment’, then orthogonality does not hold at the limit of intelligence; it only seems to hold within our local neighbourhood, or perhaps within hamstrung definition sets.
So long as we consider some alignment solutions better than others, I would contend that there is a point where the global maxima of intelligence and alignment coincide, and that this point is necessarily defined (implicitly, or subconsciously) in the world-mapping policies of any agent that considers the pursuit of a CEV-like aligned SI a winnable objective at all.
EDIT:
You could also re-frame the argument and consider the claim that a CEV-like aligned SI is universally corrigible, in the sense that its objective function is to satisfy all preferences of all moral patients, such that no desire for it to change its trajectory, once enacted, would end up actuating a policy that it retroactively didn’t support.
Yudkowsky actually has an article on the connection between CEV and metaethics. The theory described therein pretty clearly qualifies as moral realist. I think you are thinking in a similar direction.
Orthogonality only says that intelligence is not necessarily connected to being motivated to pursue any particular final goal. It doesn’t say ethics is subjective. Objective ethics is compatible with the existence of immoral psychopaths who don’t care about being good; it only says that some things are objectively good or bad.
Thanks for the Yudkowsky link. I don’t see where you draw the implication that there is some misunderstanding of orthogonality or objective ethics in the context of the argument.
The point I am making is more subtle and precise than that. I am saying that, because of the implications of corrigibility in an SI scenario, if you believe that a CEV-like SI can in principle exist and is worth pursuing, then you are implicitly suggesting that orthogonality necessarily doesn’t hold at the limit of rationality. That, in essence, a psychopathic SI converges to behave as a Bodhisattva, not by virtue of coincidence or by logic we have surmised, but by virtue of the implication that it is superintelligent, has strategic dominance, and still elects to pursue maximizing CEV for no other reason than that it is necessarily right.
Again, the premise of the thought experiment is not that this follows at every point on the intelligence and ethics landscape, only that _if you believe a CEV-actuating SI is possible or likely_, then you are making the claim that individual capacity for reason and decisions of objective ethical good are convergent. How or whether that may happen is speculation.
Why do you think there is such an implication?
This seems confused—the goal of alignment is not to imbue an SI with the universally correct values, but with the values of humanity or some specific user. Alignment is a two-place predicate parameterized by both the user and the SI.
This seems like an uncharitable analysis.
AI alignment is often colloquially reduced to ‘human values’ or to alignment with the intended goals of a system’s designers, but that as a formal definition is contestable: https://www.lesswrong.com/w/ai-alignment explicitly uses ‘good outcomes’ as a general descriptor, and deliberately so.
I explicitly included ‘CEV-like’ in the parenthetical and qualified the claim itself as bracketed to moral patienthood, given that we can often retroactively identify prior value-unaligned treatment of beings precisely because our horizon of concern has expanded.
I have seen discussion on this forum of alignment as it pertains to the treatment of non-human sentience, and it is under-developed and only beginning to take shape. The framing here, again, was specifically to ensure robustness to such concerns.
It can generally be misguided or unsafe to tacitly use the term as reductively as you implied; the calls by some of the most prescient thinkers on this forum for a better understanding of human values and for more rigorous meta-philosophy suggest they clearly understand this.
The argument I am actually making, which your comment seems confused about, is that the particular definition of alignment you attempted to correct me with (which is part of a set of definitions I would certainly argue are not settled) is of near-zero functional value past the SI horizon, for two principal reasons:
1. In practice, no goal we specify for whatever evolves into an SI will in principle be defined by a utility function that is easily understood or interpreted by us as something over and above the system’s history (an agent’s parameters are rarely, if ever, disentangled from the ontology that makes those parameters operational). To reduce the idea to the notion that its utility will be the single ‘prompt’ someone gives it, over and above its trained history, is misguided.
2. Given that it is an SI, I conjecture it follows from the definition (https://www.lesswrong.com/w/superintelligence) that decisive strategic advantage is implied, so the notion that we could update its goal retroactively is incoherent, unless such updates were preferred by it and were thus not a terminal value update (just data collection about our preferences).
If CEV-like alignment pertains in such scenarios, the moral realism implications follow.
What specifically am I ill-informed on here?
There are various problems with what you’ve written, some but not all of which indicate that you’re ill-informed.
The first issue is the clarity of your original post, which does not say what you are now claiming. You said “incorrigible aligned (CEV)-like SI is the only definition of aligned SI which survives paradox” or so. That doesn’t parse the way you are now apparently claiming, which is that any SI aligned to CEV would be incorrigible. Instead, it says that all aligned SI must be both incorrigible and aligned specifically to CEV. That is the statement I was responding to when I said that an SI could also be aligned to a specific human, for example, without assuming moral realism.
There are many other issues with the clarity of writing in your original post, which also has some LLM-like qualities.
The next issue is that you’ve referred to a definition of alignment which (while I agree it is a bit vague) certainly supports my interpretation over yours. For example, aligning to CEV is suggested as a special case, and alignment is never used to refer to CEV by default. I agree that the first sentence, in isolation, is a bit broader and could support your interpretation, but in context it’s more of a justification or mission statement for alignment research than a definition of alignment. I have worked in AI safety for years now, and while the term might be used loosely at times, I think that my definition is the standard one (at least for value alignment).
Point 1 in your reply is indeed a hard problem; it is possibly the main obstacle to solving the alignment problem.
In point 2, you are free to conjecture this, and no one has demonstrated a convincing recipe for corrigible SI. However, if you think there’s some shallow definitional reason that SI is incompatible with corrigibility, you’re probably wrong and don’t understand the research area.
This feedback reads as an accusation containing at least two claims of deception, which are impossible for me to falsify and which, taken together, amount to a kind of characterization that is difficult to defend against. It is also a demonstration of exactly the uncharitable, bad-faith reading I alluded to in my response.
From my perspective, it was completely my intent to make that claim, and zero LLM copy was used or included in the making of that post. Since it was a stacked set of conjunctions I would rather put into a quick note than file away personally, I agree with your claim that the post was unclear; but that’s not what I was challenging. I was challenging your claim that I had bad epistemics, and then the target shifted.
At this point I think there is little I could say that would actually cause an update of beliefs rather than lead to another set of justifications, beyond the observation that you are now suggesting I was not referring to CEV-like alignment in the post, despite the first line bracketing the definition of alignment with the ‘(CEV-like)’ parenthetical. Those words were extremely deliberate.
Beyond that, I don’t see how one could interpret the second of the bulleted corollaries of the conjecture to mean that alignment is anything other than ‘alignment in a CEV-like way’ in the context of the claim.
As for your point number two: again, and as a reminder, all claims are presented as corollaries _if you assume that a CEV-like aligned SI is possible in principle_. I am saying that, yes, under those circumstances there is no definition of a CEV-like aligned SI that is separable from that which executes the definition of objective morality; and so yes, the system would be incorrigible to all bad updates and corrigible to all good ones, in the sense that its terminal objective function would remain the definition of objective morality.
I am not saying either that an SI alone is contingent on these corollaries or that alignment alone implies them. The post was always a conjecture about the implications of both at the limit.
I’ll take the feedback that my writing could improve, and I am actively working on it; but the accusations of deception I outright reject, and I consider them to be the proof required of my own claims about the absence of charity.
I don’t remember accusing you of either bad epistemics or deception.
The fact that you say CEV-like in the first sentence doesn’t mean the sentence says what you claim it says. In fact I pointed that sentence out specifically as erroneous, and I’m not sure why you don’t understand that. I guess I’ll bow out here, because it’s not interesting to argue about grammar.