This is what I quoted in the Obliqueness post and elaborated on:
As an example of the computational tractability consideration, sufficiently complex goals may only be well-represented by sufficiently intelligent agents. “Complication” may be reflected in, for example, code complexity; to my mind, the strong form implies that the code complexity of an agent with a given level of intelligence and goals is approximately the code complexity of the intelligence plus the code complexity of the goal specification, plus a constant. Code complexity would influence statistical likelihood for the usual Kolmogorov/Solomonoff reasons, of course.
And I wasn’t thinking it was dependent on “post self alignment”, it could also apply to the construction itself being less natural. It’s possible Eliezer meant something different from what I meant by strong orthogonality, but I hope it is apparent why I and others would interpret it to be a non-trivial claim, rather than a slight variation on weak orthogonality.
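As a rough sketch of the strong-form reading quoted above (the symbols are my own shorthand, not from the original post): write K(·) for code complexity, A for an agent with intelligence component I and goal specification G, and c for a constant. The Solomonoff-style step assumes a prior weight of roughly 2^{-K}.

```latex
% Strong-form reading: code complexity is approximately additive.
K(A) \approx K(I) + K(G) + c

% Under a Solomonoff-style prior, weight falls off as 2^{-K},
% so additive complexity means the prior approximately factorizes:
2^{-K(A)} \approx 2^{-c} \cdot 2^{-K(I)} \cdot 2^{-K(G)}
```

On this reading, a goal that is awkward to express in the agent's ontology would show up as extra complexity beyond K(I) + K(G) + c, which is the kind of “complication” the strong form rules out.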
See measuring intelligence and reverse-engineering goals for some more of my thinking on this. Relative to a given cognitive architecture, ~everyone agrees that there are especially stupid goals; the interesting question is whether there are especially smart goals, and I think probably yes. Hence maybe weak orthogonality would route through variations between architectures (rather than within an architecture) to hit all points (intelligence, goals), and maybe sufficiently high intelligence levels are only compatible with a narrow range of goals (which would perhaps contradict weak orthogonality, but maybe not in an important way, and the general shape of cognitive architecture / goal / intelligence correlation matters more).
I take OP to be disagreeing with strong orthogonality, at least my interpretation of it from the obliqueness post:
Here strong orthogonality looks too neat. It imagines the agent’s ontology updating while its final target remains untouched by the update: if goals are expressed in an ontology, and intelligence changes the ontology, then intelligence and goals are correlated.
This is something I’d agree with: Goals being expressed in the ontology of the cognitive architecture are less complicated than goals expressed in a very different ontology that the agent doesn’t believe in. This seems like a “complication” in the sense of strong orthogonality. (I get maybe you don’t interpret this way based on close reading of the orthogonality post! But nonetheless I think my reading is reasonable.)
It’s not a completely crazy interpretation. I don’t think it’s super reasonable based on the text.
Anyway, maybe there still is a substantive disagreement here. I would claim that
There’s probably such a thing as “understanding alignment”. (Acknowledging that this is very ambiguous.)
It’s likely that IF you understood alignment, THEN for some large class of goals G, you could, if you so chose, then align yourself to G. (You wouldn’t choose to, but that’s not the claim.)
For the G, the difficulty of aligning yourself to G is mainly about evaluating G in the spirit of a utility function over world outcomes (though G doesn’t have to be that). (Some G are more difficult to pursue than others of course, e.g. by making more demands on convergent resources.)
G can include paperclips or whatever. I take the last claim to be more like strong OT.
It sounds like you might disagree, unless this
Goals being expressed in the ontology of the cognitive architecture are less complicated than goals expressed in a very different ontology that the agent doesn’t believe in. This seems like a “complication” in the sense of strong orthogonality.
is mainly about the dynamical aspects? I.e. just saying that “well it’s very unlikely for an alignment-understander to choose to do that”?
I get that you don’t necessarily buy that alignment is a thing, but if the question makes sense, do you think that IF it is a thing, THEN you can do it for a huge class of goals, which includes something that’s well-described as “paperclip maxxing”? (I agree that it’s not straightforwardly unproblematic to discuss “paperclip maxxing”; my assertion is that included in this class are plenty of Gs that would match what I’m trying to talk about by “paperclip maxxing”, and would result in the universe being filled with things that we could reasonably agree are paperclips.)
There’s probably such a thing as “understanding alignment”. (Acknowledging that this is very ambiguous.)
I guess, idk?
It’s likely that IF you understood alignment, THEN for some large class of goals G, you could, if you so chose, then align yourself to G. (You wouldn’t choose to, but that’s not the claim.)
The “if” here seems likely. It might be that some designs permit larger classes than others. Unclear how big the classes are. Idk about the details.
For the G, the difficulty of aligning yourself to G is mainly about evaluating G in the spirit of a utility function over world outcomes (though G doesn’t have to be that). (Some G are more difficult to pursue than others of course, e.g. by making more demands on convergent resources.)
I don’t know, seems like that depends on the alignment understanding, cognitive architecture of successor agent, etc. Cognitive architecture and ontology would constrain type signatures for utility functions. And maybe the effective cognitive architectures don’t factor nicely. Idk.
I.e. just saying that “well it’s very unlikely for an alignment-understander to choose to do that”?
Quoting myself again on how I’m interpreting “complication”:
As an example of the computational tractability consideration, sufficiently complex goals may only be well-represented by sufficiently intelligent agents. “Complication” may be reflected in, for example, code complexity; to my mind, the strong form implies that the code complexity of an agent with a given level of intelligence and goals is approximately the code complexity of the intelligence plus the code complexity of the goal specification, plus a constant. Code complexity would influence statistical likelihood for the usual Kolmogorov/Solomonoff reasons, of course.
So it’s not just the dynamical aspects of “this would be unlikely to be built”. At the risk of repeating myself, the “measuring intelligence and reverse-engineering goals” post discusses some non-dynamical aspects as well.
So I’m not just saying “for dynamical reasons”; my answer to your statement is “idk”, and my posts (obliqueness & reverse-engineering) go into more detail on what I think.
that’s a characteristic I’ve always found peculiar about your posts: the boost in understanding the thesis that the act of reading them affords.
i can understand, however, how that wouldn’t be a critic’s first guess