Notice I replied to the comment you linked and agreed with John: not that any generalized vector dot-product model is wrong, but that the specific one in that post is wrong because it doesn’t weight by expected probability (i.e., it uses an incorrect distance function).
Anyway, I used that only as a convenient example to illustrate a model which separates degree of misalignment from net impact. My general point does not depend on the details of the model and would still stand for any arbitrarily complex non-linear model.
The general point is that degree of misalignment is only relevant to the extent that it translates into a difference in net utility.
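A minimal sketch of the kind of model I mean (the state space, utility values, and probabilities are all made-up illustrative assumptions, not the model from that post): degree of misalignment as the angle between raw utility vectors, and net impact as the difference in probability-weighted expected utility.

```python
import numpy as np

# Two utility functions represented as payoff vectors over possible world states
# (hypothetical toy numbers).
human_utility = np.array([1.0, 0.5, -0.2, 0.8])
ai_utility = np.array([0.9, 0.4, -0.1, -1.0])

# Probability of actually encountering each state.
state_probs = np.array([0.6, 0.3, 0.09, 0.01])

def misalignment_angle(u, v):
    """Degree of misalignment: angle (degrees) between the raw utility vectors."""
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

def expected_utility_gap(u, v, p):
    """Net impact: difference in probability-weighted (expected) utility."""
    return float(np.dot(p, u) - np.dot(p, v))

print(misalignment_angle(human_utility, ai_utility))       # ≈ 80.6 degrees
print(expected_utility_gap(human_utility, ai_utility, state_probs))  # ≈ 0.1
```

The point the toy numbers illustrate: the two vectors disagree sharply (a large angle, driven here by a state the agents value oppositely), yet because that state is low-probability, the expected-utility gap is small. Large measured misalignment need not translate into large net impact.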
Sure, but if you need a complicated distance metric to describe your space, that makes it correspondingly harder to actually describe utility functions corresponding to vectors within that space which are “close” under that metric.
If you actually believe the sharp left turn argument holds water, where is the evidence?
As I said earlier, this evidence must take a specific form: evidence in the historical record.
Hold on; why? Even for simple cases of goal misspecification, the misspecification may not become obvious without a sufficiently OOD environment; does that thereby mean that no misspecification has occurred?
And in the human case, why does it not suffice to look at the internal motivations humans have, and describe plausible changes to the environment for which those motivations would then fail to correspond even approximately to IGF, as I did w.r.t. uploading?
But I see that as much more contingent than necessarily true, and mainly a consequence of the fact that, for all of our technological advances, we haven’t actually created that many new options preferable to us but not to IGF. On the other hand, something like uploading I would expect to completely shatter any relation our behavior has to IGF maximization.
It seems to me that this suffices to establish that the primary barrier against such a breakdown in correspondence is that of insufficient capabilities—which is somewhat the point!
If you actually believe the sharp left turn argument holds water, where is the evidence?
As I said earlier, this evidence must take a specific form: evidence in the historical record.
Hold on; why? Even for simple cases of goal misspecification, the misspecification may not become obvious without a sufficiently OOD environment;
Given any practical and reasonably aligned agent, there is always some set of conceivable OOD environments where that agent fails. Who cares? There is a single success criterion: utility in the real world! The success criterion is not “is this design perfectly aligned according to my adversarial pedantic critique”.
The sharp left turn argument uses the analogy of brain evolution misaligned to IGF to suggest/argue for doom from misaligned AGI. But brains enormously increased human fitness rather than the predicted decrease, so the argument fails.
In worlds where 1. alignment is very difficult, and 2. misalignment leads to doom (low utility), this would naturally translate into a great filter around intelligence—which we do not observe in the historical record. Evolution succeeded at brain alignment on the first try.
And in the human case, why does it not suffice to look at the internal motivations humans have, and describe plausible changes to the environment for which those motivations would then fail
I think this entire line of thinking is wrong—you have little idea what environmental changes are plausible and next to no idea of how brains would adapt.
On the other hand, something like uploading I would expect to completely shatter any relation our behavior has to IGF maximization.
When you move the discussion to speculative future technology to support the argument from a historical analogy—you have conceded that the historical analogy does not support your intended conclusion (and indeed it cannot, because Homo sapiens is an enormous alignment success).
It sounds like you’re arguing that uploading is impossible, and (more generally) have defined the idea of “sufficiently OOD environments” out of existence. That doesn’t seem like valid thinking to me.
Of course I’m not arguing that uploading is impossible, and obviously there are always hypothetical “sufficiently OOD environments”. But from the historical record so far we can only conclude that evolution’s alignment of brains was robust enough compared to the environmental distribution shift encountered—so far. Naturally that could all change in the future, given enough time, but piling in such future predictions is clearly out of scope for an argument from historical analogy.
These are just extremely different:
an argument from historical observations
an argument from future predicted observations
It’s like I’m arguing that, given that we observed the sequence 0, 1, 3, 7, the pattern is probably 2^N-1, and you are arguing that it isn’t because you predict the next term will be 31.
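To make the arithmetic of the analogy explicit (a trivial sketch; the sequence and rule are just the ones from the analogy above):

```python
def conjectured_pattern(n):
    """The rule fitted to the historical observations: 2^N - 1."""
    return 2**n - 1

observed = [0, 1, 3, 7]  # the "historical record"

# The conjectured rule matches every observation made so far.
assert [conjectured_pattern(n) for n in range(4)] == observed

# The rule's own prediction for the as-yet-unobserved next term:
print(conjectured_pattern(4))  # → 15
```

The disagreement in the analogy is over a predicted future term that no one has observed, versus a rule fitted to everything that has actually been observed.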
Regardless, uploads are arguably sufficiently categorically different that it’s questionable how they even relate to the evolutionary success of Homo sapiens brain alignment to genetic fitness (do sims of humans count for genetic fitness? but only if DNA is modeled in some fashion? to what level of approximation? etc.)
Uploading is impossible because the cat ate the Internet cable again
Would you say it’s … _cat_egorically impossible?