I think my bigger point is that you don’t seem to make any real argument as to which case we are in. For example, consider the following model of how people’s perception of my trustworthiness might be correlated with my actual trustworthiness:
There are two causal chains:
My values → Things I say → People’s perceptions
My values → My actions
So if I value trustworthiness, I will not, for example, talk much about wanting to avoid being a sucker (in contexts where that would refer to doing trustworthy things). This will influence people’s perceptions of whether or not I am trustworthy. Furthermore, if I do value trustworthiness, I will want to be trustworthy.
This setup makes things look very much like the smoking lesion problem. A CDT agent that values trustworthiness will be trustworthy because it places intrinsic value on it. A CDT agent that does not value trustworthiness will be perceived as untrustworthy. Simply changing its actions will not alter this perception, so it will act untrustworthily whenever doing so benefits it, and in this model that is the correct decision.
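To make the smoking-lesion structure concrete, here is a minimal simulation sketch of the two chains. Everything in it is an assumption for illustration: the names `simulate` and `perceived_rate`, the 50% share of trust-valuing agents, and the `leak` probability (how often speech reveals values) are all made up. It only shows that actions and perceptions can be strongly correlated while forcing the action changes nothing about the perception.

```python
import random

def simulate(n=10_000, seed=0, leak=0.9, forced_action=None):
    """Return rows of (values_trust, acts_trustworthy, perceived_trustworthy).

    Hypothetical toy model: `leak` is the assumed chance that what the
    agent says reveals its real values.
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        values_trust = rng.random() < 0.5  # assumed 50/50 population
        # Chain 1: values -> things I say -> people's perceptions.
        speech_reveals_values = rng.random() < leak
        perceived = values_trust if speech_reveals_values else not values_trust
        # Chain 2: values -> actions, unless we intervene on the action.
        acts = values_trust if forced_action is None else forced_action
        rows.append((values_trust, acts, perceived))
    return rows

def perceived_rate(rows, keep):
    """Fraction of the kept rows that are perceived as trustworthy."""
    kept = [r for r in rows if keep(r)]
    return sum(r[2] for r in kept) / len(kept)

natural = simulate()                   # actions follow values
forced = simulate(forced_action=True)  # do(action = trustworthy)

# Observationally, acting trustworthily "predicts" being perceived that way.
print("P(perceived | acts)      :", perceived_rate(natural, lambda r: r[1]))
print("P(perceived | not acts)  :", perceived_rate(natural, lambda r: not r[1]))

# Intervening on the action leaves perception of non-valuers unchanged.
print("non-valuers, natural     :", perceived_rate(natural, lambda r: not r[0]))
print("non-valuers, forced acts :", perceived_rate(forced, lambda r: not r[0]))
```

The last two printed numbers are identical by construction, since perception never reads the action in this model. That is exactly the assumption in dispute: if people also observe actions, the model (and the conclusion) changes.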
Now you might try to break the causal link:
My values → Things that I say
And doing so is certainly possible (I mean you can have spies that successfully pretend to be loyal for extended periods without giving themselves away). On the other hand, it might not happen often for several possible reasons:
A) Maintaining a facade at all times is exhausting (and thus imposes high costs)
B) Lying consistently is hard (as in too computationally expensive)
C) The right way to lie consistently is to simulate the altered value set, but this may actually lead to changing your values (standard advice for becoming more confident is to pretend to be confident, right?).
So yes, in this model a non-trust-valuing, self-modifying CDT agent will self-modify, but it will need to self-modify its values rather than its decision theory. Adopting a decision theory that acts trustworthily without intrinsically valuing trustworthiness doesn’t help.