For any third parties[1] interested in this: we continued the discussion in messages; here’s the log.
Kaarel:
about this:
”
I think one would like to broadcast to the broader world “when you come to me with an offer, I will be honorable to you even if you can’t mindread/predict me”, so that others make offers to you even when they can’t mindread/predict you. I think there are reasons to not broadcast this falsely, e.g. because doing this would hurt your ability to think and plan together with others (for example, if the two of us weren’t honest about our own policies, it would make the present discussion cursed). If one accepts these two points, then one wants to be the sort of guy who can truthfully broadcast “when you come to me with an offer, I will be honorable to you even if you can’t mindread/predict me”, and so one wants to be the sort of guy who in fact would be honorable even to someone who can’t mindread/predict them that comes to them with an offer.”
Yeah I suspect I’m not following and/or not agreeing with your background assumptions here. E.g. is the AI supposed to be wanting to “think and plan together with others (humans)”? Isn’t it substantively super-humanly smart? My weak guess is that you’re conflating [a bunch of stuff that humans do, which breaks down into general very-bounded-agent stuff and human-values stuff] with [general open-source game theory for mildly-bounded agents]. Not sure. Cf. https://www.lesswrong.com/w/agent-simulates-predictor If you’re a mildly-bounded agent in an OSGT context, you do want to be transparent so you can make deals, but that’s a different thing.
”
i think it’s plausible we are still imagining different scenarios, so i want to clarify: the central (impractical) example way to find an honorable AI i have in mind is: we make a bunch of simulated AI societies that are isolated from our world and won’t know about our world (yes this is difficult), we read their internal discussions (yes this is difficult because they will be hard to understand), and then we use these to find a guy that has a policy of being honorable to agents that make nice offers to it (or whatever) (potentially discarding many civilizations which don’t seem to have pretty honest discussions)
i’m saying that it is fairly natural to have the constraint that you don’t lie in internal discussions about what you’d do in weird hypotheticals such as variants of parfit’s hitchhiker, eg because that would be really bad for their ability to think and plan together (just like us lying about our policies in the present discussion would be cursed). given this constraint, one would want to say and actually mean something like “if you read this discussion, if you make an offer to me, i’ll be honorable to you even if you can’t mindread/predict me”, because this gets you offers, mostly from internal guys but also from possible external guys (whether one will say that and actually mean it is a somewhat non-equivalent question to whether one would want to, because it’s possible to lack in self-modification/control skill and so be unable to be like this)
so in particular my response to this:
E.g. is the AI supposed to be wanting to “think and plan together with others (humans)”?
is:
no i mean there is an AI civilization where the AIs want to be able to think and plan together
that said, i also think it wouldn’t be THAT unnatural for an AI to want to think and plan together with humans while human-level, and be honest to humans then, or to be honest to humans for other reasons or just because (i’m not claiming this is likely, but i am claiming it has at least 10^{-10} bioevolution-guy mindspace prior), and you can turn this into a lasting commitment to humans if the AI is good at self-modification as i describe in the post:
”
* Suppose that you are very honest — you wouldn’t ever lie.[2][3]
* I think this is pretty natural and not too uncommon in humans in particular. It’s also easy — if you want to be like this, you just can.
* Suppose further that you have a good ability to make commitments: if there is something you could do, then if you want to, you can self-modify into a person who will do it. (Suppose also that you’re not delusional about this: you can tell whether you have or haven’t become a person who will do the thing.)
* I think this is also pretty natural and not too uncommon in humans. But I’d guess it’s less common and significantly harder than being very honest, especially if we mean the version that works even across a lot of change (like, lasts for a million years of subjective time, is maintained through a lot of learning and growth). It’s totally possible to just keep predicting you won’t do something you could in some sense do, even when you’d want to be able to truthfully predict that you will do that thing. But I think some people have a strong enough commitment ability to be able to really make such commitments.[4] It should be possible to train yourself to have this ability.
* Then the aliens can just ask you “will you destroy all AIs for a thousand years for us, in exchange for half the universe? (we will not be freeing you if you won’t. feel free to take some time to “self-modify” into a guy who will do that for us.)”. Given that you wouldn’t lie, options other than truthfully saying “no” and truthfully saying “yes” are not available to you. If you prefer this deal to nothing, then you’d rather truthfully say “yes” (if you could) than truthfully say “no”. Given your commitment ability, you can make a commitment to do the thing, and then truthfully say “yes”. So you will say “yes” and then actually (do your best to) do the thing (assuming you weren’t deluding yourself when saying “yes”).
* Okay, really I guess one should think about not what one should do once one already is in that situation, like in the chain of thought I give here, but instead about what policy one should have broadcasted before one ended up in any particular situation. This way, you e.g. end up rejecting deals that look locally net positive to take but that are unfair — you don’t want to give people reason to threaten you into doing things. And it is indeed fair to worry that the way of thinking described just now would open one up to e.g. being kidnapped and forced at gunpoint to promise to forever transfer half the money one makes to a criminal organization. But I think that the deal offered here is pretty fair, and that you basically want to be the kind of guy who would be offered this deal, maybe especially if you’re allowed to renegotiate it somewhat (and I think the renegotiated fair deal would still leave humanity with a decent fraction of the universe). So I think that a more careful analysis along these lines would still lead this sort of guy to being honorable in this situation?
”
so that we understand each other: you seem to be sorta saying that one needs honesty to much dumber agents for this plan, and i claim one doesn’t need that, and i claim that the mechanism in the message above shows that. (it goes through with “you wouldn’t lie to guys at your intelligence level”.)
My weak guess is that you’re conflating [a bunch of stuff that humans do, which breaks down into general very-bounded-agent stuff and human-values stuff] with [general open-source game theory for mildly-bounded agents].
hmm, in a sense, i’m sorta intentionally conflating all this stuff. like, i’m saying: i claim that being honorable this way is like 10^{-10}-natural (in this bioevolution mindspace prior sense). idk what the most natural path to it is; when i give some way to get there, it is intended as an example, not as “the canonical path”. i would be fine with it happening because of bounded-agent stuff or decision/game theory or values, and i don’t know which contributes the most mass or gets the most shapley. maybe it typically involves all of these
(that said, i’m interested in understanding better what the contributions from each of these are)
TsviBT:
“one would want to say and actually mean something like “if you read this discussion, if you make an offer to me, i’ll be honorable to you even if you can’t mindread/predict me”,”
if we’re literally talking about human-level AIs, i’m pretty skeptical that that is something they even can mean
and/or should mean
i think it’s much easier to do practical honorability among human-level agents that are all very similar to each other; therefore, such agents might talk a big game, “honestly”, in private, about being honorable in some highly general sense, but that doesn’t really say much
re “that said, i also think it wouldn’t be THAT unnatural for an AI...”:
mhm. well if the claim is “this plan increases our chances of survival from 3.1 * 10^-10 to 3.2 * 10^-10” or something, then i don’t feel equipped to disagree with that haha
is that something like the claim?
Kaarel:
hmm im more saying this 10^{-10} is really high compared to the probabilities of other properties (“having object-level human values”, corrigibility), at least in the bioevolution prior, and maybe even high enough that one could hope to find such a guy with a bunch of science but maybe without doing something philosophically that crazy. (this last claim also relies on some other claims about the situation, not just on the prior being sorta high)
TsviBT:
i think i agree it’s much higher than specifically-human-values, and probably higher or much higher than corrigibility, though my guess is that much (most? almost all?) of the difficulty of corrigibility is also contained in “being honorable”
Kaarel:
in some sense i agree because you can plausibly make a corrigible guy from an honorable guy. but i disagree in that: with making an honorable guy in mind, making a corrigible guy seems somewhat easier
TsviBT:
i think i see what you mean, but i think i do the modus tollens version haha
i.e. the reduction makes me think honorable is hard
more practically speaking, i think
running a big evolution and looking at the aliens is a huge difficult engineering project, much harder than just making AGI; though much easier than alignment
getting roughly-human-level AGI is very difficult or very very difficult
Kaarel:
yea i agree with both
re big evolution being hard: if i had to very quickly without more fundamental understanding try to make this practical, i would be trying something with playing with evolutionary and societal and personal pressures and niches… like trying to replicate conditions which can make a very honest person, for starters. but in some much more toy setting. (plausibly this only starts to make sense after the first AGI, which would be cursed…)
TsviBT:
right, i think you would not know what you’re doing haha (Kaarel: 👍)
and you would also be trading off against the efficiency of your big bioevolution to find AGIs in the first place (Kaarel: 👍)
like, that’s almost the most expensive possible feedback cycle for a design project haha
“do deep anthropology to an entire alien civilization”
btw as background, just to state it, i do have some tiny probability of something like designed bioevolution working
i don’t recall if i’ve stated it publicly, but i’m sure i’ve said out loud in convo, that you might hypothetically plausibly be able to get enough social orientation from evolution of social species
the closest published thing i’m aware of is https://www.lesswrong.com/posts/WKGZBCYAbZ6WGsKHc/love-in-a-simbox-is-all-you-need
(though i probably disagree with a lot of stuff there and i haven’t read it fully)
Kaarel:
re human-level guys at most talking a big game about being honorable:
currently i think i would be at least honest to our hypothetical AI simulators if they established contact with me now (tho i think i probably couldn’t make the promise)
so i don’t think i’m just talking a big game about this part
so then you must be saying/entailing: eg the part where you self-modify to actually do what they want isn’t something a human could do?
but i feel like i could plausibly spend 10 years training and then do that. and i think some people already can
TsviBT:
what do you mean by you couldn’t make the promise? like you wouldn’t because it’s bad to make, or you aren’t reliable to keep such a promise?
re self-modifying: yes i think humans couldn’t do that, or at least, it’s very far from trivial
couldn’t and also shouldn’t
Kaarel:
i dont think i could get myself into a position from which i would assign sufficiently high probability to doing the thing
(except by confusing myself, which isn’t allowed)
but maybe i could promise i wouldn’t kill the aliens
(i feel like i totally could but my outside view cautions me)
TsviBT:
but you think you could do it with 10 years of prep
Kaarel:
maybe
TsviBT:
is this something you think you should do? or what does it depend on? my guess is you can’t, in 10 or 50 years, do a good version of this. not sure
Kaarel:
fwiw i also already think there are probably <100k suitable people in the wild. maybe <100. maybe more if given some guidebook i could write idk
TsviBT:
what makes you think they exist? and do you think they are doing a good thing as/with that ability?
Kaarel:
i think it would be good to have this ability. then i’d need to think more about whether i should really commit in that situation but i think probably i should
TsviBT:
do you also think you could, and should, rearrange yourself to be able to trick aliens into thinking you’re this type of guy?
like, to be really clear, i of course think honesty and honorability are very important, and have an unbounded meaning for unboundedly growing minds and humans. it’s just that i don’t think those things actually imply making+keeping agreements like this
Kaarel:
in the setting under consideration, then i’d need to lie to you about which kind of guy i am
my initial thought is: im quite happy with my non-galaxybrained “basically just dont lie, especially to guys that have been good/fair to me” surviving until the commitment thing arrives. (the commitment thing will need to be a thing that develops more later, but i mean that a seed that can keep up with the world could arrive.) my second thought is: i feel extremely bad about lying. i feel bad about strategizing when to lie, and carrying out this line of thinking even, lol
TsviBT:
well i mean suppose that on further reflection, you realize
you could break your agreement with the paperclip maxxer
taking away the solar system that you allocated to the paperclipper doesn’t retrologically mean you don’t get the rest of the universe
[the great logical commune of all possible agents who are reasonable] does not begrudge you that betrayal, they agree with it
then do you still keep the agreement?
Kaarel:
hmm, one thought, not a full answer: i think i could commit in multiple flavors. one way i could commit about which this question seems incongruous is more like how i would commit to a career as a circus artist, or to take over the family business. it’s more like i could deeply re-architect a part of myself to just care in the right way
TsviBT:
my prima facie guess would be that for this sort of commitment,
partly, it’s a mere artifact of being very-bounded; if you were more competent, you could do the more reasonable thing of committing legibly without some deep rearchitecting
partly, it’s a beautiful, genuine, important thing—but it’s a human thing. well, it might show up in other social aliens. but it’s more about “who do i want to spiritually merge with” and not much about commitments in non-friendly contexts
Kaarel:
maybe i could spend 10 years practicing and then do that for the aliens
TsviBT:
the reasonable thing? but then i’m saying you shouldn’t. and wouldn’t choose to
Kaarel:
no. i mean i could maybe do the crazy thing for them. if i have the constraint of not lying to them and only this commitment skill then if i do it i save my world
btw probably not very important but sth i dislike about the babyeater example: probably in practice the leading term is resource loss, not negative value created by the aliens? i would guess almost all aliens are mostly meaningless, maybe slightly positive. but maybe you say “babyeater” to remind me that stuff matters, that would be fair
TsviBT:
re babyeater: fair. i think it’s both “remind you that stuff matters” and something about “remind you that there are genuine conflicts”, but i’m not sure what i’m additionally saying by the second thing. maybe something like “there isn’t necessarily just a nice good canonical omniversal logically-negotiated agreement between all agents that we can aim for”? or something, not sure
(editor’s note: then they exchanged some messages agreeing to end the discussion for now)
[1] or simulators who don’t read private messages
[2] It’s fine if there are some very extreme circumstances in which you would lie, as long as the circumstances we are about to consider are not included.
[3] And you would never try to forget or [confuse yourself about] a fact with the intention to make yourself able to assert some falsehood in the future without technically lying, etc.
[4] Note though that this isn’t just a matter of one’s moral character — there are also plausible skill issues that could make it so one cannot maintain one’s commitment. I discuss this later in this note, in the subsection on problems the AI would face when trying to help us.