Humans might respect the preferences of weak agents right now, but if they thought about it for longer they’d pretty robustly just want to completely destroy the existing agents (including a hypothetical alien creator) and replace them with something better. No reason to honor that kind of arbitrary path dependence.
No, this doesn’t feel accurate. What I am saying is more something like:
The way humans think about the question of “preferences for weak agents” and “kindness” feels like the kind of thing that will come apart under extreme optimization, in a similar way to how I expect the idea of “having a continuous stream of consciousness with a good past and good future is important” to come apart as humans can make copies of themselves and change their memories, and instantiate slightly changed versions of themselves, etc.
The way this comes apart seems very chaotic to me, and dependent enough on the exact metaethical and cultural and environmental starting conditions that I wouldn’t be that surprised if I disagree even with other humans on their resulting conceptualization of “kindness” (and e.g. one endpoint might be that I end up not having a special preference for currently-alive beings, but there are thousands, maybe millions of ways for this concept to fray apart under optimization pressure).
In other words, I think it’s plausible that at something like human level of capabilities and within a roughly human ontology (which AIs might at least partially share, though how much is quite uncertain to me), the concept of kindness as assigning value to the extrapolated preferences of beings that currently exist might be a thing that an AI could share. But I expect it to not hold up under reflection, and much greater power, and predictable ontological changes (that I expect any AI to go through as it reaches superintelligence), so that the resulting reflectively stable and optimized idea of kindness will not meaningfully result in current humans’ genuine preferences being fulfilled (by my own lights of what it means to extrapolate and fulfill someone’s preferences). The space of possibilities in which this concept could fray apart seems quite great, and many of the endpoints are unlikely to align with my endpoints of this concept.
Edit (some more thoughts): The thing you said feels related to that in that I think my own pretty huge uncertainty about how I will relate to kindness on reflection is evidence that I think iterating on that concept will be quite chaotic and different for different minds.
I do want to push back on “in favor of utilitarian optimization”. That is not what I am saying, or at least it feels somewhat misleading.
I am saying that I think it’s pretty likely that upon reflection I no longer think that my “kindness” goals are meaningfully achieved by caring about the beings alive in 2022, and that it would be more kind, by my own lights, to not give special consideration to beings who happened to be alive right now. This isn’t about “trading off kindness in favor of utilitarian optimization”. It’s saying that when you point towards the thing in me that generates an instinct towards kindness, I can imagine that as I more fully realize what that instinct cashes out to in terms of preferences, it will not result in actually giving consideration to e.g. rats that are currently alive, or will give consideration to some archetype of a rat that is actually not really that much like a rat, because I don’t even really know what it means for a rat to want something. Similarly, the way the AI relates to the question of “do humans want things” will feel underdetermined (and again, these are just concrete examples of how the concept could come apart, not an exhaustive list of ways the concept could fall apart).
I think some of the confusion here comes from my using “kind” to refer to “respecting the preferences of existing weak agents.” I don’t have a better handle, but could have just used a made-up word.
I don’t quite understand your objection to my summary—it seems like you are saying that notions like “kindness” (that might currently lead you to respect the preferences of existing agents) will come apart and change in unpredictable ways as agents deliberate. The result is that smart minds will predictably stop respecting the preferences of existing agents, up to and including killing them all to replace them with something that more efficiently satisfies other values (including whatever kind of form “kindness” may end up taking, e.g. kindness towards all the possible minds who otherwise won’t get to exist).
I called this utilitarian optimization but it might have been more charitable to call it “impartial” optimization. Impartiality between the existing creatures and the not-yet-created creatures seems like one of the key characteristics of utilitarianism while being very rare in the broader world. It’s also “utilitarian” in the sense that it’s willing to spare nothing (or at least not 1/trillion) for the existing creatures, and this kind of maximizing stance is also one of the big defining features of utilitarianism. So I do still feel like “utilitarian” is an OK way of pointing at the basic difference between where you expect intelligent minds will end up vs how normal people think about concepts like being nice.
Yeah, sorry, I noticed the same thing a few minutes ago, that I was probably at least somewhat misled by the more standard meaning of kindness.
Tabooing “kindness” I am saying something like:
Yes, I don’t think extrapolated current humans assign approximately any value to the exact preference of “respecting the preferences of existing weak agents” and I don’t really believe that you would on-reflection endorse that preference either.
Separately (though relatedly), each word in that sentence sure feels like the kind of thing that I do not feel comfortable leaning on heavily as I optimize strongly against it, and that hides a ton of implicit assumptions, like ‘agent’ being a meaningful concept in the first place, or ‘existing’ or ‘weak’ or ‘preferences’, all of which I expect I would think are probably terribly confused concepts to use after I had understood the real concepts that carve reality more at its joints. This means the sentence sounds deceptively simple or robust, but really doesn’t feel like the kind of thing whose meaning will stay simple as an AI does more conceptual refinement.
The reason why I objected to this characterization is that I was trying to point at a more general thing than the “impartialness”. Like, to paraphrase what this sentence sounds like to me, it’s more as if someone from a pre-modern era was arguing about future civilizations and said “It’s weird that the future humans you are imagining are willing to do nothing for the gods that live in the sky, and the spirits that make our plants grow”.
Like, after a bunch of ontological reflection and empirical data gathering, “gods” is just really not a good abstraction for things I care about anymore. I don’t think “impartiality” is what is causing me to not care about gods; it’s just that the concept of “gods” seems fake and doesn’t carve reality at its joints anymore. It’s also not the case that I don’t care at all about ancient gods anymore (they are pretty cool and I like the aesthetic), but the way I care about them is very different from how I care about other humans.
Not caring about gods doesn’t feel “harsh” or “utilitarian” or in some sense like I have decided to abandon any part of my values. This is what I expect it to feel like for a future human to look back at our meta-preferences for many types of other beings, and also what it feels like for AIs that maybe have some initial version of ‘caring about others’ when they are at similar capability levels to humans.
This again isn’t capturing my objection perfectly, but maybe helps point to it better.
I am quite confident that I do endorse that preference, and it tends to infuriate my friends, who get cranky that I feel a moral obligation to respect the artistic intent of bacterial genomes: all bacteria should go vegan, yet survive, and eat food equivalent to their previous diet.
I feel pretty uncertain of what assumptions are hiding in your “optimize strongly against X” statements. Historically this just seems hard to tease out, and I wouldn’t be surprised if I were just totally misreading you here.
That said, your writing makes me wonder “where is the heavy optimization [over the value definitions] coming from?”, since I think the preference-shards themselves are the things steering the optimization power. For example, the shards are not optimizing over themselves to find adversarial examples to themselves. Related statements:
I think that a realistic “respecting preferences of weak agents”-shard doesn’t bid for plans which maximally activate the “respect preferences of weak agents” internal evaluation metric, or even do some tight bounded approximation thereof.
A “respect weak preferences” shard might also guide the AI’s value and ontology reformation process.
A nice person isn’t being maximally nice, nor do they wish to be; they are nicely being nice.
I do agree (insofar as I understand you enough to agree) that we should worry about some “strong optimization over the AI’s concepts, later in AI developmental timeline.” But I think different kinds of “heavy optimization” lead to different kinds of alignment concerns.
When I try to interpret your points here, I come to the conclusion that you think humans, upon reflection, would cause human extinction (in favor of resources being used for something else).
Or at least that many/most humans would, upon reflection, prefer resources to be used for purposes other than preserving human life (including not preserving human life in simulation). And this holds even if (some of) the existing humans ‘want’ to be preserved (at least according to a conventional notion of preferences).
I think this empirical view seems pretty implausible.
That said, I think it’s quite plausible that upon reflection, I’d want to ‘wink out’ any existing copies of myself in favor of using resources for better things. But this is partially because I personally (in my current state) would endorse such a thing: if my extrapolated volition thought it would be better to not exist (in favor of other resource usage), my current self would accept that. And, I think it currently seems unlikely that upon reflection, I’d want to end all human lives (in particular, I think I probably would want to keep humans alive who had preferences against non-existence). This applies regardless of trade; it’s important to note this to avoid a ‘perpetual motion machine’ type argument.
Beyond this, I think that most or many humans or aliens would, upon reflection, want to preserve currently existing humans or aliens who had a preference against non-existence. (Again, regardless of trade.)
Additionally, I think it’s quite plausible that most or many humans or aliens will enact various trades or precommitments prior to reflecting (which is probably ill-advised, but it will happen regardless). So current preferences which aren’t stable under reflection might have a significant influence overall.
This feels like it is not really understanding my point, though maybe best to move this to some higher-bandwidth medium if the point is that hard to get across.
Giving it one last try: What I am saying is that I don’t think “conventional notion of preferences” is a particularly well-defined concept, and neither are a lot of other concepts you are using in order to make your predictions here. What it means to care about the preferences of others is a thing with a lot of really messy details that tend to blow up in different ways when you think harder about it and are less anchored on the status-quo.
I don’t think you currently know in what ways you would care about the preferences of others after a lot of reflection (barring game-theoretic considerations which I think we can figure out a bit more in-advance, but I am bracketing that whole angle in this discussion, though I totally agree those are important and relevant). I do think you will of course endorse the way you care about other people’s preferences after you’ve done a lot of reflection (otherwise something went wrong in your reflection process), but I don’t think you would endorse what AIs would do, and my guess is you also wouldn’t endorse what a lot of other humans would do when they undergo reflection here.
Like, what I am saying is that while there might be a relatively broad basin of conditions that give rise to something that locally looks like caring about other beings, the space of caring about other beings is deep and wide, and if you have an AI that cares about other beings’ preferences in some way you don’t endorse, this doesn’t actually get you anything. And I think the arguments that the concept of “caring about others” that an AI might have (though my best guess is that it won’t even have anything that is locally well-described by that) will hold up after a lot of reflection seem much weaker to me than the arguments that it will have that preference at roughly human capability and ethical-reflection levels (which seems plausible to me, though still overall unlikely).
Zeroth approximation of pseudokindness is strict nonintervention, reifying the patient-in-environment as a closed computation and letting it run indefinitely, with some allocation of compute. Interaction with the outside world creates vulnerability to external influence, but then again so does incautious closed computation, as we currently observe with AI x-risk, which is not something beamed in from outer space.
Formulation of the kinds of external influences that are appropriate for a particular patient-in-environment is exactly the topic of membranes/boundaries; this task can be taken as the defining desideratum for the topic. Specifically, it is the question of which environments can be put in contact with a particular membrane without corrupting it, which is why I think membranes are relevant to pseudokindness. Naturality of the membranes/boundaries abstraction is linked to naturality of the pseudokindness abstraction.
In contrast, the language of preferences/optimization seems to be the wrong frame for formulating pseudokindness; it wants to discuss ways of intervening and influencing, of not leaving value on the table, rather than ways of offering acceptable options that avoid manipulation. It might be possible to translate pseudokindness back into the language of preferences, but this translation would induce a kind of deontological prior on preferences that makes the more probable preferences look rather surprising/unnatural from a more preferences-first point of view.
Thanks for writing this. I also think what we want from pseudokindness is captured by membranes/boundaries.
Possibly relevant?