Can you give other conceptions of “impact” that people have proposed, and compare/contrast them with “How does this change my ability to get what I want?”
Also, there’s a bunch of different things that “want” could mean. Is that something you’ve thought about and if so, is it important to pick the right sense of “want”?
(BTW, in these kinds of sequences I never know whether to ask a question midway through or to wait and see if it will be resolved later. Maybe it would help to have a table of contents at the start? Or should I just ask and let the author say that they’ll be answered later in the sequence?)
Can you give other conceptions of “impact” that people have proposed, and compare/contrast them with “How does this change my ability to get what I want?”
This is not quite what you’re asking for, but I have a post on ways people have thought AIs that minimise ‘impact’ should behave in certain situations, and you can go through and see what the notion of ‘impact’ given in this post would advise. [ETA: although that’s somewhat tricky, since this post only defines ‘impact’ and doesn’t say how an agent should behave to minimise it]
Can you give other conceptions of “impact” that people have proposed, and compare/contrast them with “How does this change my ability to get what I want?”
The next post will cover this.
there’s a bunch of different things that “want” could mean. Is that something you’ve thought about and if so, is it important to pick the right sense of “want”?
I haven’t considered this at length yet. Since we’re only thinking descriptively right now, and in light of where the sequence is headed, I don’t know that it’s important to nail down the right sense. That said, I’m still quite interested in doing so.
In terms of the want/like distinction (keeping in mind that want is being used in its neuroscientific that-which-motivates sense, and not the sense I’ve been using in the post), consider the following:
A University of Michigan study analyzed the brains of rats eating a favorite food. They found separate circuits for “wanting” and “liking”, and were able to knock out either circuit without affecting the other… When they knocked out the “liking” system, the rats would eat exactly as much of the food without making any of the satisfied lip-licking expression, and areas of the brain thought to be correlated with pleasure wouldn’t show up in the MRI. Knock out “wanting”, and the rats seem to enjoy the food as much when they get it but not be especially motivated to seek it out.
Are wireheads happy?
Imagining my “liking” system being forever disabled feels pretty terrible, but not maximally negatively impactful (because I also have preferences about the world, not just how much I enjoy my life). Imagining my “wanting” system being disabled feels similar to imagining losing significant executive function—it’s not that I wouldn’t be able to find value in life, but my future actions now seem unlikely to be pushing my life and the world towards outcomes I prefer. Good things still might happen, and I’d like that, but they seem less likely to come about.
The above is still cheating, because I’m using “preferences” in my speculation, but I think it helps pin down things a bit. It seems like there’s some combination of liking/endorsing for “how good things are”, while “wanting” comes into play when I’m predicting how I’ll act (more on that in two posts, along with other embedded agentic considerations re: “ability to get”).
Or should I just ask and let the author say that they’ll be answered later in the sequence?
Doing this is fine! We’re basically past the point where I wanted to avoid past framings, so people can talk about whatever (although I reserve the right to reply “this will be much easier to discuss later”).
Can you give other conceptions of “impact” that people have proposed, and compare/contrast them with “How does this change my ability to get what I want?”
The next post will cover this.
(no way to double quote it seems...maybe nested BBCode?)
Anyhow, looking forward to that, as I was struggling a bit to read the claim that something cannot be a big deal if it doesn’t impact my getting what I want without it being tautological.
Well, the claim is tautological, after all! The problem with the first part of this sequence is that it can seem… obvious… until you realize that almost all prior writing about impact has not even acknowledged that we want the AI to leave us able to get what we want (to preserve our attainable utility). By default, one considers what “big deals” have in common, and then thinks about not breaking vases / not changing too much stuff in the world state. This attractor is so strong that when I said, “wait, maybe it’s not primarily about vases or objects”, it didn’t make sense.
The point of the first portion of the sequence isn’t to amaze people with the crazy surprising insane twists I’ve discovered in what impact really is about—it’s to show how things add up to normalcy, so as to set the stage for a straightforward discussion about one promising direction I have in mind for averting instrumental incentives.
The problem with the first part of this sequence is that it can seem… obvious… until you realize that almost all prior writing about impact has not even acknowledged that we want the AI to leave us able to get what we want (to preserve our attainable utility).
Agreed. This has been my impression from reading previous work on impact.
Let me substantiate my claim a bit with a random sampling; I just pulled up a relative reachability blogpost. From the first paragraph, (emphasis mine)
An incorrect or incomplete specification of the objective can result in undesirable behavior like specification gaming or causing negative side effects. There are various ways to make the notion of a “side effect” more precise – I think of it as a disruption of the agent’s environment that is unnecessary for achieving its objective. For example, if a robot is carrying boxes and bumps into a vase in its path, breaking the vase is a side effect, because the robot could have easily gone around the vase. On the other hand, a cooking robot that’s making an omelette has to break some eggs, so breaking eggs is not a side effect.
But notice that now we’re talking about “disruption of the agent’s environment”. Relative reachability is indeed tackling the impact measure problem, so, using what we now understand, we might prefer to reframe this as:
We think about “side effects” when they change our attainable utilities, so they’re really just a conceptual discretization of “things which negatively affect us”. We want the robot to prefer policies which avoid overly changing our attainable utilities. For example, if a robot is carrying boxes and bumps into a vase in its path, breaking the vase is a side effect, because it’s not that easy for us to repair the vase...