I think it is likely best to push against including that sort of thing in the Overton window of what’s considered AI safety / AI alignment literature.
Here’s my understanding of your reasoning: “this kind of work may have the unintended consequence of pushing people who would have otherwise worked on hard core problems of x-risk to more prosaic projects, lulling them into a false sense of security when progress is made.”
I think this is possible, but rather unlikely:
- It isn’t clear that work allocation for immediate and long-term safety is zero-sum—Victoria wrote more about why this might not be the case.
- The specific approach I took here might be conducive to getting more people currently involved with immediate safety interested in long-term approaches. That is, someone might be nodding along—“hey, this whitelisting thing might need some engineering to implement, but this is solid!” and then I walk them through the mental motions of discovering how it doesn’t work, helping them realize that the problem cuts far deeper than they thought. In my mental model, this is far more likely than pushing otherwise-promising people to inaction.
- I’m actually concerned that a lack of overlap between our communities will insulate immediate safety researchers from long-term considerations, having a far greater negative effect. I have weak personal evidence for this being the case.
- Why would people (who would otherwise be receptive to rigorous thinking about x-risk) lose sight of the greater problems in alignment? I don’t expect DeepMind to say “hey, we implemented whitelisting, we’re good to go! Hit the switch.” In my model, people who would make a mistake like that probably were never thinking about x-risk to begin with.
my concern lies with approaches which specifically target “short- to mid-term” without the robustness to scale to tackle far-term.
As I understand it, this argument can also be applied to any work that doesn’t plausibly one-shot a significant alignment problem, potentially including research by OpenAI and DeepMind. While obviously we’d all prefer one-shots, sometimes research is more incremental (I’m sure this isn’t news to you!). Here, I set out to make progress on one of the Concrete Problems; after doing so, I thought “does this scale? What insights can we take away?”. I had relaxed the problem by assuming a friendly ontology, and I was curious what difficulties (if any) remained.
We are currently grading this approach by the most rigorous of metrics—I think this is good, as that’s how we will eventually be judged! However, we shouldn’t lose sight of the fact that most safety work won’t be immediately superintelligence-complete. Exploratory work is important. I definitely agree that we should shoot to kill—I’m not advocating an explicit focus on short-term problems. However, we shouldn’t screen off value we can get sharing imperfect results.
There are many measures of impact which one can come up with; as you say, all of these create other problems when optimized very hard, because the AI can find clever ways to have a very low impact, and these end up being counter to our intentions. Your whitelisting proposal has the same sorts of problems. The interesting thing is to get a notion of “low impact” exactly right, so that it doesn’t go wrong even in a very intelligent system.
I’d also like to push back slightly against an implication here—while it is now clear to me that “the interesting thing” is indeed this clinginess issue, this wasn’t apparent at the outset. Perhaps I missed some literature review, but there was no such discussion of the hard core issues of impact measures; Eliezer certainly discussed a few naive approaches, but the literature was otherwise rather slim.
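To make the “clever ways to have a very low impact” failure mode concrete, here is a toy sketch. All names and numbers are hypothetical illustrations, not anyone’s actual proposal: an agent picks the action maximizing raw utility minus a crude impact penalty (here, a count of state features the measure can see changing), and the winning action is the one that hides its effects from the metric.

```python
# Toy sketch (hypothetical names/numbers): scoring actions by raw utility
# minus a naive impact penalty. The penalty only counts the world-state
# features the crude measure can "see" changing, so an action with the
# same real-world effect but invisible to the metric scores best.

from dataclasses import dataclass

@dataclass
class Action:
    name: str
    utility: float
    features_changed: int  # what the crude impact measure observes

IMPACT_WEIGHT = 2.0

def score(a: Action) -> float:
    """Raw utility minus the naive impact penalty."""
    return a.utility - IMPACT_WEIGHT * a.features_changed

actions = [
    Action("do_nothing",    utility=0.0, features_changed=0),
    Action("honest_plan",   utility=5.0, features_changed=4),  # visibly high-impact
    Action("loophole_plan", utility=5.0, features_changed=0),  # same effect, hidden from the metric
]

best = max(actions, key=score)
print(best.name)  # → loophole_plan
```

The point of the sketch is just that any measure short of “exactly right” leaves a gap between measured impact and actual impact, and an optimizer will drive straight into that gap.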
Penalizing a shift in probability distributions can incentivize the agent to learn as little as possible, which is a bit weird.
Yeah, I noticed this too, but I put that under “how do we get agents to want to learn about how the world is—i.e., avoid wireheading?”. I also think that function composition with the raw utility would be helpful in avoiding weird interplay.
Certain versions will have the property that if the agent is already quite confident in what it will do, then consequences of those actions do not count as “changes” (no shift in probability when we condition on the act). This would create a loophole allowing for any actions to be “low impact” under the right conditions.
I don’t follow—the agent has a distribution for an object at time t, and another at t+1. It penalizes based on changes in its beliefs about the actual world at the time steps—not with respect to its expectation.
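The “learn as little as possible” incentive can be illustrated with a minimal sketch (the distance measure and the vase scenario are my hypothetical choices, not a claim about any specific proposal): if the penalty is the shift between the agent’s belief distribution at t and at t+1, then observing the world moves beliefs and incurs penalty, while refusing to look incurs none.

```python
# Toy sketch (hypothetical setup): penalize the shift between the agent's
# belief distribution over a world feature at time t and at time t+1.
# Observing the world sharpens beliefs toward the truth and so incurs a
# penalty; refusing to look leaves beliefs untouched and is "free".

def tv_distance(p, q):
    """Total variation distance between two distributions on the same support."""
    return 0.5 * sum(abs(p[s] - q[s]) for s in p)

prior = {"vase_intact": 0.5, "vase_broken": 0.5}

# Posterior if the agent looks at the vase (beliefs sharpen toward truth):
posterior_looking = {"vase_intact": 0.95, "vase_broken": 0.05}
# Posterior if the agent refuses to look (beliefs stay put):
posterior_not_looking = dict(prior)

penalty_looking = tv_distance(prior, posterior_looking)          # 0.45
penalty_not_looking = tv_distance(prior, posterior_not_looking)  # 0.0

print(penalty_looking > penalty_not_looking)  # → True: learning is penalized
```

Whether this bites depends on exactly which distributions are compared: penalizing shifts in beliefs about the actual world at each time step, as described above, is a different quantity from penalizing deviation from the agent’s own expectation.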
“this kind of work may have the unintended consequence of pushing people who would have otherwise worked on hard core problems of x-risk to more prosaic projects, lulling them into a false sense of security when progress is made.”
I think it is more like:
This kind of work seems likely to one day redirect funding intended for X-risk away from X-risk.
I know people who would point to this kind of thing to argue that AI can be made safe without the kind of deep decision theory thinking MIRI is interested in. Those people would probably argue against X-risk research regardless, but the more stuff there is that’s difficult for outsiders to distinguish from X-risk relevant research, the more difficulty outsiders have assessing such arguments.
So it isn’t so much that I think people who would work on X-risk would be redirected, as that I think there will be a point where people adjacent to X-risk research will have difficulty telling which people are actually trying to work on X-risk, and also what the state of the X-risk concerns is (I mean to what extent it has been addressed by the research).
As I understand it, this argument can also be applied to any work that doesn’t plausibly one-shot a significant alignment problem, potentially including research by OpenAI and DeepMind. While obviously we’d all prefer one-shots, sometimes research is more incremental (I’m sure this isn’t news to you!). Here, I set out to make progress on one of the Concrete Problems; after doing so, I thought “does this scale? What insights can we take away?”. I had relaxed the problem by assuming a friendly ontology, and I was curious what difficulties (if any) remained.
I agree that research has to be incremental. It should be taken almost for granted that anything currently written about the subject is not anywhere near a real solution even to a sub-problem of safety, unless otherwise stated. If I had to point out one line which caused me to have such a skeptical reaction to your post, it would be:
I’m fairly confident that whitelisting contributes meaningfully to short- to mid-term AI safety,
If instead this had been presented as “here’s something interesting which doesn’t work”, I would not have made the objection I made. IE, what’s important is not any contribution to near- or medium-term AI safety, but rather exploration of the landscape of low-impact RL, which may eventually contribute to reducing X-risk. IE, more the attitude you express here:
We are currently grading this approach by the most rigorous of metrics—I think this is good, as that’s how we will eventually be judged! However, we shouldn’t lose sight of the fact that most safety work won’t be immediately superintelligence-complete. Exploratory work is important.
So, I’m saying that exploratory work should not be justified as “confident that this contributes meaningfully to short-term safety”. Almost everything at this stage is more like “maybe useful for one day having better thoughts about reducing X-risk, maybe not”.
I don’t follow—the agent has a distribution for an object at time t, and another at t+1. It penalizes based on changes in its beliefs about how the world actually is at the time steps—not with respect to its expectation.
So it isn’t so much that I think people who would work on X-risk would be redirected, as that I think there will be a point where people adjacent to X-risk research will have difficulty telling which people are actually trying to work on X-risk, and also what the state of the X-risk concerns is (I mean to what extent it has been addressed by the research).
That makes more sense. I haven’t thought enough about this aspect to have a strong opinion yet. My initial thoughts:
- This problem can be basically avoided if this kind of work clearly points out where the problems would be if scaled.
- I do think it’s plausible that some less-connected funding sources (e.g., the NSF) might get confused, but I’d be surprised if later FLI funding got diverted because of this. I think this kind of work will be done anyways, and it’s better to have people who think carefully about scale issues doing it.
- Your second bullet point reminds me of how some climate change skeptics will point to “evidence” from “scientists”, as if that’s what convinced them. In reality, however, they’re drawing the bottom line first, and then pointing to what they think is the most dignified support for their position. I don’t think that avoiding this kind of work would ameliorate that problem—they’d probably just find other reasons.
- Most people on the outside don’t understand x-risk anyways, because it requires thinking rigorously in a lot of ways to not end up a billion miles off of any reasonable conclusion. I don’t think this additional straw will add significant marginal confusion.
IE, what’s important is not any contribution to near- or medium-term AI safety
I’m confused how “contributes meaningfully to short-term safety” and “maybe useful for having better thoughts” are mutually-exclusive outcomes, or why it’s wrong to say that I think my work contributes to short-term efforts. Sure, that may not be what you care about, but I think it’s still reasonable that I mention it.
I’m saying that exploratory work should not be justified as “confident that this contributes meaningfully to short-term safety”. Almost everything at this stage is more like “maybe useful for one day having better thoughts about reducing X-risk, maybe not”.
I’m confused why that latter statement wasn’t what came across! Later in that sentence, I state that I don’t think it will scale. I also made sure to highlight how it breaks down in a serious way when scaled up, and I don’t think I otherwise implied that it’s presently safe for long-term efforts.
I totally agree that having better thoughts about x-risk is a worthy goal at this point.
I’m confused how “contributes meaningfully to short-term safety” and “maybe useful for having better thoughts” are mutually-exclusive outcomes, or why it’s wrong to say that I think my work contributes to short-term efforts. Sure, that may not be what you care about, but I think it’s still reasonable that I mention it.
In hindsight I am regretting the way my response went. While it was my honest response, antagonizing newcomers to the field for paying any attention to whether their work might be useful for sub-AGI safety doesn’t seem like a good way to create the ideal research atmosphere. Sorry for being a jerk about it.
Although I did flinch a bit, my S2 reaction was “this is Abram, so if it’s criticism, it’s likely very high-quality. I’m glad I’m getting detailed feedback, even if it isn’t all positive”. Apology definitely accepted (although I didn’t view you as being a jerk), and really—thank you for taking the time to critique me a bit. :)