I agree that it would be safer if humanity were a collective hivemind that could coordinate to not build AI until we know how to build the best AI, and that people should differentially work on things that make the situation better rather than worse, and that this potentially includes keeping quiet about information that would make things worse.
The problem is—as you say—“[i]t’s very rare that any research purely helps alignment”; you can’t think about aligning AI without thinking about AI. In order to navigate the machine intelligence transition in the most dignified way, you want your civilization’s best people to be doing their best thinking about the problem, and your best people can’t do their best thinking under the conditions of paranoid secrecy.
Concretely, I’ve been studying some deep learning basics lately and have written a couple posts about things I’ve learned. I think this was good, not bad. I think I and my readers have a slightly better understanding of the technology in question than if I hadn’t studied and hadn’t written, and that better understanding will help us make better decisions in expectation.
This applies doubly so to work that aims to make AI understandable or helpful, rather than aligned—a helpful AI will help anyone
Sorry, what? I thought the fear was that we don’t know how to make helpful AI at all. (And that people who think they’re being helped by seductively helpful-sounding LLM assistants are being misled by surface appearances; the shoggoth underneath has its own desires that we won’t like when it’s powerful enough to pursue them autonomously.) In contrast, this almost makes it sound like you think it is plausible to align AI to its user’s intent, but that this would be bad if the users aren’t one of “us”—you know, the good alignment researchers who want to use AI to take over the universe, totally unlike those evil capabilities researchers who want to use AI to produce economically valuable goods and services.
My steelman of this (though to be clear I think your comment makes good points):
There is a large difference between a system being helpful and a system being aligned. Ultimately AI existential risk is a coordination problem where I expect catastrophic consequences because a bunch of people want to build AGI without making it safe. Therefore making technologies that in a naive and short-term sense just help AGI developers build whatever they want to build will have bad consequences. If I trusted everyone to use their intelligence just for good things, we wouldn’t have anthropogenic existential risk on our hands.
Some of those technologies might also end up useful for getting the AI to be more properly aligned, or for work that reduces the risk of AI catastrophe some other way, though my current sense is that that kind of work is pretty different and doesn’t benefit remotely as much from generically locally-helpful AI.
In general I feel pretty sad about conflating “alignment” with “short-term intent alignment”. I think the two problems are related but have crucial differences, I don’t think the latter generalizes that well to the former (for all the usual sycophancy/treacherous-turn reasons), and indeed progress on the latter IMO mostly makes the world marginally worse, because the thing it is most likely to be used for is developing existentially dangerous AI systems faster.
Edit: Another really important dimension to model here is not just the effect of this kind of research on what individual researchers will do, but also its effect on what the market wants to invest in. My standard story of doom is centrally rooted in there being very strong short-term individual economic incentives to build more capable AGI, enabling people to make billions to trillions of dollars, while the downside risk is a distributed negative externality that is not at all priced into the costs of AI development. Developing applications of AI that make a lot of money without accounting for the negative extinction externalities can therefore be really quite bad for the world.
The problem is that “helpful” oracle AI will not stay helpful for long, if there is any incentive to produce things which are less helpful. Your beliefs are apparently out of date: we have helpful AI now, so that’s an existence disproof of “helpful AI is impossible”. But the threat of AI being more evolutionarily fit, and possibly an AI taking sudden and intense action to make use of its being more evolutionarily fit, is still hanging over our heads; and it only takes one hyperdesperate not-what-you-meant seeker.
Concretely, I think your posts are in fact a great (but not at all worst-case) example of things that have more cost than benefit, and I think you should keep working but only talk to people in DMs. Time is very, very short, and if you accidentally have a pivotally negative impact, you could be the one that burns the last two days before the world is destroyed.
but that this would be bad if the users aren’t one of “us”—you know, the good alignment researchers who want to use AI to take over the universe, totally unlike those evil capabilities researchers who want to use AI to produce economically valuable goods and services.
Rather, “us” — the good alignment researchers who will be at all careful about the long-term effects of our actions, unlike capabilities researchers who are happy to accelerate race dynamics and increase p(doom) if they can make a quick profit out of it in the short term.
I think these judgements would benefit from more concreteness: rather than proposing a dichotomy of “capabilities research” (them, Bad) and “alignment research” (us, Good), you could be more specific about what kinds of work you want to see more and less of.
I agree that (say) Carmack and Sutton are doing a bad thing by declaring a goal to “build AGI” while dismissing the reasons that this is incredibly dangerous. But the thing that makes infohazard concerns so fraught is that there’s a lot of work that potentially affects our civilization’s trajectory into the machine intelligence transition in complicated ways, which makes it hard to draw a boundary around “trusted alignment researchers” in a principled and not self-serving way that doesn’t collapse into “science and technology is bad”.
We can agree that OpenAI as originally conceived was a bad idea. What about the people working on music generation? That’s unambiguously “capabilities”, but it’s also not particularly optimized at ending the world the way “AGI for AGI’s sake” projects are. If that’s still bad even though music generation isn’t going to end the world (because it’s still directing attention and money into AI, increasing the incentive to build GPUs, &c.), where do you draw the line? Some of the researchers I cited in my most recent post are working on “build[ing] better models of primate visual cognition”. Is that wrong? Should Judea Pearl not have published? Turing? Charles Babbage?
In asking these obnoxious questions, I’m not trying to make a reductio ad absurdum of caring about risk, or proposing an infinitely slippery slope where our only choices are between max accelerationism and a destroy-all-computers Butlerian Jihad. I just think it’s important to notice that “Stop thinking about AI” kind of does amount to a Butlerian Jihad (and that publishing and thinking are not unrelated)?
In contrast, this almost makes it sound like you think it is plausible to align AI to its user’s intent, but that this would be bad if the users aren’t one of “us”—you know, the good alignment researchers who want to use AI to take over the universe, totally unlike those evil capabilities researchers who want to use AI to produce economically valuable goods and services.
If I’m being honest, I don’t find this framing helpful.
If you believe that things will go well if certain actors gain access to advanced AI technologies first, you should directly argue that.
Focusing on status games feels like a red herring.
I think this is undignified.