I am in the camp that thinks it is very good for people concerned about AI risk to be working at the frontier of development. I think it’s good to criticize and pressure the frontier labs that care, but I really wish that criticism weren’t paired with the unhelpful and untrue assertion that it would be better if Anthropic hadn’t been founded or supported.
The problem, as I argued in this post, is that people put far too much weight on timeline acceleration and seem willing to make tremendous sacrifices just to slow things down a small amount. If you advocate that people concerned about AI risk avoid working on AI capabilities, the first order effect of this is filtering AI capability researchers so that they care less about AI risk. Slowing progress down is a smaller, second order effect. But many people seem to take it for granted that completely ceding frontier AI work to people who don’t care about AI risk would be preferable because it would slow down timelines! This seems insane to me. How much time would possibly need to be saved for that to be worth it?
To try to get to our crux: I’ve found that caring significantly about accelerating timelines seems to hinge on a very particular view of alignment where pragmatic approaches by frontier labs are very unlikely to succeed, whereas some alternative theoretical work that is unrelated to modern AI has a high chance of success. I think we can see that here:
I skip details of technical safety agendas because these carry little to no weight. As far as I see, there was no groundbreaking safety progress at or before Anthropic that can justify the speed-up that their researchers caused. I also think their minimum necessary aim is intractable (controlling ‘AGI’ enough, in time or ever, to stay safe[4]).
I have the opposite view—successful alignment work is most likely to come out of people who work closely with cutting edge AI and who are using the modern deep learning paradigm. Because of this I think it’s great that so many leading AI companies care about AI risk, and I think we would be in a far worse spot if we were in a counterfactual world where OpenAI/DeepMind/Anthropic had never been founded and LLMs had (somehow) not been scaled up yet.
ignoring whether anthropic should exist or not, the claim
successful alignment work is most likely to come out of people who work closely with cutting edge AI and who are using the modern deep learning paradigm
(which I agree with wholeheartedly,)
does not seem like the opposite of the claim
there was no groundbreaking safety progress at or before Anthropic
both could be true in some world. and then,
pragmatic approaches by frontier labs are very unlikely to succeed
I believe this claim, if by “succeed” we mean “directly result in solving the technical problem well enough that the only problems that remain are political, and we now could plausibly make humanity’s consensus nightwatchman ai and be sure it’s robust to further superintelligence, if there was political will to do so”
but,
alternative theoretical work that is unrelated to modern AI has a high chance of success
I don’t buy this claim. I actually doubt there are other general learning techniques out there in math space at all, because I think we’re already just doing “approximation of bayesian updating on circuits”. BUT, I also currently think we cannot succeed (as above) without theoretical work that can get us from “well we found some concepts in the model...” to “...and now we have certified the decentralized nightwatchman for good intentions sufficient to withstand the weight of all other future superhuman minds’ mutation-inducing exploratory effort”.
I claim theoretical work of relevance needs to be immediately and clearly relevant to deep learning as soon as it comes out if it’s going to be of use. Something that can’t be used on deep learning can’t be useful. (And I don’t think all of MIRI’s work fails this test, though most does; I could go step through and classify if someone wants.)
I don’t think I can make reliably true claims about anthropic’s effects with the amount of information I have, but their effects seem suspiciously business-success-seeking to me, in a way that seems like it isn’t prepared to overcome the financial incentives I think are what mostly kill us anyway.
I actually doubt there are other general learning techniques out there in math space at all, because I think we’re already just doing “approximation of bayesian updating on circuits”
Interesting perspective! I think I agree with this in practice although not in theory (I imagine there are some other ways to make it work, I just think they’re very impractical compared to deep learning).
I don’t think I can make reliably true claims about anthropic’s effects with the amount of information I have, but their effects seem suspiciously business-success-seeking to me, in a way that seems like it isn’t prepared to overcome the financial incentives I think are what mostly kill us anyway.
Part of my frustration is that I agree there are tons of difficult pressures on people at frontier AI companies, and I think sometimes they bow to these pressures. They hedge about AI risk, they shortchange safety efforts, they unnecessarily encourage race dynamics. I view them as being in a vitally important and very difficult position where some mistakes are inevitable, and I view this as just another type of mistake that should be watched for and fixed.
But instead, these mistakes are used as just another rock to throw—any time they do something wrong, real or imagined, people use this as a black mark against them that proves they’re corrupt or evil. I think that’s both untrue and profoundly unhelpful.
Slowing progress down is a smaller, second order effect. But many people seem to take it for granted that completely ceding frontier AI work to people who don’t care about AI risk would be preferable because it would slow down timelines!
It would be good to discuss specifics. When it comes to Dario & co’s scaling of GPT, it is plausible that a ChatGPT-like product would not have been developed without that work (see this section).
They made a point at the time of expressing concern about AI risk. But what was the difference they made here?
caring significantly about accelerating timelines seems to hinge on a very particular view of alignment where pragmatic approaches by frontier labs are very unlikely to succeed, whereas some alternative theoretical work that is unrelated to modern AI has a high chance of success.
It does not hinge though on just that view. There are people with very different worldviews (e.g. Yudkowsky, me, Gebru) who strongly disagree on fundamental points – yet still concluded that trying to catch up on ‘safety’ with current AI companies competing to release increasingly unscoped and complex models used to increasingly automate tasks is not tractable in practice.
I’m noticing that you are starting from the assumption that it is a tractably solvable problem – particularly by “people who work closely with cutting edge AI and who are using the modern deep learning paradigm”.
A question worth looking into: how can we know whether the long-term problem is actually solvable? Is there a sound basis for believing that there is any algorithm we could build in that would actually keep controlling a continuously learning and self-manufacturing ‘AGI’ to not cause the extinction of humans (over at least hundreds of years, above some soundly guaranteeable and acceptably high probability floor)?
They made a point at the time of expressing concern about AI risk. But what was the difference they made here?
I think you’re right that releasing GPT-3 clearly accelerated timelines with no direct safety benefit, although I think there are indirect safety benefits of AI-risk-aware companies leading the frontier.
You could credibly accuse me of shifting the goalposts here, but in GPT-3 and GPT-4’s case I think the sooner they came out the better. Part of the reason the counterfactual world where OpenAI/Anthropic/DeepMind had never been founded and LLMs had never been scaled up seems so bad to me is that not only do none of the leading AI companies care about AI risk, but also once LLMs do get scaled up, everything will happen much faster because Moore’s law will be further along.
It does not hinge though on just that view. There are people with very different worldviews (e.g. Yudkowsky, me, Gebru) who strongly disagree on fundamental points – yet still concluded that trying to catch up on ‘safety’ with current AI companies competing to release increasingly unscoped and complex models used to increasingly automate tasks is not tractable in practice.
Gebru thinks there is no existential risk from AI so I don’t really think she counts here. I think your response somewhat confirms my point—maybe people vary on how optimistic they are about alternative theoretical approaches, but the common thread is strong pessimism about the pragmatic alignment work frontier labs are best positioned to do.
I’m noticing that you are starting from the assumption that it is a tractably solvable problem – particularly by “people who work closely with cutting edge AI and who are using the modern deep learning paradigm”.
A question worth looking into: how can we know whether the long-term problem is actually solvable? Is there a sound basis for believing that there is any algorithm we could build in that would actually keep controlling a continuously learning and self-manufacturing ‘AGI’ to not cause the extinction of humans (over at least hundreds of years, above some soundly guaranteeable and acceptably high probability floor)?
I agree you won’t get such a guarantee, just like we don’t have a guarantee that an LLM will learn grammar or syntax. What we can get is something that in practice works reliably. The reason I think it’s possible is that a corrigible and non-murderous AGI is a coherent target that we can aim at and that AIs already understand. That doesn’t mean we’re guaranteed success mind you but it seems pretty clearly possible to me.
Just a note here that I’m appreciating our conversation :) We clearly have very different views right now on what is strategically needed but digging your considered and considerate responses.
but also once LLMs do get scaled up, everything will happen much faster because Moore’s law will be further along.
How do you account for the problem here that Nvidia’s and downstream suppliers’ investment in GPU hardware innovation and production capacity also went up as a result of the post-ChatGPT race (to the bottom) between tech companies on developing and releasing their LLM versions?
I frankly don’t know how to model this somewhat soundly. It’s damn complex.
Gebru thinks there is no existential risk from AI so I don’t really think she counts here.
I was imagining something like this response yesterday (‘Gebru does not care about extinction risks’).
My sense is that the reckless abandonment of established safe engineering practices is part of what got us into this problem in the first place. I.e. if the safety community had insisted that models should be scoped and tested like other commercial software with critical systemic risks, we would be in a better place now.
It’s a more robust place to come from than the stance that developments will happen anyway – but that we somehow have to catch up by inventing safety solutions generally applicable to models auto-encoded on our general online data to have general (unknown) functionality, used by people generally to automate work in society.
If we’d manage to actually coordinate around not engineering stuff that Timnit Gebru and colleagues would count as ‘unsafe to society’ according to say the risks laid out in the Stochastic Parrots paper, we would also robustly reduce the risk of taking a mass extinction all the way. I’m not saying that is easy at all, just that it is possible for people to coordinate on not continuing to develop risky resource-intensive tech.
but the common thread is strong pessimism about the pragmatic alignment work frontier labs are best positioned to do.
This I agree with. So that’s our crux.
This is not a very particular view – in terms of the possible lines of reasoning and/or people with epistemically diverse worldviews that end up arriving at this conclusion. I’d be happy to discuss the reasoning I’m working from, in the time that you have.
I agree you won’t get such a guarantee
Good to know.
I was not clear enough with my one-sentence description. I actually mean two things:
1. No sound guarantee of preventing ‘AGI’ from causing extinction (over the long term, above some acceptably high probability floor), due to fundamental control bottlenecks in tracking and correcting out the accumulation of harmful effects as the system modifies in feedback with the environment over time.
2. The long-term convergence of this necessarily self-modifying ‘AGI’ on causing changes to the planetary environment that humans cannot survive.
The reason I think it’s possible is that a corrigible and non-murderous AGI is a coherent target that we can aim at and that AIs already understand. That doesn’t mean we’re guaranteed success mind you but it seems pretty clearly possible to me.
I agree that this is a specific target to aim at.
I also agree that you could program for an LLM system to be corrigible (for it to correct output patterns in response to human instruction). The main issue is that we cannot build an algorithm into fully autonomous AI that can maintain coherent operation towards that target.
Just a note here that I’m appreciating our conversation :) We clearly have very different views right now on what is strategically needed but digging your considered and considerate responses.
Thank you! Same here :)
How do you account for the problem here that Nvidia’s and downstream suppliers’ investment in GPU hardware innovation and production capacity also went up as a result of the post-ChatGPT race (to the bottom) between tech companies on developing and releasing their LLM versions?
I frankly don’t know how to model this somewhat soundly. It’s damn complex.
I think it’s definitely true that AI-specific compute is further along than it would be if the LLM boom hadn’t happened. I think the basic trade-off is unaffected though: earlier LLM development means faster timelines but slower takeoff.
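To make that trade-off concrete, here is a minimal toy sketch of the compute-overhang intuition (the doubling time and start years are purely illustrative assumptions, not estimates from this thread):

```python
# Toy "compute overhang" sketch: purely illustrative, assuming hardware compute
# doubles roughly every 2 years (Moore's-law-style). None of these numbers are
# estimates from the discussion above.

def available_compute(year, doubling_years=2.0):
    """Hardware compute available in a given year, in arbitrary units."""
    return 2 ** (year / doubling_years)

def overhang(start_year):
    """How much hardware has accumulated, relative to year 0, by the time an
    LLM-style scale-up begins -- a crude proxy for how abrupt the jump is."""
    return available_compute(start_year) / available_compute(0)

for start in (0, 4, 8):
    print(f"scale-up starting in year {start}: ~{overhang(start):.0f}x compute overhang")

# A later start leaves a bigger overhang, i.e. a faster takeoff once scaling begins;
# an earlier start means faster timelines but a slower, more gradual takeoff.
```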
Personally I think slower takeoff is more important than slower timelines, because that means we get more time to work with and understand these proto-AGI systems. On the other hand to people who see alignment as more of a theoretical problem that is unrelated to any specific AI system, slower timelines are good because they give theory people more time to work and takeoff speeds are relatively unimportant.
But I do think the latter view is very misguided. I can imagine a setup for training an LLM in a way that makes it both generally intelligent and aligned; I can’t imagine a recipe for alignment that works outside of any particular AI paradigm, or that invents its own paradigm while simultaneously aligning it. I think the reason a lot of theory-pilled people, such as those at MIRI, become doomers is that they try to make that general recipe and predictably fail.
This is not a very particular view – in terms of the possible lines of reasoning and/or people with epistemically diverse worldviews that end up arriving at this conclusion. I’d be happy to discuss the reasoning I’m working from, in the time that you have.
I think I’d like to have a discussion about whether practical alignment can work at some point, but I think it’s a bit outside the scope of the current convo. (I’m referring to the two groups here as ‘practical’ and ‘theoretical’ as a rough way to divide things up).
Above and beyond the argument over whether practical or theoretical alignment can work I think there should be some norm where both sides give the other some credit. Because in practice I doubt we’ll convince each other, but we should still be able to co-operate to some degree.
E.g. for myself I think theoretical approaches that are unrelated to the current AI paradigm are totally doomed, but I support theoretical approaches getting funding because who knows, maybe they’re right and I’m wrong.
And on the other side, given that having people at frontier AI labs who care about AI risk is absolutely vital for practical alignment, I take anti-frontier lab rhetoric as breaking a truce between the two groups in a way that makes AI risk worse. Even if this approach seems doomed to you, I think if you put some probability on you being wrong about it being doomed then the cost-benefit analysis should still come up robustly positive for AI-risk-aware people working at frontier labs (including on capabilities).
This is a bit outside the scope of your essay since you focused on leaders at Anthropic who it’s definitely fair to say have advanced timelines by some significant amount. But for the marginal worker at a frontier lab who might be discouraged from joining due to X-risk concerns, I think the impact on timelines is very small and the possible impact on AI risk is relatively much larger.
Above and beyond the argument over whether practical or theoretical alignment can work I think there should be some norm where both sides give the other some credit …
E.g. for myself I think theoretical approaches that are unrelated to the current AI paradigm are totally doomed, but I support theoretical approaches getting funding because who knows, maybe they’re right and I’m wrong.
I understand this is a common area of debate.
Neither approach works, based on the reasoning I’ve gone through.
if we can get a guarantee, it’ll also include guarantees about grammar and syntax. doesn’t seem like too much to ask, might have been too much to ask to do it before the model worked at all, but SLT seems on track to give a foothold from which to get a guarantee. might need to get frontier AIs to help with figuring out how to nail down the guarantee, which would mean knowing what to ask for, but we may be able to be dramatically more demanding with what we ask for out of a guarantee-based approach than previous guarantee-based approaches, precisely because we can get frontier AIs to help out, if we know what bound we want to find.
My point was that even though we already have an extremely reliable recipe for getting an LLM to understand grammar and syntax, we are not anywhere near a theoretical guarantee for that. The ask for a theoretical guarantee seems impossible to me, even on much easier things that we already know modern AI can do.
When someone asks for an alignment guarantee I’d like them to demonstrate what they mean by showing a guarantee for some simpler thing—like a syntax guarantee for LLMs. I’m not familiar with SLT but I’ll believe it when I see it.
I completely agree with this! I think lots of people here are so focused on slowing down AI that they forget the scope of things. According to Remmelt himself, $600 billion+ is being invested yearly into AI. Yet AI safety spending is less than $0.2 billion.
Even if money spent on AI capabilities speeds up capabilities far more efficiently than money spent on AI alignment speeds up alignment, it’s far easier to grow AI alignment effort by twofold, and far harder to make even a dent in the AI capabilities effort! I think any AI researcher who works on AI alignment at all, should sleep peacefully at night knowing they are a net positive (barring unpredictably bad luck). We shouldn’t alienate these good people.
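As a rough back-of-the-envelope check on that leverage claim, using the figures above (treating a 1% cut in capabilities spending as a “dent” is just an illustrative benchmark, not a model of how spending maps to progress):

```python
# Back-of-the-envelope leverage comparison, using the figures quoted above.
# The 1% "dent" is a crude illustrative proxy, not a measured quantity.

capabilities_spend = 600e9  # ~$600B+ invested yearly into AI overall
alignment_spend = 0.2e9     # <$0.2B yearly on AI safety

cost_to_double_alignment = alignment_spend                    # doubling alignment effort
cost_of_1pct_capabilities_dent = 0.01 * capabilities_spend    # shaving 1% off capabilities

print(f"Doubling alignment effort: ~${cost_to_double_alignment / 1e9:.1f}B")
print(f"A 1% dent in capabilities: ~${cost_of_1pct_capabilities_dent / 1e9:.1f}B "
      f"({cost_of_1pct_capabilities_dent / cost_to_double_alignment:.0f}x as much)")
```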
Yet I never manage to convince anyone on LessWrong of this!
PS: I admit there are some reasonable world models which disagree with me.
Some people argue that it’s not a race between AI capabilities and AI alignment, but a race between AI capabilities and some mysterious time in the future when we manage to ban all AI development. They think this because they think AI alignment is very impractical.
I think their world model is somewhat plausible-ish.
But first of all, if this were the case, AI alignment work still might be an indirect net positive by moving the Overton window toward taking AI x-risk seriously rather than laughing at it as a morbid curiosity. It’s hard to make a dent in the hundreds of billions spent on AI capabilities, so the main effect of hundreds of millions spent on AI alignment research will still be normalizing a serious effort against AI x-risk. The US spending a lot on AI alignment is a costly signal to China that AI x-risk is serious and that US negotiators aren’t just using AI x-risk as an excuse to convince China to give up the AI race.
Second of all, if their world model were really correct, the Earth is probably already doomed. I don’t see a realistic way to ban all AI development in every country in the near future. Even small AI labs like DeepSeek are making formidable AI, so global cooperation would have to be absurdly airtight. We couldn’t even stop North Korea from getting nukes, which was far easier. In this case, the vast majority of all value in the universe would be found on ocean planets with a single island nation, where there would be no AI race between multiple countries (and thus it would be far, far easier to ban AI). Planets like Earth (with many countries) would have a very low rate of survival, and would be a tiny fraction of the value in the universe.
My decision theory is to care more about what to do in scenarios where what I do actually matters, and therefore I don’t worry too much about this doomed scenario.
PS: I’m not 100% convinced Anthropic in particular is a net positive.
Their website only mentions their effort against AI x-risk among a pile of other self-promoting corporate-speak, and while they are making many genuine efforts, they’re not obviously superior to other labs like Google DeepMind.
I find it confusing how many AI labs that seem to care about AI x-risk enough to be a net positive are racing against each other rather than making some cooperative deal (e.g. Anthropic, Google DeepMind, SSI, and probably others I haven’t heard about yet).
Wow, we have a lot of the same thinking!

I’ve also felt like people who think we’re doomed are basically spending a lot of their effort on sabotaging one of our best bets in the case that we are not doomed, with no clear path to victory in the case where they are correct (how would Anthropic slowing down lead to a global stop?).
And yeah I’m also concerned about competition between DeepMind/Anthropic/SSI/OpenAI—in theory they should all be aligned with each other but as far as I can see they aren’t acting like it.
As an aside, I think the extreme pro-slowdown view is something of a vocal minority. I met some Pause AI organizers IRL and brought up the points I brought in my original comment, expecting pushback, but they agreed, saying they were focused on neutrally enforced slowdowns e.g. government action.
Yeah, I think arguably the biggest thing to judge AI labs on is whether they are pushing the government in favour of regulation or against it. In general, the only way for a business in a misregulated industry to do good is to lobby in favour of better regulation (rather than against it).
It’s inefficient and outright futile for activists to demand that individual businesses unilaterally do the right thing, get outcompeted, go out of business, and have to fire all their employees; it’s so much better if the activists focus on the government instead. Not only is it extraordinarily hard for one business to make this self-sacrifice, but even if one does, the problem will remain almost just as bad. This applies to every misregulated industry, but for AI in particular, “doing the right thing” seems the most antithetical to commercial viability.
It’s disappointing that I don’t see Anthropic pushing the government extremely urgently on AI x-risk, whether it’s regulation or even x-risk spending. I think at one point they even mentioned the importance of the US winning the AI race against China. But at least they’re not against more regulation and seem more in favour of it than other AI labs? At least they’re not openly downplaying the risk? It’s hard to say.