I think a lot of rationalists accepted these Molochian offers (“build the Torment Nexus before others do it”, “invest in the Torment Nexus and spend the proceeds on Torment Nexus safety”) and the net result is simply that the Nexus is getting built earlier, with most safety work ending up as enabling capabilities or safetywashing. The rewards promised by Moloch have a way of receding into the future as the arms race expands, while the harms are already here and growing.
Consider the investments in AI by people like Jaan. It’s possible for them to increase funding for the things they think are most helpful by sizable proportions while increasing capital for AI by <1%: there are now trillions of dollars of market cap in AI securities (so 1% of that is tens of billions of dollars), and returns have been very high. You can take a fatalist stance that nothing can be done to help with resources, but if there are worthwhile things to do then it’s very plausible that for such people it works out.
If you’re going to say that e.g. export controls on AI chips and fabrication equipment, non-lab AIS research (e.g. Redwood), the CAIS extinction risk letter, and similar have meaningful expected value, or are concerned about the funding limits for such activity now,
I don’t think any of these orgs are funding constrained, so I am kind of confused what the point of this is. It seems like without these investments all of these projects have little problem attracting funding, and the argument would need to be that we could find $10B+ of similarly good opportunities, which seems unlikely.
More broadly, it’s possible to increase funding for those things by 100%+ while increasing capital for AI by <1%
I think it’s not that hard to end up with BOTECs where the people who have already made these investments ended up causing more harm than good (correlated with them having made the investments).
In general, approximately none of this area is funding constrained, or it will very soon stop being so when $30B+ of Anthropic equity starts getting distributed. The funding decisions are largely downstream of political and reputational constraints, not the aggregate availability of funding. A more diversified funding landscape would change some of this, but in the same way that there is a very surprising amount of elite coordination among national COVID policies, there is also a very surprising amount of coordination among funders, and additional funding can be used to prevent projects from coming into existence just as readily as it can be used to create good ones (by actively recruiting away top talent, or by using the status associated with being a broad distributor of funding to threaten social punishment on people associated with projects you think are bad).
I know of a few projects that are funding constrained, but we are talking about at most $30M of projects that I can identify as solidly net-positive, and the cause of those projects not being funded has approximately nothing to do with there not being more funding around and is mostly down to political constraints. Furthermore, my guess is that I would rather have those $30M of projects go unfunded if that prevents an additional $5B+ from ending up with the kind of funders who are mood-affiliated with AI safety, whom I expect to then use that money largely to drive a faster race by backing some capability company or something similarly ill-advised (probably with much more leverage than other funding, resulting in a non-trivial speed-up, because the people involved will actually have a model of what drives AGI progress).
Like, I have done lots of BOTECs here, and it just doesn’t seem that weird to think that the kind of people who end up investing in these companies ultimately use that money for more harm than good. In contrast, it feels weird to me that the people who have presented me with clever BOTECs about why these kinds of investments are a good idea suspiciously never have any term in their BOTECs that would even allow for a negative impact, either from themselves or from the people they are hoping to convince to follow the advice of investing.
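To illustrate the kind of missing term I mean, here is a minimal BOTEC sketch with made-up numbers; the only point is that a single probability-weighted downside term can flip the sign of the estimate:

```python
# Toy BOTEC sketch with invented numbers, only to show the structure:
# once you include a probability-weighted downside term, the estimate
# can flip sign even when the upside-only version looks clearly positive.

def botec_net_value(donation_value: float,
                    p_net_harm: float,
                    harm_if_realized: float) -> float:
    """Expected value = upside from the money given away minus a probability-weighted downside."""
    return donation_value - p_net_harm * harm_if_realized

# Upside-only version (the style I keep being shown): obviously positive.
print(botec_net_value(donation_value=100, p_net_harm=0.0, harm_if_realized=0))    # 100.0

# Same upside, plus a modest chance that the investor or the funders they
# enable end up accelerating the race:
print(botec_net_value(donation_value=100, p_net_harm=0.3, harm_if_realized=500))  # -50.0
```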
My current guess is that Jaan is somewhat of a notable outlier here, but Jaan also has policies for trying not to end up exposed to the downsides of this that are radically different from those of basically anyone else I’ve seen act in the space. At least in my interactions with him I’ve seen him do a ton of stuff to be less of a target of the social and political pressures associated with this kind of thing, and to do a lot to shield the people he asks to distribute his funding from those pressures.
I think the default expectation is that if someone ends up owning $5M+ of equity that is heavily correlated with capabilities progress, this will cause them to do dumb stuff that makes the problem worse, not better, because they now have a substantial stake in making AI progress faster and a strong hesitation to slow it down. You can work around this if you take the actual problem of epistemic distortion seriously, as I do think Jaan has done, but if you advocate for people to do this without taking the actual corruption concerns seriously, you just end up making things worse.
All good points, and I wanted to reply with some of them, so thanks. But there’s also another point where I might disagree more with LW folks (including you and Carl and maybe even Wei): I no longer believe that technological whoopsie is the main risk. I think we have enough geniuses working on the thing that technological whoopsie probably won’t happen. The main risk to me now is that AI gets pretty well aligned to money and power, and then money and power throws most humans by the wayside. I’ve mentioned it many times; the cleanest formulation is probably in this book review.
In that light, Redwood and others are just making better tools for money and power, to help align AI to their ends. Export controls are a tool of international conflict: if they happen, they happen as part of a package of measures which basically intensify the arms race. And even the CAIS letter is now looking to me like a bit of a PR move, where Altman and others got to say they cared about risk and then went on increasing risk anyway. Not to mention the other things done by safety-conscious money, like starting OpenAI and Anthropic. You could say the biggest things that safety-conscious money achieved were basically enabling stuff that money and power wanted. So the endgame wouldn’t be some kind of war between humans and AI, it would be AI simply joining up with money and power, and cutting out everyone else.
My response to this is similar to my response to Will MacAskill’s suggestion to work on reducing the risk of AI-enabled coups: I’m pretty worried about this, but equally or even more worried about broader alignment “success”, e.g., if AI was aligned to humanity as a whole, or everyone got their own AI representative, or something like that, because I generally don’t trust humans to have (or end up with) good values by default. See these posts for some reasons why.
However, I think it’s pretty plausible that there’s a technological solution to this (although we’re not on track for achieving it), for example if it’s actually wrong (from their own perspective) for the rich and powerful to treat everyone else badly, and AIs are designed to be philosophically competent and as a result help their users realize and fix their moral/philosophical mistakes.
Since you don’t seem to think there’s a technological solution to this, what do you envision as a good outcome?
It’s complicated.

First, I think there’s enough overlap between different reasoning skills that we should expect a smarter than human AI to be really good at most such skills, including philosophy. So this part is ok.
Second, I don’t think philosophical skill alone is enough to figure out the right morality. For example, let’s say you like apples but don’t like oranges. Then when choosing between philosophical theory X, which says apples are better than oranges, and theory Y, which says the opposite, you’ll use your pre-theoretic intuition as a tiebreaker. And I think when humans do moral philosophy, they often do exactly that: they fall back on pre-theoretic intuitions to check what’s palatable and what isn’t. It’s a tree with many choices, and even big questions like consequentialism vs deontology vs virtue ethics may ultimately depend on many such case-by-case intuitions, not just pure philosophical reasoning.
Third, I think morality is part of culture. It didn’t come from the nature of an individual person: kids are often cruel. It came from constraints that people put on each other, and cultural generalization of these constraints. “Don’t kill.” When someone gets powerful enough to ignore these constraints, the default outcome we should expect is amorality. “Power corrupts.” Though of course there can be exceptions.
Fourth—and this is the payoff—I think the only good outcome is if the first smarter than human AIs start out with “good” culture, derived from what human societies think is good. Not aligned to an individual human operator, and certainly not to money and power. Then AIs can take it from there and we’ll be ok. But I don’t know how to achieve that. It might require human organizational forms that are not money- or power-seeking. I wrote a question about it some time ago, but didn’t get any answers.
First, I think there’s enough overlap between different reasoning skills that we should expect a smarter than human AI to be really good at most such skills, including philosophy. So this part is ok.
Supposing this is true, how would you elicit this capability? In other words, how would you train the AI (e.g., what reward signal would you use) to tell humans when they (the humans) are making philosophical mistakes, and to present humans with only true philosophical arguments/explanations? (As opposed to presenting the most convincing arguments, which may exploit flaws in human psychology or reasoning, or telling the humans what they most want to hear or what’s most likely to get a thumbs up or high rating.)
Fourth—and this is the payoff—I think the only good outcome is if the first smarter than human AIs start out with “good” culture, derived from what human societies think is good.
“What human societies think is good” is filled with pretty crazy stuff, like wokeness imposing its skewed moral priorities and empirical beliefs on everyone via “cancel culture”, and religions condemning “sinners” and nonbelievers to eternal torture. Morality is Scary talks about why this is generally the case, and why we shouldn’t expect “what human societies think is good” to actually be good.
Also, wouldn’t “power corrupts” apply to humanity as a whole if we manage to solve technical alignment and not align ASI to the current “power and money”? Won’t humanity be the “power and money” post-Singularity, e.g., each human or group of humans will have enough resources to create countless minds and simulations to lord over?
I’m hoping that both problems (“morality is scary” and “power corrupts”) are philosophical errors that have technical solutions in AI design (i.e., AIs can be designed to help humans avoid/fix these errors), but this is highly neglected and seems unlikely to happen by default.
I’m not very confident, but will try to explain where the intuition comes from.
Basically I think the idea of “good” might be completely cultural. As in, if you extrapolate what an individual wants, that’s basically a world optimized for that individual’s selfishness; then there is what groups can agree on by rational negotiation, which is a kind of group selfishness, cutting out everyone who’s weak enough (so for example factory farming would be ok because animals can’t fight back); and on top of that there is the abstract idea of “good”, saying you shouldn’t hurt the weak at all. And that idea is not necessitated by rational negotiation. It’s just a cultural artifact that we ended up with, I’m not sure how.
So if you ask AI to optimize for what individuals want, and go through negotiations and such, there seems a high chance that the resulting world won’t contain “good” at all, only what I called group selfishness. Even if we start with individuals who strongly believe in the cultural idea of good, they can still get corrupted by power. The only way to get “good” is to point AI at the cultural idea to begin with.
You are of course right that culture also contains a lot of nasty stuff. The only way to get something good out of it is with a bunch of extrapolation, philosophy, and yeah I don’t know what else. It’s not reliable. But the starting materials for “good” are contained only there. Hope that makes sense.
Also to your other question: how to train philosophical ability? I think yeah, there isn’t any reliable reward signal, just as there wasn’t for us. The way our philosophical ability seems to work is by learning heuristics and ways of reasoning from fields where verification is possible (like math, or everyday common sense) and applying them to philosophy. And it’s very unreliable of course. So for AIs maybe this kind of carry-over to philosophy is also the best we can hope for.
Thanks for this explanation, it definitely makes your position more understandable.
and on top of that there is the abstract idea of “good”, saying you shouldn’t hurt the weak at all. And that idea is not necessitated by rational negotiation. It’s just a cultural artifact that we ended up with, I’m not sure how.
I can think of 2 ways:
1. It ended up there the same way that all the “nasty stuff” ended up in our culture, more or less randomly, e.g. through the kind of “morality as status game” talked about in Will Storr’s book, which I quote in Morality is Scary.
2. It ended up there via philosophical progress, because it’s actually correct in some sense.
If it’s 1, then I’m not sure why extrapolation and philosophy will pick out the “good” and leave the “nasty stuff”. It’s not clear to me why aligning to culture would be better than aligning to individuals in that case.
If it’s 2, then we don’t need to align with culture either—AIs aligned with individuals can rederive the “good” with competent philosophy.
Does this make sense?
So for AIs maybe this kind of carry-over to philosophy is also the best we can hope for.
It seems clear that technical design or training choices can make a difference (but nobody is working on this). Consider the analogy with the US vs Chinese education system, where the US system seems to produce a lot more competence and/or interest in philosophy (relative to STEM) compared to the Chinese system. And comparing humans with LLMs, it sure seems like they’re on track to exceed (top) human level in STEM while being significantly less competent in philosophy.
As in, if you extrapolate what an individual wants, that’s basically a world optimized for that individual’s selfishness; then there is what groups can agree on by rational negotiation, which is a kind of group selfishness, cutting out everyone who’s weak enough
I think it’s important to frame values around scopes of optimization, not just coalitions of actors. An individual then wants first of all their own life (rather than the world) optimized for that individual’s preferences. If they don’t live alone, their home might have multiple stakeholders, and so their home would be subject to group optimization, and so on.
At each step, optimization is primarily about the shared scope, and excludes most details of the smaller scopes under narrower control enclosed within. Culture and “good” would then have a lot to say about the negotiations on how group optimization takes place, but also about how the smaller enclosed scopes within the group’s purview are to be relatively left alone to their own optimization, under different preferences of corresponding smaller groups or individuals.
It may be good to not cut out everyone who’s too weak to prevent that, as the cultural content defining the rules for doing so is also a preference that wants to preserve itself, whatever its origin (such as being culturally developed later than evolution-given psychological drives). And individuals are, in particular, carriers of culture that’s only relevant for group optimization, so group optimization culture would coordinate them into agreement on some things. I think selfishness is salient as a distinct thing only because the cultural content that concerns group optimization needs actual groups to get activated in practice, and without that activation, applying selfishness way out of its scope is about as appropriate as stirring soup with a microscope.
I think this is not obviously qualitatively different from technical oopsie, and sufficiently strong technical success should be able to prevent this. But that’s partially because I think “money and power” is effectively an older, slower AI made of allocation over other minds, and both kinds of AI need to be strongly aligned to the flourishing of humans. Fortunately, humans with money and power generally want to use it to have nice lives, so on an individual level there should be incentive compatibility if we can find a solution which is general between them. I’m slightly hopeful Richard Ngo’s work might weigh on this, for example.
Do you have any advice for people financially exposed to capabilities progress on how not to do dumb stuff, not be targeted by political pressure, etc.?
What exactly I would advise doing depends on the scale of the money. I am assuming we are talking here about a few million dollars of exposure, not $50M+:
Diversify enough away from AI that you really genuinely know you will be personally fine even if all the AI stuff goes to zero (e.g. probably something like $2M-$3M; a toy version of this check is sketched after this list)
Cultivate at least a few people you talk to about big career decisions who seem multiple steps removed from similarly strong incentives
Make public statements to the effect of being opposed to AI advancing rapidly. This has a few positive effects, I think:
It makes it easier for you to talk about this later when you might end up in a more pressured position (e.g. when you end up in a position to take actions that more seriously affect overall AI progress, such as work on regulation)
It reduces the degree that you end up in relationships that seem based on false premises because e.g. people assumed you would be in favor of this given your exposure (if you e.g. hold substantial stock in a company)
(To be clear, holding public positions like this isn’t everyone’s jam, and many people prefer holding no positions strongly in public)
See whether you can use your wealth to set up incentives for people to argue with you, or to observe people arguing about issues you care about. I like a lot about the way the S-Process is structured here.
It’s easy to do this in a way that ends up pretty sycophantic. I think Jaan’s stuff has generally not felt very sycophantic, in part for process reasons, and in part because he has selected for non-sycophancy.
I haven’t thought that hard about it, but I wonder whether you could also get some exposure to worlds where AI progress gets relatively suddenly halted as a result of regulation or other forms of public pressure. I can’t immediately think of a great trade on this as trading on events like this is often surprisingly hard to do well, but I can imagine there being something good here.
Related to the second bullet point, a thing a few of my friends do is to have semi-regular “career panels” where they meet with people they trust and who seem to them like very independent thinkers to talk about their career and discuss high-level concerns about whether what they are doing might turn out badly for the world (as well as other failure modes). This seems pretty good to me, just as a basic social institution.
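To make the first bullet’s “diversify enough” check concrete, here is a minimal toy sketch; the floor and holdings below are made-up placeholders, not advice about actual numbers:

```python
# Minimal sketch of the diversification check in the first bullet: do you
# still clear your personal "I'll be fine" floor if every AI-correlated
# asset goes to zero? The floor and holdings are hypothetical placeholders.

def fine_if_ai_goes_to_zero(holdings: dict[str, float],
                            ai_correlated: set[str],
                            personal_floor: float = 2.5e6) -> bool:
    non_ai_wealth = sum(v for k, v in holdings.items() if k not in ai_correlated)
    return non_ai_wealth >= personal_floor

# Hypothetical portfolio: $4M of AI-lab equity, $2.7M elsewhere.
holdings = {"ai_lab_equity": 4.0e6, "index_funds": 1.8e6, "bonds": 0.9e6}
print(fine_if_ai_goes_to_zero(holdings, {"ai_lab_equity"}))  # True (2.7e6 >= 2.5e6)
```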
Maybe it’s worth mentioning here that Carl’s p(doom) is only ~20%[1], compared to my ~80%. (I can’t find a figure for @cousin_it, but I’m guessing it’s closer to mine than Carl’s. BTW every AI I asked started hallucinating or found a quote online from someone else and told me he said it.) It seems intuitive to me that at a higher p(doom), the voices in one’s moral parliament saying “let’s not touch this” become a lot louder.
“Depending on the day I might say one in four or one in five that we get an AI takeover that seizes control of the future, makes a much worse world than we otherwise would have had and with a big chance that we’re all killed in the process.”
I notice that this is only talking about “AI takeover” whereas my “doom” includes a bunch of other scenarios, but if Carl is significantly worried about other scenarios, he perhaps would have given a higher overall p(doom) in this interview or elsewhere.
This BOTEC attitude makes sense when you view the creation of AI technology and AI safety as a pure result of capital investment. The economic view of AI abstracts the development process into a black box which takes in investments as input and produces hardware/software out the other end. However, AI is still currently mostly driven by people. A large part of enabling AI development comes from a culture around AI, including hype, common knowledge, and the social permissibility of pursuing AI development as a startup/career pathway.
In that regard, “AI safety people” starting AI companies, writing hype pieces that encourage natsec-coded AI races, and investing in AI tech contribute far more than mere dollars. This creates a situation where AI safety as a movement becomes hopelessly confused about what someone “concerned about AI safety” should do with their life and career. The result is that someone “concerned about AI safety” can find groups and justifications for everything from protesting outside OpenAI to working as the CEO of OpenAI. I think this is intrinsically linked to the fundamental confusion behind the origins of the movement.
In short, whatever material and economic leverage investment plays produce may not be worth the dilution of the ideas and culture of AI safety as a whole. Is AI safety just going to become the next “ESG”, a thin flag of respectability draped over capabilities companies/racing companies?
I feel like this overestimates how good it was, because post-hoc investments are so easy compared to forward-looking ones? My guess is that there was a market failure because investors were not informed enough about AI, but the market failure was smaller than 40x in 4 years. Even given AGI views, it’s hard to know where to invest. I have heard stories of AGI-bullish people making terrible predictions about which publicly traded companies would have the most growth over the last 4 years.
I don’t have a strong take on what the reasonable expected gains would have been; they could have been high enough that the argument still mostly works.
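For scale, a hypothetical 40x over 4 years implies roughly a 2.5x multiple per year; a quick arithmetic sketch:

```python
# Rough annualization of the hypothetical "40x in 4 years" figure above;
# just arithmetic, not a claim about actual realized returns.
total_multiple, years = 40, 4
annualized = total_multiple ** (1 / years)
print(f"{annualized:.2f}x per year, i.e. roughly {annualized - 1:.0%} annualized")  # ~2.51x, ~151%
```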
I agree mostly, but I would characterize it as a small portion of rationalists accepting the offers, with part of Moloch’s prize being disproportionate amplification of their voices.