This is Dr. Andrew Critch’s professional LessWrong account. Andrew is the CEO of Encultured AI, and works for ~1 day/week as a Research Scientist at the Center for Human-Compatible AI (CHAI) at UC Berkeley. He also spends around a ½ day per week volunteering for other projects like the Berkeley Existential Risk Initiative and the Survival and Flourishing Fund. Andrew earned his Ph.D. in mathematics at UC Berkeley studying applications of algebraic geometry to machine learning models. During that time, he cofounded the Center for Applied Rationality and SPARC. Dr. Critch has been offered university faculty and research positions in mathematics, mathematical biosciences, and philosophy, worked as an algorithmic stock trader at Jane Street Capital’s New York City office, and as a Research Fellow at the Machine Intelligence Research Institute. His current research interests include logical uncertainty, open source game theory, and mitigating race dynamics between companies and nations in AI development.
Andrew_Critch (Andrew Critch)
Some AI research areas and their relevance to existential safety
This post reminds me of thinking from the 1950s, when people taking inspiration from Wiener’s work on cybernetics tried to operationalize “purposeful behavior” in terms of robust convergence to a goal state:
> When an optimizing system deviates beyond its own rim, we say that it dies. An existential catastrophe is when the optimizing system of life on Earth moves beyond its own outer rim.
I appreciate the direct attention to this process as an important instance of optimization. The first talk I ever gave in the EECS department at UC Berkeley (to the full EECS faculty) included a diagram of Earth drifting out of the region of phase space where humans would exist. Needless to say, I’d like to see more explicit consideration of this type of scenario.
What Multipolar Failure Looks Like, and Robust Agent-Agnostic Processes (RAAPs)
It sounds like you may be assuming that people will roll out a technology when its reliability meets a certain level X, so that raising the reliability of AI systems has little or no effect on the reliability of deployed systems (namely, it will just be X).
Yes, this is more or less my assumption. I think slower progress on OODR will delay release dates of transformative tech much more than it will improve quality/safety on the eventual date of release.
A more plausible model is that deployment decisions will be based on many axes of quality, e.g. suppose you deploy when the sum of reliability and speed reaches some threshold Y. If that’s the case, then raising reliability will improve the reliability and decrease the speed of deployed systems. If you think that increasing the reliability of AI systems is good (e.g. because AI developers want their AI systems to have various socially desirable properties and are limited by their ability to robustly achieve those properties) then this would be good.
I’m not clear on what part of that picture you disagree with or if you think that this is just small relative to some other risks.
Thanks for asking; I do disagree with this! I think reliability is a strongly dominant factor in decisions to deploy real-world technology, such that to me it feels roughly-correct to treat it as the only factor. In this way of thinking, which you rightly attribute to me, progress in OODR doesn’t improve reliability on deployment-day, it mostly just moves deployment-day a bit earlier in time.
That’s not to say I’m advocating being afraid of OODR research because it “shortens timelines”, only that I think contributions to OODR are not particularly directly valuable to humanity’s long-term fate. As the post emphasizes, if someone cares about existential safety and wants to deploy their professional ambition to reducing x-risk, I think OODR is of high educational value for them to learn about, and as such I would be against “censoring” it as a topic to be discussed here.
> Third, unless humanity collectively works very hard to maintain a degree of simplicity and legibility in the overall structure of society*, this “alignment revolution” will greatly complexify our environment to a point of much greater incomprehensibility and illegibility than even today’s world. This, in turn, will impoverish humanity’s collective ability to keep abreast of important international developments, as well as our ability to hold the international economy accountable for maintaining our happiness and existence.
One approach to this problem is to work to make it more likely that AI systems can adequately represent human interests in understanding and intervening on the structure of society. But this seems to be a single/single alignment problem (to whatever extent that existing humans currently try to maintain and influence our social structure, such that impairing their ability to do so is problematic at all) which you aren’t excited about.
Yes, you’ve correctly anticipated my view on this. Thanks for the very thoughtful reading!
To elaborate: I claim “turning up the volume” on everyone’s individual agency (by augmenting them with user-aligned systems) does not automatically make society overall healthier and better able to survive, and in fact it might just hasten progress toward an unhealthy or destructive outcome. To me, the way to avoid this is not to make the aligned systems even more aligned with their users, but to start “aligning” them with the rest of society. “Aligning” with society doesn’t just mean “serving” society, it means “fitting into it”, which means the AI system needs to have a particular structure (not just a particular optimization objective) that makes it able to exist and function safely inside a larger society. The desired structure involves features like being transparent, legibly beneficial, and legibly fair. Without those aspects, I think your AI system introduces a bunch of political instability and competitive pressure into the world (e.g., fighting over disagreements about what it’s doing or whether it’s fair or whether it will be good), which I think by default turns up the knob on x-risk rather than turning it down. For a few stories somewhat-resembling this claim, see my next post:
Of course, if you make a super-aligned self-modifying AI, it might immediately self-modify so that its structure is more legibly beneficial and fair, because of the necessity (if I’m correct) of having that structure for benefitting society and therefore its creators/users. However, my preferred approach to building societally-compatible AI is not to make societally-incompatible AI systems and hope that they know their users “want” them to transform into more societally-compatible systems. I think we should build highly societally-compatible systems to begin with, not just because it seems broadly “healthier”, but because I think it’s necessary for getting existential risk down to tolerable levels like <3% or <1%. Moreover, because this view seems misunderstood by x-safety enthusiasts, I currently put the plurality of my existential-failure probability on outcomes arising from problems other than individual systems being misaligned (in terms of the objective) with their users or creators. Dafoe et al. would call this “structural risk”, which I find to be a helpful framing that should be applied not only to the structure of society external to the AI system, but also to the system’s internal structure.
My actual thought process for believing GDPR is good is not that it “is a sample from the empirical distribution of governance demands”, but that it initializes the process of governments (and thereby the public they represent) weighing in on what tech companies can and cannot design their systems to reason about, and more specifically the degree to which systems are allowed to reason about humans. Having a regulatory structure in place for restricting access to human data is a good first step, but we’ll probably also eventually want restrictions on how the systems process the data once they have it (e.g., they probably shouldn’t be allowed to use what data they have to come up with ways to significantly deceive or manipulate users).
I’ll say the same thing about fairness, in that I value having initialized the process of thinking about it not because it is in the “empirical distribution of governance demands”, but because it’s a useful governance demand. When things are more fair, people fight less, which is better/safer. I don’t mind much that existing fairness research hasn’t converged on what I consider “optimal fairness”, because I think that consideration is dwarfed by the fact that technical AI researchers are thinking about fairness at all.
That said, while I disagree with your analysis, I do agree with your final position:
I hope that technical AI x-risk/existential safety researchers focus on legitimizing and fulfilling those governance and accountability demands that are in fact legitimate.
I hope that discussion of AI governance and accountability does not inhabit a frame in which demands for governance and accountability are reliably legitimate.
The OP’s conclusion seems to be that social AI alignment should be the main focus. Personally, I’m less convinced. It would be interesting to see more detailed arguments about the above parameters that support or refute this thesis.
Thanks for the feedback, Vanessa. I’ve just written a follow-up post to better illustrate a class of societal-scale failure modes (“unsafe robust agent-agnostic processes”) that constitutes the majority of the probability mass I currently place on human extinction precipitated by transformative AI advancements (especially AGI, and/or high-level machine intelligence in the language of Grace et al.). Here it is:
I’d be curious to see if it convinces you that what you call “social alignment” should be our main focus, or at least a much greater focus than currently.
Good to hear!
If I read that term [“AI existential safety”] without a definition I would assume it meant “reducing the existential risk posed by AI.” Hopefully you’d be OK with that reading. I’m not sure if you are trying to subtly distinguish it from Nick’s definition of existential risk or if the definition you give is just intended to be somewhere in that space of what people mean when they say “existential risk” (e.g. the LW definition is like yours).
Yep, that’s my intention. If given the chance I’d also shift the meaning of “existential risk” a bit away from Bostrom’s and a bit toward a more naive meaning of the term, but that’s a separate objective :) Specifically, if I got to rewrite Nick’s terminology (which might be too late now that it’s on Wikipedia), I’d say “existential risk” should mean “risk to the existence of humanity” and “existential-level risk” should mean “risks that are as morally significant as risks to the existence of humanity” (which, roughly speaking, is what Bostrom currently calls “existential risk”).
I hadn’t read it (nor almost any science fiction books/stories) but yes, you’re right! I’ve now added a callback to Autofac after the “factorial DAO” story. Thanks.
These management assistants, DAOs etc are not aligned to the goals of their respective, individual users/owners.
How are you inferring this? From the fact that a negative outcome eventually obtained? Or from particular misaligned decisions each system made? It would be helpful if you could point to a particular single-agent decision in one of the stories that you view as evidence of that single agent being highly misaligned with its user or creator. I can then reply with how I envision that decision being made even with high single-agent alignment.
Maybe several AI systems aligned to different users with different interests can interact in a Pareto inefficient way (a tragedy of the commons among the AIs), and maybe this can be prevented by designing the AIs in particular ways.
Yes, this^.
> The objective of each company in the production web could loosely be described as “maximizing production″ within its industry sector.
Why does any company have this goal, or even roughly this goal, if they are aligned with their shareholders?
It seems to me you are using the word “alignment” as a boolean, whereas I’m using it to refer to either a scalar (“how aligned is the system?”) or a process (“the system has been aligned, i.e., has undergone a process of increasing its alignment”). I prefer the scalar/process usage, because it seems to me that people who do alignment research (including yourself) are going to produce ways of increasing the “alignment scalar”, rather than ways of guaranteeing the “perfect alignment” boolean. (I sometimes use “misaligned” as a boolean due to it being easier for people to agree on what is “misaligned” than what is “aligned”.) In general, I think it’s very unsafe to pretend numbers that are very close to 1 are exactly 1, because e.g., 1^(10^6) = 1 whereas 0.9999^(10^6) very much isn’t 1, and the way you use the word “aligned” seems unsafe to me in this way.
(Perhaps you believe in some kind of basin of convergence around perfect alignment that causes sufficiently-well-aligned systems to converge on perfect alignment, in which case it might make sense to use “aligned” to mean “inside the convergence basin of perfect alignment”. However, I’m both dubious of the width of that basin, and dubious that its definition is adequately social-context-independent [e.g., independent of the bargaining stances of other stakeholders], so I’m back to not really believing in a useful Boolean notion of alignment, only scalar alignment.)
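The compounding arithmetic behind treating “very close to 1” as 1 is easy to check directly. This is a minimal illustration (the exponent 10^6, standing in for a large number of compounded decisions, is just the figure used in the comment above):

```python
# An alignment "scalar" of exactly 1 compounds to 1 over many decisions,
# while a value only slightly below 1 decays to effectively 0.
n_decisions = 10**6

perfect = 1.0 ** n_decisions        # stays exactly 1.0
near_perfect = 0.9999 ** n_decisions  # ~ e^(-100), vanishingly small

print(perfect)       # 1.0
print(near_perfect)  # on the order of 1e-44
```

The point of the sketch is only that per-decision alignment losses compound multiplicatively, so rounding 0.9999 up to 1 hides an enormous long-run difference.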
In any case, I agree profit maximization is not a perfectly aligned goal for a company; however, it is a myopically pursued goal in a tragedy of the commons resulting from a failure to agree (as you point out) on something better to do (e.g., reducing competitive pressures to maximize profits).
I guess this is probably just a gloss you are putting on the combined behavior of multiple systems, but you kind of take it for given rather than highlighting it as a serious bargaining failure amongst the machines, and more importantly you don’t really say how or why this would happen.
I agree that it is a bargaining failure if everyone ends up participating in a system that everyone thinks is bad; I thought that would be an obvious reading of the stories, but apparently it wasn’t! Sorry about that. I meant to indicate this with the pointers to Dafoe’s work on “Cooperative AI” and Scott Alexander’s “Moloch” concept, but looking back it would have been a lot clearer for me to just write “bargaining failure” or “bargaining non-starter” at more points in the story.
The implicit argument seems to apply just as well to humans trading with each other and I’m not sure why the story is different if we replace the humans with aligned AI. [...] Maybe you think we are already losing sight of our basic goals and collectively pursuing alien goals
Yes, you understand me here. I’m not (yet?) in the camp that we humans have “mostly” lost sight of our basic goals, but I do feel we are on a slippery slope in that regard. Certainly many people feel “used” by employers/institutions in ways that are disconnected from their values. People with more job options feel this way less, because they choose jobs that don’t feel like that, but I think we are a minority in having that choice.
> However, their true objectives are actually large and opaque networks of parameters that were tuned and trained to yield productive business practices during the early days of the management assistant software boom.
This sounds like directly saying that firms are misaligned.
I would have said “imperfectly aligned”, but I’m happy to conform to “misaligned” for this.
I agree that competitive pressures to produce imply that firms do a lot of producing and saving, just as it implies that humans do a lot of producing and saving.
Good, it seems we are synced on that.
And in the limit you can basically predict what all the machines do, namely maximally efficient investment.
Yes, it seems we are synced on this as well. Personally, I find this limit to be a major departure from human values, and in particular, it is not consistent with human existence.
But that doesn’t say anything about what the society does with the ultimate proceeds from that investment.
The attractor I’m pointing at with the Production Web is that entities with no plan for what to do with resources—other than “acquire more resources”—have a tendency to win out competitively over entities with non-instrumental terminal values like “humans having good relationships with their children”. I agree it will be a collective bargaining failure on the part of humanity if we fail to stop our own replacement by “maximally efficient investment” machines with no plans for what to do with their investments other than more investment. I think the difference between mine and your views here is that I think we are on track to collectively fail in that bargaining problem absent significant and novel progress on “AI bargaining” (which involves a lot of fairness/transparency) and the like, whereas I guess you think we are on track to succeed?
You might say: investment has to converge to 100% since people with lower levels of investment get outcompeted.
Yep!
But this it seems like the actual efficiency loss required to preserve human values seems very small even over cosmological time (e.g. see Carl on exactly this question).
I agree, but I don’t think this means we are on track to keeping the humans, and if we are on track, in my opinion it will be mostly because of (say, using Shapley value to define “mostly because of”) technical progress on bargaining/cooperation/governance solutions rather than alignment solutions.
And more pragmatically, such competition most obviously causes harm either via a space race and insecure property rights,
I agree; competition causing harm is key to my vision of how things will go, so this doesn’t read to me as a counterpoint; I’m not sure if it was intended as one though?
or war between blocs with higher and lower savings rates
+1 to this as a concern; I didn’t realize other people were thinking about this, so good to know.
(some of them too low to support human life, which even if you don’t buy Carl’s argument is really still quite low, conferring a tiny advantage)
I think I disagree with you on the tininess of the advantage conferred by ignoring human values early on during a multi-polar take-off. I agree the long-run cost of supporting humans is tiny, but I’m trying to highlight a dynamic where fairly myopic/nihilistic power-maximizing entities end up quickly out-competing entities with other values, due to, as you say, bargaining failure on the part of the creators of the power-maximizing entities.
Why wouldn’t an aligned CEO sit down with the board to discuss the situation openly with them?
In the failure scenario as I envision it, the board will have already granted permission to the automated CEO to act much more quickly in order to remain competitive, such that the AutoCEO isn’t checking in with the Board enough to have these conversations. The AutoCEO is highly aligned with the Board in that it is following their instruction to go much faster, but in doing so it makes a larger number of tradeoffs that the Board wishes they didn’t have to make. The pressure to do this results from a bargaining failure between the Board and other Boards who are doing the same thing and wishing everyone would slow down and do things more carefully and with more coordination/bargaining/agreement.
Can you explain the decisions an individual aligned CEO makes as its company stops benefiting humanity? I can think of a few options:
Actually the CEOs aren’t aligned at this point. They were aligned but then aligned CEOs ultimately delegated to unaligned CEOs. But then I agree with Vanessa’s comment.
The CEOs want to benefit humanity, but if they do things that benefit humanity they will be outcompeted, so they need to mostly invest in remaining competitive, and accept smaller and smaller benefits to humanity. But in that case can you describe what tradeoff concretely they are making, and in particular why they can’t continue to take more or less the same actions to accumulate resources while remaining responsive to shareholder desires about how to use those resources?
Yes, it seems this is a good thing to hone in on. As I envision the scenario, the automated CEO is highly aligned to the point of keeping the Board locally happy with its decisions conditional on the competitive environment, but not perfectly aligned, and not automatically successful at bargaining with other companies as a result of its high alignment. (I’m not sure whether to say “aligned” or “misaligned” in your boolean-alignment-parlance.) At first the auto-CEO and the Board are having “alignment check-ins” where the auto-CEO meets with the Board and they give it input to keep it (even) more aligned than it would be without the check-ins. But eventually the Board realizes this “slow and bureaucratic check-in process” is making their company sluggish and uncompetitive, so they instruct the auto-CEO more and more to act without alignment check-ins. The auto-CEO might warn them that this will decrease its overall level of per-decision alignment with them, but they say “Do it anyway; done is better than perfect” or something along those lines. All Boards wish other Boards would stop doing this, but neither they nor their CEOs manage to strike up a bargain with the rest of the world to stop it. This concession by the Board—a result of failed or non-existent bargaining with other Boards [see: antitrust law]—makes the whole company less aligned with human values.
The win scenario is, of course, a bargain to stop that! Which is why I think research and discourse regarding how the bargaining will work is very high value on the margin. In other words, my position is that the best way for a marginal deep-thinking researcher to reduce the risks of these tradeoffs is not to add another brain to the task of making it easier/cheaper/faster to do alignment (which I admit would make the trade-off less tempting for the companies), but to add such a researcher to the problem of solving the bargaining/cooperation/mutual-governance problem that AI-enhanced companies (and/or countries) will be facing.
If trillion-dollar tech companies stop trying to make their systems do what they want, I will update that marginal deep-thinking researchers should allocate themselves to making alignment (the scalar!) cheaper/easier/better instead of making bargaining/cooperation/mutual-governance cheaper/easier/better. I just don’t see that happening given the structure of today’s global economy and tech industry.
Somehow the machine interests (e.g. building new factories, supplying electricity, etc.) are still being served. If the individual machines are aligned, and food/oxygen/etc. are in desperately short supply, then you might think an aligned AI would put the same effort into securing resources critical to human survival. Can you explain concretely what it looks like when that fails?
Yes, thanks for the question. I’m going to read your usage of “aligned” to mean “perfectly-or-extremely-well aligned with humans”. In my model, by this point in the story, there has been a gradual decrease in the scalar level of alignment of the machines with human values, due to bargaining successes on simpler objectives (e.g., «maximizing production») and bargaining failures on more complex objectives (e.g., «safeguarding human values») or objectives that trade off against production (e.g., «ensuring humans exist»). Each individual principal (e.g., Board of Directors) endorsed the gradual slipping-away of alignment-scalar (or failure to improve alignment-scalar), but wished everyone else would stop allowing the slippage.
I don’t understand the claim that the scenarios presented here prove the need for some new kind of technical AI alignment research.
I don’t mean to say this post warrants a new kind of AI alignment research, and I don’t think I said that, but perhaps I’m missing some kind of subtext I’m inadvertently sending?
I would say this post warrants research on multi-agent RL and/or AI social choice and/or fairness and/or transparency, none of which are “new kinds” of research (I promoted them heavily in my preceding post), and none of which I would call “alignment research” (though I’ll respect your decision to call all these topics “alignment” if you consider them that).
I would say, and I did say:
directing more x-risk-oriented AI research attention toward understanding RAAPs and how to make them safe to humanity seems prudent and perhaps necessary to ensure the existential safety of AI technology. Since researchers in multi-agent systems and multi-agent RL already think about RAAPs implicitly, these areas present a promising space for x-risk oriented AI researchers to begin thinking about and learning from.
I do hope that the RAAP concept can serve as a handle for noticing structure in multi-agent systems, but again I don’t consider this a “new kind of research”, only an important/necessary/neglected kind of research for the purposes of existential safety. Apologies if I seemed more revolutionary than intended. Perhaps it’s uncommon to take a strong position of the form “X is necessary/important/neglected for human survival” without also saying “X is a fundamentally new type of thinking that no one has done before”, but that is indeed my stance for X ∈ {a variety of non-alignment AI research areas}.
Paul, thanks for writing this; it’s very much in line with the kind of future I’m most worried about.
For me, it would be super helpful if you could pepper throughout the story mentions of the term “outer alignment” indicating which events-in-particular you consider outer alignment failures. Is there any chance you could edit it to add in such mentions? E.g., I currently can’t tell if by “outer alignment failure” you’re referring to the entire ecosystem of machines being outer-misaligned, or just each individual machine (and if so, which ones in particular), and I’d like to sync with your usage of the concept if possible (or at least know how to sync with it).
(I called the story an “outer” misalignment story because it focuses on the—somewhat improbable—case in which the intentions of the machines are all natural generalizations of their training objectives. I don’t have a precise definition of inner or outer alignment and think they are even less well defined than intent alignment in general, but sometimes the meaning seems unambiguous and it seemed worth flagging specifically because I consider that one of the least realistic parts of this story.)
Thanks; this was somewhat helpful to my understanding, because as I said,
> I currently can’t tell if by “outer alignment failure” you’re referring to the entire ecosystem of machines being outer-misaligned, or just each individual machine (and if so, which ones in particular), and I’d like to sync with your usage of the concept if possible (or at least know how to sync with it).
I realize you don’t have a precise meaning of outer misalignment in mind, but in my opinion, confusion around this concept is central to the (in my opinion) confused expectation that “alignment solutions” are adequate (on the technological side) for averting AI x-risk.
My question: Are you up for making your thinking and/or explaining about outer misalignment a bit more narratively precise here? E.g., could you say something like “«machine X» in the story is outer-misaligned because «reason»”?
Why I’m asking: My suspicion is that you answering this will help me pin down one of several possible substantive assumptions you and many other alignment-enthusiasts are making about the goals of AI designers operating in a multi-agent system or multi-polar singularity. Indeed, the definition of outer alignment currently endorsed by this forum is:
Outer Alignment in the context of machine learning is the property where the specified loss function is aligned with the intended goal of its designers. This is an intuitive notion, in part because human intentions are themselves not well-understood.
It’s conceivable to me that making future narratives much more specific regarding the intended goals of AI designers—and how they are or are not being violated—will either (a) clarify the problems I see with anticipating “alignment” solutions to be technically-adequate for existential safety, or (b) rescue the “alignment” concept with a clearer definition of outer alignment that makes sense in multi-agent systems.
So: thanks if you’ll consider my question!
Carl, thanks for this clear statement of your beliefs. It sounds like you’re saying (among other things) that American and Chinese cultures will not engage in a “race-to-the-bottom” in terms of how much they displace human control over the AI technologies their companies develop. Is that right? If so, could you give me a % confidence on that position somehow? And if not, could you clarify?
To reciprocate: I currently assign a ≥10% chance of a race-to-the-bottom on AI control/security/safety between two or more cultures this century, i.e., I’d bid 10% to buy in a prediction market on this claim if it were settlable. In more detail, I assign a ≥10% chance to a scenario where two or more cultures each progressively diminish the degree of control they exercise over their tech, and the safety of the economic activities of that tech to human existence, until an involuntary human extinction event. (By comparison, I assign at most around a ~3% chance of a unipolar “world takeover” event, i.e., I’d sell at 3%.)
I should add that my numbers for both of those outcomes are down significantly from ~3 years ago due to cultural progress in CS/AI (see this ACM blog post) allowing more discussion of (and hence preparation for) negative outcomes, and government pressures to regulate the tech industry.
My best understanding of your position is: “Sure, but they will be trying really hard. So additional researchers working on the problem won’t much change their probability of success, and you should instead work on more-neglected problems.”
That is not my position if “you” in the story is “you, Paul Christiano” :) The closest position I have to that one is: “If another Paul comes along who cares about x-risk, they’ll have more positive impact by focusing on multi-agent and multi-stakeholder issues or ‘ethics’ with AI tech than if they focus on intent alignment, because multi-agent and multi-stakeholder dynamics will greatly affect what strategies AI stakeholders ‘want’ their AI systems to pursue.”
If they tried to get you to quit working on alignment, I’d say “No, the tech companies still need people working on alignment for them, and Paul is/was one of those people. I don’t endorse converting existing alignment researchers to working on multi/multi delegation theory (unless they’re naturally interested in it), but if a marginal AI-capabilities-bound researcher comes along, I endorse getting them set up to think about multi/multi delegation more than alignment.”
I think that the biggest difference between us is that I think that working on single-single alignment is the easiest way to make headway on that issue, whereas you expect greater improvements from some categories of technical work on coordination
Yes.
(my sense is that I’m quite skeptical about most of the particular kinds of work you advocate
That is also my sense, and a major reason I suspect multi/multi delegation dynamics will remain neglected among x-risk oriented researchers for the next 3-5 years at least.
If you disagree, then I expect the main disagreement is about those other sources of overhead
Yes, I think coordination costs will by default pose a high overhead cost to preserving human values among systems with the potential to race to the bottom on how much they preserve human values.
> I think I disagree with you on the tininess of the advantage conferred by ignoring human values early on during a multi-polar take-off. I agree the long-run cost of supporting humans is tiny, but I’m trying to highlight a dynamic where fairly myopic/nihilistic power-maximizing entities end up quickly out-competing entities with other values, due to, as you say, bargaining failure on the part of the creators of the power-maximizing entities.
> Could you explain the advantage you are imagining?
Yes. Imagine two competing cultures A and B have transformative AI tech. Both are aiming to preserve human values, but within A, a subculture A’ develops to favor more efficient business practices (nihilistic power-maximizing) over preserving human values. The shift is by design subtle enough not to trigger leaders of A and B to have a bargaining meeting to regulate against A’ (contrary to Carl’s narrative where leaders coordinate against loss of control). Subculture A’ comes to dominate discourse and cultural narratives in A, and makes A faster/more productive than B, such as through the development of fully automated companies as in one of the Production Web stories. The resulting advantage of A is enough for A to begin dominating or at least threatening B geopolitically, but by that time leaders in A have little power to squash A’, so instead B follows suit by allowing a highly automation-oriented subculture B’ to develop. These advantages are small enough not to trigger regulatory oversight, but when integrated over time they are not “tiny”. This results in the gradual empowerment of humans who are misaligned with preserving human existence, until those humans also lose control of their own existence, perhaps willfully, perhaps carelessly, or through a mix of both.
Here, the members of subculture A’ are misaligned with preserving the existence of humanity, but their tech is aligned with them.
> The previous story tends to frame this more as a failure of humanity’s coordination, while this one frames it (in the title) as a failure of intent alignment. It seems like both of these aspects greatly increase the plausibility of the story, or in other words, if we eliminated or made significantly less bad either of the two failures, then the story would no longer seem very plausible.
Yes, I agree with this.
> A natural next question is then which of the two failures would be best to intervene on, that is, is it more useful to work on intent alignment, or working on coordination? I’ll note that my best guess is that for any given person, this effect is minor relative to “which of the two topics is the person more interested in?”, so it doesn’t seem hugely important to me.
Yes! +10 to this! For some reason when I express opinions of the form “Alignment isn’t the most valuable thing on the margin”, alignment-oriented folks (e.g., Paul here) seem to think I’m saying you shouldn’t work on alignment (which I’m not), which triggers a “Yes, this is the most valuable thing” reply. I’m trying to say “Hey, if you care about AI x-risk, alignment isn’t the only game in town”, and staking some personal reputation points to push against the status quo where almost-everyone x-risk-oriented will work on alignment and almost-nobody x-risk-oriented will work on cooperation/coordination or multi/multi delegation.
Perhaps I should start saying “Guys, can we encourage folks to work on both issues please, so that people who care about x-risk have more ways to show up and professionally matter?”, and maybe that will trigger less pushback of the form “No, alignment is the most important thing”…
Thanks for this synopsis of your impressions, and +1 to the two points you think we agree on.
> I also read the post as implying or suggesting some things I’d disagree with:
As for these, some of them are real positions I hold, while some are not:
> That there is some real sense in which “cooperation itself is the problem.”
I don’t hold that view. The closest view I hold is more like: “Failing to cooperate on alignment is the problem, and solving it involves being both good at cooperation and good at alignment.”
> Relatedly, that cooperation plays a qualitatively different role than other kinds of cognitive enhancement or institutional improvement.
I don’t hold the view you attribute to me here, and I agree wholesale with the following position, including your comparisons of cooperation with brain enhancement and improving belief accuracy:
> I think that both cooperative improvements and cognitive enhancement operate by improving people’s ability to confront problems, and both of them have the downside that they also accelerate the arrival of many of our future problems (most of which are driven by human activity). My current sense is that cooperation has a better tradeoff than some forms of enhancement (e.g. giving humans bigger brains) and worse than others (e.g. improving the accuracy of people’s and institutions’ beliefs about the world).
… with one caveat: some beliefs are self-fulfilling, such as beliefs about cooperation/defection. There are ways of improving belief accuracy that favor defection, and ways that favor cooperation. Plausibly to me, the ways of improving belief accuracy that favor defection are worse than no accuracy improvement at all. I’m not particularly firm in this view, though; it’s more of a hedge.
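The self-fulfilling-beliefs point can be sketched with a toy stag-hunt model (the payoffs and belief-update rule here are my own illustrative choices, not anything from the thread): each agent cooperates only if it believes its partner is likely enough to cooperate, and beliefs then track observed behavior, so optimism and pessimism are each self-fulfilling.

```python
# Toy model of self-fulfilling beliefs in a stag-hunt coordination game.
# Illustrative assumptions: cooperating pays 4 if the partner cooperates
# and 0 otherwise; defecting pays a safe 2 regardless.

def equilibrium_coop_rate(initial_belief, rounds=20):
    """Iterate best responses: cooperate iff expected payoff beats defecting,
    then update beliefs toward the observed cooperation rate."""
    belief = initial_belief
    coop_rate = 0.0
    for _ in range(rounds):
        # Cooperate iff 4 * P(partner cooperates) > 2, i.e. belief > 0.5.
        coop_rate = 1.0 if 4 * belief > 2 else 0.0
        belief = coop_rate  # beliefs come to match observed behavior
    return coop_rate

# Optimistic beliefs make cooperation self-fulfilling; pessimistic
# beliefs make defection self-fulfilling:
print(equilibrium_coop_rate(0.6))  # 1.0
print(equilibrium_coop_rate(0.4))  # 0.0
```

The point of the sketch: a "belief-accuracy improvement" that nudges everyone's initial belief below 0.5 locks in the defection equilibrium, even though each agent's belief ends up perfectly accurate.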
> That the nature of the coordination problem for AI systems is qualitatively different from the problem for humans, or somehow is tied up with existential risk from AI in a distinctive way. I think that the coordination problem amongst reasonably-aligned AI systems is very similar to coordination problems amongst humans, and that interventions that improve coordination amongst existing humans and institutions (and research that engages in detail with the nature of existing coordination challenges) are generally more valuable than e.g. work in multi-agent RL or computational social choice.
I do hold this view! Particularly the bolded part. I also agree with the bolded parts of your counterpoint, but I think you might be underestimating the value of technical work (e.g., CSC, MARL) directed at improving coordination amongst existing humans and human institutions.
I think blockchain tech is a good example of an already-mildly-transformative technology for implementing radically mutually transparent and cooperative strategies through smart contracts. Make no mistake: I’m not claiming blockchain tech is going to “save the world”; rather, it’s changing the way people cooperate, and is doing so as a result of a technical insight. I think more technical insights are in order to improve cooperation and/or the global structure of society, and it’s worth spending research efforts to find them.
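To make the smart-contract point concrete, here is a toy escrow written in ordinary Python (a hypothetical sketch: the class, party names, and payoff rule are my own illustrative choices, not a real blockchain API). The cooperative outcome is enforced by deterministic rules that both parties can inspect before committing:

```python
# Toy escrow in the spirit of a smart contract: publicly inspectable,
# deterministic settlement rules that reward joint commitment.

class EscrowContract:
    def __init__(self, party_a, party_b, stake):
        self.stake = stake
        self.deposits = {party_a: 0, party_b: 0}

    def deposit(self, party, amount):
        self.deposits[party] += amount

    def settle(self):
        # Both parties can verify this rule in advance: a joint surplus
        # is paid out only if both have staked the full amount.
        if all(v >= self.stake for v in self.deposits.values()):
            return {p: v + self.stake // 2 for p, v in self.deposits.items()}
        # Otherwise deposits are simply refunded, so unilateral defection
        # gains nothing.
        return dict(self.deposits)
```

Because the settlement rule is public and deterministic, each party can verify before committing that defection cannot be profitable; that mutual transparency, rather than trust in the counterparty, is what makes the cooperative strategy stable.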
Reminder: this is not a bid for you personally to quit working on alignment!
> That this story is consistent with your prior arguments for why single-single alignment has low (or even negative) value. For example, in this comment you wrote “reliability is a strongly dominant factor in decisions in deploying real-world technology, such that to me it feels roughly-correct to treat it as the only factor.” But in this story people choose to adopt technologies that are less robustly aligned because they lead to more capabilities. This tradeoff has real costs even for the person deploying the AI (who is ultimately no longer able to actually receive any profits at all from the firms in which they are nominally a shareholder). So to me your story seems inconsistent with that position and with your prior argument. (Though I don’t actually disagree with the framing in this story, and I may simply not understand your prior position.)
My prior (and present) position is that reliability meeting a certain threshold, rather than being optimized, is a dominant factor in how soon deployment happens. In practice, I think the threshold by default will not be “Reliable enough to partake in a globally cooperative technosphere that preserves human existence”, but rather, “Reliable enough to optimize unilaterally for the benefits of the stakeholders of each system, i.e., to maintain or increase each stakeholder’s competitive advantage.” With that threshold, there easily arises a RAAP racing to the bottom on how much human control/safety/existence is left in the global economy. I think both purely-human interventions (e.g., talking with governments) and sociotechnical interventions (e.g., inventing cooperation-promoting tech) can improve that situation. This is not to say “cooperation is all you need”, any more than I would say “alignment is all you need”.
I may write more on this later, but for now I just want to express exuberance at someone in the x-risk space thinking and writing about this :)