Failure mode: When B-cultured entities invest in “having more influence”, often the easiest way to do this will be for them to invest in or copy A’-cultured-entities/processes. This increases the total presence of A’-like processes in the world, which have many opportunities to coordinate because of their shared (power-maximizing) values. Moreover, the A’ culture has an incentive to trick the B culture(s) into thinking A’ will not take over the world, but eventually, A’ wins.
I’m wondering why the easiest way is to copy A’—why was A’ better at acquiring influence in the first place, so that copying them or investing in them is a dominant strategy? I think I agree that once you’re at that point, A’ has an advantage.
In other words, the humans and human-aligned institutions not collectively being good enough at cooperation/bargaining risks a slow slipping-away of hard-to-express values and an easy takeover of simple-to-express values (e.g., power-maximization).
This doesn’t feel like other words to me, it feels like a totally different claim.
Thanks for noticing whatever you think are the inconsistencies; if you have time, I’d love for you to point them out.
In the production web story it sounds like the web is made out of different firms competing for profit and influence with each other, rather than a set of firms that are willing to leave profit on the table to benefit one another since they all share the value of maximizing production. For example, you talk about how selection drives this dynamic, but the firms that succeed are those that maximize their own profits and influence (not those that are willing to leave profit on the table to benefit other firms).
So none of the concrete examples of Wei Dai’s economies of scale actually seem to apply, or to give an advantage to the profit-maximizers in the production web. For example, natural monopolies in the production web wouldn’t charge each other marginal costs; they would charge profit-maximizing prices. And they won’t share infrastructure investments except by solving exactly the same bargaining problem as any other agents (since a firm that indiscriminately shared its infrastructure would get outcompeted). And so on.
Specifically, the subprocesses of each culture that are in charge of production-maximization end up cooperating really well with each other in a way that ends up collectively overwhelming the original (human) cultures.
This seems like a core claim (certainly if you are envisioning a scenario like the one Wei Dai describes), but I don’t yet understand why this happens.
Suppose that the US and China both have productive widget-industries. You seem to be saying that their widget-industries can coordinate with each other to create lots of widgets, and they will do this more effectively than the US and China can coordinate with each other.
Could you give some concrete example of how the US widget industry and the Chinese widget industries coordinate with each other to make more widgets, and why this behavior is selected?
For example, you might think that the Chinese and US widget industry share their insights into how to make widgets (as the aligned actors do in Wei Dai’s story), and that this will cause widget-making to do better than other non-widget sectors where such coordination is not possible. But I don’t see why they would do that—the US firms that share their insights freely with Chinese firms do worse, and would be selected against in every relevant sense, relative to firms that attempt to effectively monetize their insights. But effectively monetizing their insights is exactly what the US widget industry should do in order to benefit the US. So I see no reason why the widget industry would be more prone to sharing its insights.
So I don’t think that particular example works. I’m looking for an example of that form though, some concrete form of cooperation that the production-maximization subprocesses might engage in that allows them to overwhelm the original cultures, to give some indication for why you think this will happen in general.
For some reason when I express opinions of the form “Alignment isn’t the most valuable thing on the margin”, alignment-oriented folks (e.g., Paul here) seem to think I’m saying you shouldn’t work on alignment
In fairness, writing “marginal deep-thinking researchers [should not] allocate themselves to making alignment […] cheaper/easier/better” is pretty similar to saying “one shouldn’t work on alignment.”
(I didn’t read you as saying that Paul or Rohin shouldn’t work on alignment, and indeed I’d care much less about that than about a researcher at CHAI arguing that CHAI students shouldn’t work on alignment.)
On top of that, in your prior post you make stronger claims:
“Contributions to OODR research are not particularly helpful to existential safety in my opinion.”
“Contributions to preference learning are not particularly helpful to existential safety in my opinion.”
“In any case, I see AI alignment in turn as having two main potential applications to existential safety:” (excluding the main channel Paul cares about and argues for, namely that making alignment easier improves the probability that the bulk of deployed ML systems are aligned and reduces the competitive advantage for misaligned agents)
In the current post you (mostly) didn’t make claims about the relative value of different areas, and so I was (mostly) objecting to arguments that I consider misleading or incorrect. But you appeared to be sticking with the claims from your prior post and so I still ascribed those views to you in a way that may have colored my responses.
maybe that will trigger less pushback of the form “No, alignment is the most important thing”…
I’m not really claiming that AI alignment is the most important thing to work on (though I do think it’s among the best ways to address problems posed by misaligned AI systems in particular). I’m generally supportive of and excited about a wide variety of approaches to improving society’s ability to cope with future challenges (though multi-agent RL or computational social choice would not be near the top of my personal list).
Failing to cooperate on alignment is the problem, and solving it involves being both good at cooperation and good at alignment
Sounds like we are on broadly the same page. I would have said “Aligning ML systems is more likely if we understand more about how to align ML systems, or are better at coordinating to differentially deploy aligned systems, or are wiser or smarter or...” and then moved on to talking about how alignment research quantitatively compares to improvements in various kinds of coordination or wisdom or whatever. (My bottom line from doing this exercise is that I feel more general capabilities typically look less cost-effective on alignment in particular, but benefit a ton from the diversity of problems they help address.)
My prior (and present) position is that reliability meeting a certain threshold, rather than being optimized, is a dominant factor in how soon deployment happens.
I don’t think we can get to convergence on many of these discussions, so I’m happy to just leave it here for the reader to think through.
Reminder: this is not a bid for you personally to quit working on alignment!
I’m reading this (and your prior post) as bids for junior researchers to shift what they focus on. My hope is that seeing the back-and-forth in the comments will, in expectation, help them decide better.
Both are aiming to preserve human values, but within A, a subculture A’ develops to favor more efficient business practices (nihilistic power-maximizing) over preserving human values.
I was asking you why you thought A’ would effectively outcompete B (sorry for being unclear). For example, why do people with intrinsic interest in power-maximization outcompete people who are interested in human flourishing but still invest their money to have more influence in the future?
One obvious reason is single-single misalignment—A’ is willing to deploy misaligned AI in order to get an advantage, while B isn’t—but you say “their tech is aligned with them” so it sounds like you’re setting this aside. But maybe you mean that A’ has values that make alignment easy, while B has values that make alignment hard, and so B’s disadvantage still comes from single-single misalignment even though A’s systems are aligned?
Another advantage is that A’ can invest almost all of their resources, while B wants to spend some of their resources today to e.g. help presently-living humans flourish. But quantitatively that advantage doesn’t seem like it can cause A’ to dominate, since B can secure rapidly rising quality of life for all humans using only a small fraction of its initial endowment.
Wei Dai has suggested that groups with unified values might outcompete groups with heterogeneous values since homogeneous values allow for better coordination, and that AI may make this phenomenon more important. For example, if a research-producer and research-consumer have different values, then the producer may restrict access as part of an inefficient negotiation process and so they may be at a competitive disadvantage relative to a competing community where research is shared freely. This feels inconsistent with many of the things you are saying in your story, but I might be misunderstanding what you are saying and it could be that some argument like Wei Dai’s is the best way to translate your concerns into my language.
My sense is that you have something else in mind. I included the last bullet point as a representative example to describe the kind of advantage I could imagine you thinking that A’ had.
I think that most likely either humans are killed incidentally as part of the sensor-hijacking (since that’s likely to be the easiest way to deal with them), or else AI systems reserve a negligible fraction of their resources to keep humans alive and happy (but disempowered) based on something like moral pluralism or being nice or acausal trade (e.g. the belief that much of their influence comes from the worlds in which they are simulated by humans who didn’t mess up alignment and who would be willing to exchange a small part of their resources in order to keep the people in the story alive and happy).
The main point of intervention in this scenario that stood out to me would be making sure that (during the paragraph beginning with “For many people this is a very scary situation.”) we at least attempt to use AI-negotiators to try to broker an international agreement to stop development of this technology until we understood it better (and using AI-designed systems for enforcement/surveillance). Is there anything in particular that makes this infeasible?
I don’t think this is infeasible. It’s not the intervention I’m most focused on, but it may be the easiest way to avoid this failure (and it’s an important channel for advance preparations to make things better / important payoff for understanding what’s up with alignment and correctly anticipating problems).
I understand the scenario to say that it isn’t, because the demonstrations are incomprehensible
Yes, if demonstrations are comprehensible then I don’t think you need much explicit AI conflict to whistleblow since we will train some systems to explain risks to us.
The global camera grab must involve plans that aren’t clearly bad to humans even when all the potential gotchas are pointed out. For example they may involve dynamics that humans just don’t understand, or where a brute force simulation or experiment would be prohibitively expensive without leaps of intuition that machines can make but humans cannot. Maybe that’s about tiny machines behaving in complicated ways or being created covertly, or crazy complicated dynamics of interacting computer systems that humans can’t figure out. It might involve the construction of new AI-designed AI systems which operate in different ways whose function we can’t really constrain except by seeing predictions of their behavior from an even-greater distance (machines which are predicted to lead to good-looking outcomes, which have been able to exhibit failures to us if so-incentivized, but which are even harder to control).
(There is obviously a lot you could say about all the tools at the human’s disposal to circumvent this kind of problem.)
This is one of the big ways in which the story is more pessimistic than my default, and perhaps the highlighted assumptions rule out the most plausible failures, especially (i) multi-year takeoff, (ii) reasonable competence on behalf of the civilization, (iii) “correct” generalization.
Even under those assumptions I do expect events to eventually become incomprehensible in the necessary ways, but it feels more likely that there will be enough intervening time for ML systems to e.g. solve alignment or help us shift to a new world order or whatever. (And as I mention, in the worlds where the ML systems can’t solve alignment well enough in the intervening time, I do agree that it’s unlikely we can solve it in advance.)
I’m a bit surprised that the outcome is worse than you expect, considering that this scenario is “easy mode” for societal competence and inner alignment, which seem to me to be very important parts of the overall problem.
The main way it’s worse than I expect is that I expect future people to have a long (subjective) time to solve these problems and to make much more progress than they do in this story.
Am I right to infer that you think outer alignment is the bulk of the alignment problem, more difficult than inner alignment and societal competence?
I don’t think it’s right to infer much about my stance on inner vs outer alignment. I don’t know if it makes sense to split out “social competence” in this way.
In this story, there aren’t any major actual wars, just simulated wars / war games. Right? Why is that? I look at the historical base rate of wars, and my intuitive model adds to that by saying that during times of rapid technological change it’s more likely that various factions will get various advantages (or even just think they have advantages) that make them want to try something risky. OTOH we haven’t had major war for seventy years, and maybe that’s because of nukes + other factors, and maybe nukes + other factors will still persist through the period of takeoff?
The lack of a hot war in this story is mostly from the recent trend. There may be a hot war prior to things heating up, and then the “takeoff” part of the story is subjectively shorter than the last 70 years.
IDK, I worry that the reasons why we haven’t had war for seventy years may be largely luck / observer selection effects, and also separately even if that’s wrong
I’m extremely skeptical of an appeal to observer selection effects changing the bottom line about what we should infer from the last 70 years. Luck sounds fine though.
Relatedly, in this story the AIs seem to be mostly on the same team? What do you think is going on “under the hood” so to speak: Have they all coordinated (perhaps without even causally communicating) to cut the humans out of control of the future?
I don’t think the AI systems are all on the same team. That said, to the extent that there are “humans are deluded” outcomes that are generally preferable according to many AI’s values, I think the AIs will tend to bring about such outcomes. I don’t have a strong view on whether that involves explicit coordination. I do think the range for every-wins outcomes (amongst AIs) is larger because of the “AI’s generalize ‘correctly’” assumption, so this story probably feels a bit more like “us vs them” than a story that relaxed that assumption.
Why aren’t they fighting each other as well as the humans? Or maybe they do fight each other but you didn’t focus on that aspect of the story because it’s less relevant to us?
I think they are fighting each other all the time, though mostly in very prosaic ways (e.g. McDonald’s and Burger King’s marketing AIs are directly competing for customers). Are there some particular conflicts you imagine that are suppressed in the story?
I feel like when takeoff is that distributed, there will be at least some people/factions who create agenty AI systems that aren’t even as superficially aligned as the unaligned benchmark. They won’t even be trying to make things look good according to human judgment, much less augmented human judgment!
I’m imagining that’s the case in this story.
Failure is early enough in this story that e.g. the humans’ investment in sensor networks and rare expensive audits isn’t slowing them down very much compared to the “rogue” AI.
Such “rogue” AI could provide a competitive pressure, but I think it’s a minority of the competitive pressure overall (and at any rate it has the same role/effect as the other competitive pressure described in this story).
Can you say more about how “the failure modes in this story are an important input into treachery?”
We will be deploying many systems to anticipate/prevent treachery. If we could stay “in the loop” in the sense that would be needed to survive this outer alignment story, then I think we would also be “in the loop” in roughly the sense needed to avoid treachery. (Though it’s not obvious in light of the possibility of civilization-wide cascading ML failures, and does depend on further technical questions about techniques for avoiding that kind of catastrophe.)
I currently can’t tell if by “outer alignment failure” you’re referring to the entire ecosystem of machines being outer-misaligned, or just each individual machine (and if so, which ones in particular), and I’d like to sync with your usage of the concept if possible (or at least know how to sync with it).
I’m saying each individual machine is misaligned, because each individual machine is searching over plans to find one that leads to an outcome that humans will judge as good in hindsight. The collective behavior of many machines each individually trying to make things look good in hindsight leads to an outcome where things look good in hindsight. All the machines achieve what they are trying to achieve (namely things look really good according to the judgments-in-hindsight), but humans are marginalized and don’t get what they want, and that’s consistent because no machines cared about humans getting what they want. This is not a story where some machines were trying to help humans but were frustrated by emergent properties of their interaction.
I realize you don’t have a precise meaning of outer misalignment in mind, but in my opinion, confusion around this concept is central to the confused expectation that “alignment solutions” are adequate (on the technological side) for averting AI x-risk.
I use “outer alignment” to refer to a step in some alignment approaches. It is a well-defined subproblem for some approaches (namely those that aim to implement a loss function that accurately reflects human preferences over system behavior, and then produce an aligned system by optimizing that loss function), and obviously inapplicable to some approaches, and kind of a fuzzy and vague subproblem of others.
It’s a bit weird to talk about a failure story as an “outer” alignment failure story, or to describe a general system acting in the world as “outer misaligned,” since most possible systems weren’t built by following an alignment methodology that admits a clean division into an “outer” and “inner” part.
I added the word “(outer)” in the title as a parenthetical to better flag the assumption about generalization mentioned in the appendix. I expected this flag to be meaningful for many readers here. If it’s not meaningful to you then I would suggest ignoring it.
If there’s anything useful to talk about in that space I think it’s the implicit assumption (made explicit in the first bullet of the appendix) about how systems generalize. Namely, you might think that a system that is trained to achieve outcomes that look good to a human will in fact be trying to do something quite different. I think there’s a pretty good chance of that, in which case this story would look different (because the ML systems would conspire to disempower humans much earlier in the story). However, it would still be the case that we fail because individual systems are trying to bring about failure.
confused expectation that “alignment solutions” are adequate (on the technological side) for averting AI x-risk.
Note that this isn’t my view about intent alignment. (Though it is true tautologically for people who define “alignment” as “the problem of building AI systems that produce good outcomes when run,” though as I’ve said I quite dislike that definition.)
I think there are many x-risks posed or exacerbated by AI progress beyond intent alignment problems. (Though I do think that intent alignment is sufficient to avoid e.g. the concern articulated in your production web story.)
It’s conceivable to me that making future narratives much more specific regarding the intended goals of AI designers
The people who design AI (and moreover the people who use AI) have a big messy range of things they want. They want to live happy lives, and to preserve their status in the world, and to be safe from violence, and to be respected by people they care about, and similar things for their children...
When they invest in companies, or buy products from companies, or try to pass laws, they do so as a means to those complicated ends. That is, they hope that in virtue of being a shareholder of a successful company (or whatever) they will be in a better position to achieve their desires in the future.
One axis of specificity is to say things about what exactly they are imagining getting out of their investments or purchases (which will inform lots of low level choices they make). For example: the shareholders expect this company to pay dividends into their bank accounts, and they expect to be able to use the money in their bank accounts to buy things they want in the future, and they expect that if the company is not doing a good job they will be able to vote to replace the CEO, and so on. Some of the particular things they imagine buying: real estate and news coverage and security services. If they purchase security services: they hope that those security services will keep them safe in some broad and intuitive sense. There are some components of that they can articulate easily (e.g. they don’t want to get shot) and some they can’t (e.g. they want to feel safe, they don’t want to be coerced, they want to retain as much flexibility as possible when using public facilities, etc.).
A second axis would be to break this down to the level of “single” AI systems, i.e. individual components which are optimized end-to-end. For example, one could enumerate the AI systems involved in running a factory or fighting a war or some other complex project. There are probably thousands of AI systems involved in each of those projects, but you could zoom in on some particular examples, e.g. what AI system is responsible for making decisions about the flight path of a particular drone, and then zoom in on one of the many AI systems involved in the choice to deploy that particular AI (and how to train it). We could talk about how these individual AI systems, each trying to make things look good in hindsight (or pursuing instrumental subgoals thereof), collectively bring about an outcome that looks good in hindsight. (Though mostly I regard that as non-mysterious—if you have a bunch of AI systems trying to achieve X, or identifying intermediates Y that would tend to lead to X and then deploying new AI to achieve Y, it’s clear enough how that can lead to X. I also agree that it can lead to non-X, but that doesn’t really happen in this story.)
A third axis would be to talk in more detail about exactly how a particular AI is constructed, e.g. over what time period is training data gathered from what sensors? How are simulated scenarios generated, when those are needed? What humans and other ML systems are involved in the actual evaluation of outcomes that is used to train and validate it?
For each of those three axes (and many others) it seems like there’s a ton of things one could try to specify more precisely. You could easily write a dozen pages about the training of a single AI system, or a dozen pages enumerating an overview of the AI systems involved in a single complex project, or a dozen pages describing the hopes and intentions of the humans interacting with a particular AI. So you have to be pretty picky about which you spell out.
My question: Are you up for making your thinking and/or explaining about outer misalignment a bit more narratively precise here? E.g., could you say something like “«machine X» in the story is outer-misaligned because «reason»”?
Do you mean explaining why I judge these systems to be misaligned (a), or explaining causally how it is that they became misaligned (b)?
For (a): I’m judging these systems to be misaligned because they take concrete actions that they can easily determine are contrary to what their operators want. Skimming my story again, here are the main concrete decisions that I would describe as obviously contrary to the user’s intentions:
The Ponzi scheme and factory that fabricates earnings reports understand that customers will be unhappy about this when they discover it several months in the future, yet they take those actions anyway. Although these failures are not particularly destructive on their own, they are provided as representative examples of a broader class of “alignment warning shots” that are happening and provide the justification for people deploying AI systems that avoid human disapproval over longer and longer time horizons.
The watchdogs who alternately scare or comfort us (based on what we asked for), with none of them explaining honestly what is going on, are misaligned. If we could build aligned systems, then those systems would sit down with us and talk about the risks and explain what’s up as best they can, they would explain the likely bad outcomes in which sensors are corrupted and how that corruption occurs, and they would advise on e.g. what policies would avoid that outcome.
The machines that build/deploy/defend sensor networks are misaligned, which is why they actively insert vulnerabilities that would be exploited by attackers who intend to “cooperate” and avoid creating an appearance of trouble. Those vulnerabilities are not what the humans want in any sense. Similarly, the defense system that allows invaders to take over a city as long as they participate in perpetuating an illusion of security is obviously misaligned.
The machines that actually hack cameras and seize datacenters are misaligned, because the humans don’t actually care about the cameras showing happy pictures or the datacenters recording good news. Machines were deployed to optimize those indicators because they can serve as useful proxies for “we are actually safe and happy.”
Most complex activities involve a large number of components, and I agree that these descriptions are still “multi-agent” in the sense that e.g. managing an investment portfolio involves multiple distinct AIs. (The only possible exception is the watchdog system.) But these outcomes obtain because individual ML components are trying to bring them about, and so it still makes sense to intervene on the motivations of individual components in order to avoid these bad outcomes.
For example, carrying out and concealing a Ponzi scheme involves many actions that are taken because they successfully conceal the deception (e.g. you need to organize a financial statement carefully to deflect attention from an auditor), by a particular machine (e.g. an automated report-preparation system which is anticipating the consequences of emitting different possible reports) which is trying to carry out that deception (in the sense of considering many possible actions and selecting those that successfully deceive), despite being able to predict that the user will ultimately say that this was contrary to their preferences.
(b): these systems became misaligned because they are an implementation of an algorithm (the “unaligned benchmark”) that seems unlikely to produce aligned systems. They were deployed because they were often useful despite their misalignment. They weren’t replaced by aligned versions because we didn’t know of any alternative algorithm that was similarly useful (and many unspecified alignment efforts have apparently failed). I do think we could have avoided this story in many different ways, and so you could highlight any of those as a causal factor (the story highlights none): we could have figured out how to build aligned systems, we could have anticipated the outcome and made deals to avoid it, more institutions could be managed by smarter or more forward-looking decision-makers, we could have a strong sufficiently competent world government, etc.
In this story, what is preventing humans from going collectively insane due to nations, political factions, or even individuals blasting AI-powered persuasion/propaganda at each other? (Maybe this is what you meant by “people yelling at each other”?)
It seems like the AI described in this story is still aligned enough to defend against AI-powered persuasion (i.e. by the time that AI is sophisticated enough to cause that kind of trouble, most people are not ever coming into contact with adversarial content).
Why don’t AI safety researchers try to leverage AI to improve AI alignment, for example implementing DEBATE and using that to further improve alignment, or just an ad hoc informal version where you ask various AI advisors to come up with improved alignment schemes and to critique/defend each other’s ideas?
I think they do, but it’s not clear whether any of them change the main dynamic described in the post.
(My expectation is that we end up with one or multiple sequences of “improved” alignment schemes that eventually lock in wrong solutions to some philosophical or metaphilosophical problems, or has some other problem that is much subtler than the kind of outer alignment failure described here.)
I’d like to have a human society that is free to grow up in a way that looks good to humans, and which retains enough control to do whatever they decide is right down the line (while remaining safe and gradually expanding the resources available to them for continued growth). When push comes to shove I expect most people to strongly prefer that kind of hope (vs one that builds a kind of AI that will reach the right conclusions about everything), not on the basis of sophisticated explicit reasoning but because that’s the only path that can really grow out of the current trajectory in a way that’s not super objectionable to lots of people, and so I’m focusing on people’s attempts and failures to construct such an AI.
I don’t know exactly what kind of failure you are imagining is locked in, that pre-empts or avoids the kind of failure described here. Maybe you think it doesn’t pre-empt this failure, but that you expect we probably can solve the immediate problem described in this post and then get screwed by a different problem down the line. If so, then I think I agree that this story is a little bit on the pessimistic side w.r.t. the immediate problem, although I may disagree about how pessimistic to be about it. (Though there’s still a potentially-larger disagreement about just how bad the situation is after solving that immediate problem.)
(You might leave great value on the table from e.g. not bargaining with the simulators early enough and so getting shut off, or not bargaining with each other before you learn facts that make them impossible and so permanently leaving value on the table, but this is not a story about that kind of failure and indeed those happen in parallel with the failure in this story.)
As a result, the model learned heuristic H, that works in all the circumstances you did consider, but fails in circumstance C.
That’s basically where I start, but then I want to try to tell some story about why it kills you, i.e. what is it about the heuristic H and circumstance C that causes it to kill you?
I agree this involves discretion; indeed, moving beyond the trivial story “The algorithm fails and then it turns out you die” requires discretion, since such stories are certainly plausible. The other extreme would be to require us to keep making the story more and more concrete until we had fully specified the model, which also seems intractable. So instead I’m doing something in between, roughly: I’m allowed to push on the story to make it more concrete along any axis, but I recognize that I won’t have time to pin down every axis, so I’m basically only going to do this a bounded number of times before I have to admit that the story seems plausible enough. (So I can’t fill in a billion parameters of my model one by one this way; what’s worse, filling in those parameters would take even more than a billion units of time, and so this may become intractable even before you get to a billion.)
I’d say that every single machine in the story is misaligned, so hopefully that makes it easy :)
I’m basically always talking about intent alignment, as described in this post.
(I called the story an “outer” misalignment story because it focuses on the—somewhat improbable—case in which the intentions of the machines are all natural generalizations of their training objectives. I don’t have a precise definition of inner or outer alignment and think they are even less well defined than intent alignment in general, but sometimes the meaning seems unambiguous and it seemed worth flagging specifically because I consider that one of the least realistic parts of this story.)
In my other response to your comment I wrote:
I would also expect that e.g. if you were to describe almost any existing practical system with purported provable security, it would be straightforward for a layperson with theoretical background (e.g. me) to describe possible attacks that are not precluded by the security proof, and that it wouldn’t even take that long.
I guess SSH itself would be an interesting test of this, e.g. comparing the theoretical model of this paper to a modern implementation. What is your view about that comparison? e.g. how do you think about the following possibilities:
1. There is no material weakness in the security proof.
2. A material weakness is already known.
3. An interested layperson could find a material weakness with moderate effort.
4. An expert could find a material weakness with significant effort.
My guess would be that probably we’re in world 2, and if not that it’s probably because no one cares that much (e.g. because it’s obvious that there will be some material weakness and the standards of the field are such that it’s not publishable unless it actually comes with an attack) and we are in world 3.
(On a quick skim, and from the author’s language when describing the model, my guess is that material weaknesses of the model are more or less obvious and that the authors are aware of potential attacks not covered by their model.)
Why did you write “This post [Inaccessible Information] doesn’t reflect me becoming more pessimistic about iterated amplification or alignment overall.” just one month before publishing “Learning the prior”? (Is it because you were classifying “learning the prior” / imitative generalization under “iterated amplification” and now you consider it a different algorithm?)
I think that post is basically talking about the same kinds of hard cases as in Towards Formalizing Universality 1.5 years earlier (in section IV), so it’s intended to be more about clarification/exposition than changing views.
See the thread with Rohin above for some rough history.
Why doesn’t the analogy with cryptography make you a lot more pessimistic about AI alignment, as it did for me?
I’m not sure. It’s possible I would become more pessimistic if I walked through concrete cases of people’s analyses being wrong in subtle and surprising ways.
My experience with practical systems is that it is usually easy for theorists to describe hypothetical breaks for the security model, and the issue is mostly one of prioritization (since people normally don’t care too much about security). For example, my strong expectation would be that people had described hypothetical attacks on any of the systems discussed in the article you linked prior to their implementation, at least if they had ever been subject to formal scrutiny. The failures are just quite far away from the levels of paranoia that I’ve seen people on the theory side exhibit when they are trying to think of attacks.
I would also expect that e.g. if you were to describe almost any existing practical system with purported provable security, it would be straightforward for a layperson with theoretical background (e.g. me) to describe possible attacks that are not precluded by the security proof, and that it wouldn’t even take that long. It sounds like a fun game.
Another possible divergence is that I’m less convinced by the analogy, since alignment seems more about avoiding the introduction of adversarial consequentialists and it’s not clear if that game behaves in the same way. I’m not sure if that’s more or less important than the prior point.
Would you do anything else to make sure it’s safe, before letting it become potentially superintelligent? For example would you want to see “alignment proofs” similar to “security proofs” in cryptography?
I would want to do a lot of work before deploying an algorithm in any context where a failure would be catastrophic (though “before letting it become potentially superintelligent” kind of suggests a development model I’m not on board with).
That would ideally involve theoretical analysis from a lot of angles, e.g. proofs of key properties that are amenable to proof, demonstrations of how the system could plausibly fail if we were wrong about key claims or if we relax assumptions, and so on.
It would also involve good empirical characterization, including things like running on red team inputs, or changing the training procedure in ways that seem as bad as possible while still preserving our alignment arguments, and performing extensive evals under those more pessimistic conditions. It would involve validating key claims individually, and empirically testing other claims that are established by structurally similar arguments. It would involve characterizing scaling behavior where applicable and understanding it as well as we can (along with typical levels of variability and plausible stories about deviations from trend).
What if such things do not seem feasible or you can’t reach very high confidence that the definitions/assumptions/proofs are correct?
I’m not exactly sure what you are asking. It seems like we’ll do what we can on all the fronts and prioritize them as well as we can. Do you mean, what else can we say today about what methodologies we’d use? Or under what conditions would I pivot to spending down my political capital to delay deployment? Or something else?
From my perspective, there is a core reason for worry, which is something like “you can’t fully control what patterns of thought your algorithm learns, and how they’ll behave in new circumstances”, and it feels like you could always apply that as your step 2.
That doesn’t seem like it has quite the type signature I’m looking for. I’m imagining a story as a description of how something bad happens, so I want the story to end with “and then something bad happens.”
In some sense you could start from the trivial story “Your algorithm didn’t work and then something bad happened.” Then the “search for stories” step is really just trying to figure out if the trivial story is plausible. I think that’s pretty similar to a story like: “You can’t control what your model thinks, so in some new situation it decides to kill you.”
I’m mostly doing that by making it more and more concrete—something is plausible iff there is a plausible way to fill in all the details. E.g. how is the model thinking, and why does that lead it to decide to kill you?
Sometimes after filling in a few details I’ll see that the current story isn’t actually plausible after all (i.e. now I see how to argue that the details-so-far are contradictory). In that case I backtrack.
Sometimes I fill in enough details that I’m fairly convinced the story is plausible, i.e. that there is some way to fill in the rest of the details that’s consistent with everything I know about the world. In that case I try to come up with a new algorithm or new assumption.
(Sometimes plausibility takes the form of an argument that there is a way to fill in some set of details, e.g. maybe there’s an argument that a big enough model could certainly compute X. Or sometimes I’m just pretty convinced for heuristic reasons.)
That’s not a fully-precise methodology. But it’s roughly what I’d do. (There are many places where the methodology in this post is not fully precise and certainly not mechanical.)
If I were starting from the trivial story “and then your algorithm kills you,” my first move would usually be to try to say what kind of model was learned, which needs to behave well on the training set and plausibly kill you off distribution. Then I might try to shoot that story down by showing that some other model behaves even better on the training set or is even more readily learned (to try to contradict the part where the story needed to claim “And this was the model learned by SGD”), and then gradually fill in more details as necessary to evaluate the plausibility of the story.
I think the upshot of those technologies (and similarly for ML assistants) is:
1. It takes longer before you actually face a catastrophe.
2. In that time, you can make faster progress towards an “out”.
By an “out” I mean something like: (i) figuring out how to build competitive aligned optimizers, (ii) coordinating to avoid deploying unaligned AI.
Unfortunately I think the first benefit (delaying the catastrophe) is a bit less impactful than it initially seems, at least if we live in a world of accelerating growth towards a singularity. For example, if the singularity is in 2045 and it’s 2035, and you were going to have catastrophic failure in 2040, you can’t really delay it by much calendar time. So the delay helps you by letting you wait until you get fancier technology from the fast outside economy, but doesn’t give you too much more time for the slow humane economy to “catch up” on its own terms.
I don’t think, from the perspective of humans monitoring a single ML system running a concrete, quantifiable process—industry or mining or machine design—that it will be unexplainable. Just like today, tech stacks are already enormously complex, but at each layer someone does know how they work, and we know what they do at the layers that matter.
This seems like the key question.
Ever more complex designs for, say, a mining robot might come to resemble some mix of living creatures and fractal artwork, but we’ll still have reports that measure how much performance the design gives per cost.
I think that if we relate to our machines in the same way we relate to biological systems or ecologies, but AI systems actually understand those systems very well, then that’s basically what I mean.
Having reports about outcomes is a kind of understanding, but it’s basically the one I’m scared of (since e.g. it will be tough to learn about these kinds of systemic risks via outcome-driven reports, and attempts to push down near-misses may just transform them into full-blown catastrophes).
It seems like if Bob deploys an aligned AI, then it will ultimately yield control of all of its resources to Bob. It doesn’t seem to me like this would result in a worthless future even if every single human deploys such an AI.
The attractor I’m pointing at with the Production Web is that entities with no plan for what to do with resources—other than “acquire more resources”—have a tendency to win out competitively over entities with non-instrumental terminal values like “humans having good relationships with their children”.
Quantitatively I think that entities with purely instrumental values win very, very slowly. For example, if the average savings rate is 99% and my personal savings rate is only 95%, then by the time the economy grows 10,000x my share of the world will have fallen by about half. The levels of consumption needed to maintain human safety and current quality of life seem quite low (and the high-growth period during which they have to be maintained is quite short).
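To make that arithmetic concrete: under one toy model (my assumption, not spelled out in the thread) where each agent’s wealth compounds at a rate proportional to its savings rate, the share a lower saver retains after the economy grows by a factor G works out to G**(s_mine/s_avg − 1):

```python
def surviving_share(growth_factor: float, s_mine: float, s_avg: float) -> float:
    """Fraction of its starting world-share that an agent saving at rate
    s_mine retains after the whole economy (average savings rate s_avg)
    grows by growth_factor. Toy model: wealth compounds continuously at a
    rate proportional to the savings rate."""
    return growth_factor ** (s_mine / s_avg - 1)

# The 99%-vs-95% example from the text, with 10,000x growth:
print(round(surviving_share(10_000, s_mine=0.95, s_avg=0.99), 2))  # 0.69
```

Under this particular model the low saver’s share falls by about a third rather than about half, but the qualitative point is the same: a few extra percentage points of consumption cost only a modest fraction of your share even over 10,000x of growth. Other compounding models give somewhat different constants.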
Also, taxes typically transfer way more than that much value from high-savers to low-savers. It’s not clear to me what’s happening with taxes in your story. I guess you are imagining low-tax jurisdictions winning out, but again the pace at which that happens is even slower, and it is dwarfed by the typical rate of expropriation from war.
I think the difference between mine and your views here is that I think we are on track to collectively fail in that bargaining problem absent significant and novel progress on “AI bargaining” (which involves a lot of fairness/transparency) and the like, whereas I guess you think we are on track to succeed?
From my end it feels like the big difference is that quantitatively I think the overhead of achieving human values is extremely low, so the dynamics you point to are too weak to do anything before the end of time (unless single-single alignment turns out to be hard). I don’t know exactly what your view on this is.
If you agree that the main source of overhead is single-single alignment, then I think that the biggest difference between us is that I think that working on single-single alignment is the easiest way to make headway on that issue, whereas you expect greater improvements from some categories of technical work on coordination (my sense is that I’m quite skeptical about most of the particular kinds of work you advocate).
If you disagree, then I expect the main disagreement is about those other sources of overhead (e.g. you might have some other particular things in mind, or you might feel that unknown-unknowns are a larger fraction of the total risk, or something else).
I think I disagree with you on the tininess of the advantage conferred by ignoring human values early on during a multi-polar take-off. I agree the long-run cost of supporting humans is tiny, but I’m trying to highlight a dynamic where fairly myopic/nihilistic power-maximizing entities end up quickly out-competing entities with other values, due to, as you say, bargaining failure on the part of the creators of the power-maximizing entities.
Could you explain the advantage you are imagining? Some candidates, none of which I think are your view:
Single-single alignment failures—e.g. it’s easier to build a widget-maximizing corporation than to build one where shareholders maintain meaningful control
Global savings rates are currently only 25%, power-seeking entities will be closer to 100%, and effective tax rates will fall (e.g. because of competition across states)
Preserving a hospitable environment will become very expensive relative to GDP (and there are many species of this view, though none of them seem plausible to me)
Yes, you understand me here. I’m not (yet?) in the camp that we humans have “mostly” lost sight of our basic goals, but I do feel we are on a slippery slope in that regard. Certainly many people feel “used” by employers/institutions in ways that are disconnected from their values. People with more job options feel less this way, because they choose jobs that don’t feel like that, but I think we are a minority in having that choice.
I think this is an indication of the system serving some people (e.g. capitalists, managers, high-skilled labor) better than others (e.g. the median line worker). That’s a really important and common complaint with the existing economic order, but I don’t really see how it indicates a Pareto improvement or is related to the central thesis of your post about firms failing to help their shareholders.
(In general wage labor is supposed to benefit you by giving you money, and then the question is whether the stuff you spend money on benefits you.)