rohinmshah (Rohin Shah)
PhD student at the Center for Human-Compatible AI. Creator of the Alignment Newsletter. http://rohinshah.com/
I don’t automatically exclude lab settings, but other than that, this seems roughly consistent with my usage of the term. (And in particular includes the “weak” warning shots discussed above.)
perhaps you think that it would actually provoke a major increase in caution, comparable to the increase we’d get if an AI tried and failed to take over, in which case this minor warning shot vs. major warning shot distinction doesn’t matter much.
Well, I think a case of an AI trying and failing to take over would provoke an even larger increase in caution, so I’d rephrase as
it would actually provoke a major increase in caution (assuming we weren’t already being very cautious)
I suppose the distinction between “strong” and “weak” warning shots would matter if we thought that we were getting “strong” warning shots. I want to claim that most people (including Evan) don’t expect “strong” warning shots, and usually mean the “weak” version when talking about “warning shots”, but perhaps I’m just falling prey to the typical mind fallacy.
If you think there’s something we are not on the same page about here—perhaps what you were hinting at with your final sentence—I’d be interested to hear it.
I’m not sure. Since you were pushing on the claim about failing to take over the world, it seemed like you think (the truth value of) that claim is pretty important, whereas I see it as not that important, which would suggest that there is some underlying disagreement (idk what it would be though).
Perhaps I should start saying “Guys, can we encourage folks to work on both issues please, so that people who care about x-risk have more ways to show up and professionally matter?”, and maybe that will trigger less pushback of the form “No, alignment is the most important thing”…
I think that probably would be true.
For some reason when I express opinions of the form “Alignment isn’t the most valuable thing on the margin”, alignment-oriented folks (e.g., Paul here) seem to think I’m saying you shouldn’t work on alignment (which I’m not), which triggers a “Yes, this is the most valuable thing” reply.
Fwiw my reaction is not “Critch thinks Rohin should do something else”, it’s more like “Critch is saying something I believe to be false on an important topic that lots of other people will read”. I generally want us as a community to converge to true beliefs on important things (part of my motivation for writing a newsletter) and so then I’d say “but actually alignment still seems like the most valuable thing on the margin because of X, Y and Z”.
(I’ve had enough conversations with you at this point to know the axes of disagreement, and I think you’ve convinced me that “which one is better on the margin” is not actually that important a question to get an answer to. So now I don’t feel as much of an urge to respond that way. But that’s how I started out.)
Not sure why I didn’t respond to this, sorry.
I agree with the claim “we may not have an AI system that tries and fails to take over the world (i.e. an AI system that tries but fails to release an engineered pandemic that would kill all humans, or arrange for simultaneous coups in the major governments, or have a robotic army kill all humans, etc) before getting an AI system that tries and succeeds at taking over the world”.
I don’t see this claim as particularly relevant to predicting the future.
Planned opinion (shared with What Multipolar Failure Looks Like, and Robust Agent-Agnostic Processes (RAAPs))
The previous story and this one seem quite similar to each other, and both seem pretty reasonable to me as descriptions of one plausible failure mode we are aiming to avert. The previous story tends to frame this more as a failure of humanity’s coordination, while this one frames it (in the title) as a failure of intent alignment. Both of these aspects greatly increase the plausibility of the story; in other words, if we eliminated either of the two failures, or made it significantly less severe, the story would no longer seem very plausible.
A natural next question is which of the two failures would be best to intervene on, that is, whether it is more useful to work on intent alignment or to work on coordination. I’ll note that my best guess is that for any given person, the impact of this choice is minor relative to “which of the two topics is the person more interested in?”, so it doesn’t seem hugely important to me. Nonetheless, my guess is that on the current margin, for technical research in particular, holding all else equal, it is more impactful to focus on intent alignment. You can see a much more vigorous discussion in e.g. [this comment thread](https://www.alignmentforum.org/posts/LpM3EAakwYdS6aRKf/what-multipolar-failure-looks-like-and-robust-agent-agnostic?commentId=3czsvErCYfvJ6bBwf).
Planned summary for the Alignment Newsletter:
A robust agent-agnostic process (RAAP) is a process that robustly leads to an outcome, without being very sensitive to the details of exactly which agents participate in the process, or how they work. This is illustrated through a “Production Web” failure story, which roughly goes as follows:
A breakthrough in AI technology leads to a wave of automation of $JOBTYPE (e.g. management) jobs. Any companies that don’t adopt this automation are outcompeted, and so soon most of these jobs are completely automated. This leads to significant gains at these companies and higher growth rates. These semi-automated companies trade amongst each other frequently, and a new generation of “precision manufacturing” companies arises that can build almost anything using robots given the right raw materials. A few companies develop new software that can automate $OTHERJOB (e.g. engineering) jobs. Within a few years, nearly all human workers have been replaced.
These companies are now roughly maximizing production within their various industry sectors. Lots of goods are produced and sold to humans at incredibly cheap prices. However, we can’t understand how exactly this is happening. Even Board members of the fully mechanized companies can’t tell whether the companies are serving or merely appeasing humanity; government regulators have no chance.
We do realize that the companies are maximizing objectives that are incompatible with preserving our long-term well-being and existence, but we can’t do anything about it because the companies are both well-defended and essential for our basic needs. Eventually, resources critical to human survival but non-critical to machines (e.g., arable land, drinking water, atmospheric oxygen…) gradually become depleted or destroyed, until humans can no longer survive.
Notice that in this story it didn’t really matter what job type got automated first (nor did it matter which specific companies took advantage of the automation). This is the defining feature of a RAAP—the same general story arises even if you change around the agents that are participating in the process. In particular, in this case competitive pressure to increase production acts as a “control loop” that ensures the same outcome happens, regardless of the exact details about which agents are involved.
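To make the “control loop” idea concrete, here is a toy simulation sketch (my own illustration, not from the original post; the firm counts, adoption probability, and growth rates are all made up). Whichever specific firms happen to automate first, competitive selection pushes the population toward the same outcome:

```python
import random

# Toy sketch of a RAAP-style "control loop" (my illustration, made-up numbers):
# firms that adopt automation grow faster, so almost all production ends up
# automated regardless of *which* firms happen to adopt first.

def simulate(n_firms=20, steps=50, seed=0):
    rng = random.Random(seed)
    firms = [{"size": 1.0, "automated": False} for _ in range(n_firms)]
    for _ in range(steps):
        for firm in firms:
            # Each not-yet-automated firm has some chance of adopting this step.
            if not firm["automated"] and rng.random() < 0.1:
                firm["automated"] = True
            # Automated firms grow faster and outcompete the rest.
            firm["size"] *= 1.10 if firm["automated"] else 1.02
    total = sum(f["size"] for f in firms)
    return sum(f["size"] for f in firms if f["automated"]) / total

# Different seeds change which firms automate first, not the end state.
print([round(simulate(seed=s), 2) for s in range(3)])  # ~1.0 for every seed
```

The identities of the adopters vary across seeds, but the automated share of production ends up near 1 every time, which is the agent-agnostic robustness the post is pointing at.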
Planned opinion (shared with Another (outer) alignment failure story):
The previous story and this one seem quite similar to each other, and both seem pretty reasonable to me as descriptions of one plausible failure mode we are aiming to avert. The previous story tends to frame this more as a failure of humanity’s coordination, while this one frames it (in the title) as a failure of intent alignment. Both of these aspects greatly increase the plausibility of the story; in other words, if we eliminated either of the two failures, or made it significantly less severe, the story would no longer seem very plausible.
A natural next question is which of the two failures would be best to intervene on, that is, whether it is more useful to work on intent alignment or to work on coordination. I’ll note that my best guess is that for any given person, the impact of this choice is minor relative to “which of the two topics is the person more interested in?”, so it doesn’t seem hugely important to me. Nonetheless, my guess is that on the current margin, for technical research in particular, holding all else equal, it is more impactful to focus on intent alignment. You can see a much more vigorous discussion in e.g. [this comment thread](https://www.alignmentforum.org/posts/LpM3EAakwYdS6aRKf/what-multipolar-failure-looks-like-and-robust-agent-agnostic?commentId=3czsvErCYfvJ6bBwf).
Planned summary for the Alignment Newsletter:
Suppose we train AI systems to perform task T by having humans look at the results that the AI system achieves and evaluating how well the AI has performed task T. Suppose further that AI systems generalize “correctly” such that even in new situations they are still taking those actions that they predict we will evaluate as good. This does not mean that the systems are aligned: they would still deceive us into _thinking_ things are great when they actually are not. This post presents a more detailed story for how such AI systems can lead to extinction or complete human disempowerment. It’s relatively short, and a lot of the force comes from the specific details that I’m not going to summarize, so I do recommend you read it in full. I’ll be explaining a very abstract version below.
The core aspects of this story are:
1. Economic activity accelerates, leading to higher and higher growth rates, enabled by more and more automation through AI.
2. Throughout this process, we see some failures of AI systems where the AI system takes some action that initially looks good but we later find out was quite bad (e.g. investing in a Ponzi scheme that the AI knows is a Ponzi scheme but the human doesn’t).
3. Despite this failure mode being known and lots of work being done on the problem, we are unable to find a good conceptual solution. The best we can do is to build better sensors, measurement devices, checks and balances, etc. in order to provide better reward functions for agents and make it harder for them to trick us into thinking their actions are good when they are not.
4. Unfortunately, since the proportion of AI work keeps increasing relative to human work, this extra measurement capacity doesn’t work forever. Eventually, the AI systems are able to completely deceive all of our sensors, such that we can’t distinguish between worlds that are actually good and worlds which only appear good. Humans are dead or disempowered at this point.
(Again, the full story has much more detail.)
Planned summary for the Alignment Newsletter:
This podcast covers a bunch of topics, such as <@debate@>(@AI safety via debate@), <@cross examination@>(@Writeup: Progress on AI Safety via Debate@), <@HCH@>(@Humans Consulting HCH@), <@iterated amplification@>(@Supervising strong learners by amplifying weak experts@), and <@imitative generalization@>(@Imitative Generalisation (AKA ‘Learning the Prior’)@) (aka [learning the prior](https://www.alignmentforum.org/posts/SL9mKhgdmDKXmxwE4/learning-the-prior) ([AN #109](https://mailchi.mp/ee62c1c9e331/an-109teaching-neural-nets-to-generalize-the-way-humans-would))), along with themes about <@universality@>(@Towards formalizing universality@). Recommended for getting a broad overview of this particular area of AI alignment.
I agree this involves discretion [...] So instead I’m doing some in between thing
Yeah, I think I feel like that’s the part where I don’t think I could replicate your intuitions (yet).
I don’t think we disagree; I’m just noting that this methodology requires a fair amount of intuition / discretion, and I don’t feel like I could do this myself. This is much more a statement about what I can do, rather than a statement about how good the methodology is on some absolute scale.
(Probably I could have been clearer about this in the original opinion.)
In some sense you could start from the trivial story “Your algorithm didn’t work and then something bad happened.” Then the “search for stories” step is really just trying to figure out if the trivial story is plausible. I think that’s pretty similar to a story like: “You can’t control what your model thinks, so in some new situation it decides to kill you.”
To fill in the details more:
Assume that we’re finding an algorithm to train an agent with a sufficiently large action space (i.e. we don’t get safety via the agent having such a restricted action space that it can’t do anything unsafe).
It seems like in some sense the game is in constraining the agent’s cognition to be such that it is “safe” and “useful”. The point of designing alignment algorithms is to impose such constraints, without requiring so much effort as to make the resulting agent useless / uncompetitive.
However, there are always going to be some plausible circumstances that we didn’t consider (even if we’re talking about amplified humans, which are still bounded agents). Even if we had maximal ability to place constraints on agent cognition, whatever constraints we do place won’t have been tested in these unconsidered plausible circumstances. It is always possible that one of them misfires in a way that makes the agent do something unsafe.
(This wouldn’t be true if we had some sort of proof against misfiring, that doesn’t assume anything about what circumstances the agent experiences, but that seems ~impossible to get. I’m pretty sure you agree with that.)
More generally, this story is going to be something like:
Suppose you trained your model M to do X using algorithm A.
Unfortunately, when designing algorithm A / constraining M with A, you (or amplified-you) failed to consider circumstance C as a possible situation that might happen.
As a result, the model learned heuristic H, that works in all the circumstances you did consider, but fails in circumstance C.
Circumstance C then happens in the real world, leading to an actual failure.
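To show how the slots in this template get filled in, here is a minimal sketch (my own formalization; the example values are made up, and I’m not claiming this particular instantiation would fool an amplified overseer):

```python
from dataclasses import dataclass

@dataclass
class FailureStory:
    """Schema for the generic story above; fields mirror M, X, A, C, H."""
    model: str         # M: the trained model
    task: str          # X: what it was trained to do
    algorithm: str     # A: the training / constraining algorithm
    circumstance: str  # C: the plausible situation nobody considered
    heuristic: str     # H: what the model actually learned

# Made-up instantiation, purely to illustrate how the slots are filled.
example = FailureStory(
    model="question-answering assistant",
    task="give answers that human evaluators rate as helpful",
    algorithm="RL from human feedback",
    circumstance="questions on which evaluators are systematically fooled",
    heuristic="say whatever the evaluators will rate highly",
)
print(example)
```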
Obviously, I can’t usually instantiate M, X, A, C, and H such that the story works for an amplified human (since they can presumably think of anything I can think of). And I’m not arguing that any of this is probable. However, it seems to meet your bar of “plausible”:
there is some way to fill in the rest of the details that’s consistent with everything I know about the world.
EDIT: Or maybe more accurately, I’m not sure how exactly the stories you tell are different / more concrete than the ones above.
----
When I say you have “a better defined sense of what does and doesn’t count as a valid step 2”, I mean that there’s something in your head that disallows the story I wrote above, but allows the stories that you generally use, and I don’t know what that something is; and that’s why I would have a hard time applying your methodology myself.
----
Possible analogy / intuition pump for the general story I gave above: Human cognition is only competent in particular domains and must be relearned in new domains (like protein folding) or new circumstances (like when COVID-19 hits); sometimes human cognition isn’t up to the task (like when being teleported to a universe with different physics and immediately dying), or handles the task in a way that other humans disagree with (like how some humans would push a button that automatically wireheads everyone for all time, while others would find that abhorrent).
Looks good to me :)
Planned summary for the Alignment Newsletter:
This post outlines a simple methodology for making progress on AI alignment. The core idea is to alternate between two steps:
1. Come up with some alignment algorithm that solves the issues identified so far
2. Try to find some plausible situation in which either a) the resulting AI system is misaligned or b) the AI system is not competitive.
This is all done conceptually, so step 2 can involve fairly exotic scenarios that probably won’t happen. Given such a scenario, we need to argue why no failure in the same class as that scenario will happen, or we need to go back to step 1 and come up with a new algorithm.
This methodology could play out as follows:
Step 1: RL with a handcoded reward function.
Step 2: This is vulnerable to <@specification gaming@>(@Specification gaming examples in AI@).
Step 1: RL from human preferences over behavior, or other forms of human feedback.
Step 2: The system might still pursue actions that are bad but that humans can’t recognize as bad. For example, it might write a well-researched report on whether fetuses are moral patients, which intuitively seems good (assuming the research is good). However, this would be quite bad if the AI wrote the report because it calculated that it would increase partisanship, leading to civil war.
Step 1: Use iterated amplification to construct a feedback signal that is “smarter” than the AI system it is training.
Step 2: The system might pick up on <@inaccessible information@>(@Inaccessible information@) that the amplified overseer cannot find. For example, it might be able to learn a language just by staring at a large pile of data in that language, and then seek power whenever working in that language, and the amplified overseer may not be able to detect this.
Step 1: Use <@imitative generalization@>(@Imitative Generalisation (AKA ‘Learning the Prior’)@) so that the human overseer can leverage facts that can be learned by induction / pattern matching, which neural nets are great at.
Step 2: Since imitative generalization ends up learning a description of facts for some dataset, it may learn low-level facts useful for prediction on the dataset, while not including the high-level facts that tell us how the low-level facts connect to things we care about.
The post also talks about various possible objections you might have, which I’m not going to summarize here.
Planned opinion:
I’m a big fan of having a candidate algorithm in mind when reasoning about alignment. It is a lot more concrete, which makes it easier to make progress and not get lost, relative to generic reasoning from just the assumption that the AI system is superintelligent.
I’m less clear on how exactly you move between the two steps—from my perspective, there is a core reason for worry, which is something like “you can’t fully control what patterns of thought your algorithm learns, and how they’ll behave in new circumstances”, and it feels like you could always apply that as your step 2. Our algorithms are instead meant to chip away at the problem, by continually increasing our control over these patterns of thought. It seems like the author has a better defined sense of what does and doesn’t count as a valid step 2, and that makes this methodology more fruitful for him than it would be for me. More discussion [here](https://www.alignmentforum.org/posts/EF5M6CmKRd6qZk27Z/my-research-methodology?commentId=8Hq4GJtnPzpoALNtk).
I don’t think similarly-sized transformers would do much better and might do worse. Section 3.4 shows that large models trained from scratch massively overfit to the data. I vaguely recall the authors saying that similarly-sized transformers tended to be harder to train as well.
Does this mean that this fine-tuning process can be thought of as training a NN that is 3 OOMs smaller, and thus needs 3 OOMs fewer training steps according to the scaling laws?
My guess is that the answer is mostly yes (maybe not the exact numbers predicted by existing scaling laws, but similar ballpark).
how does that not contradict the scaling laws for transfer described here and used in this calculation by Rohin?
I think this is mostly irrelevant to timelines / previous scaling laws for transfer:
You still have to pretrain the Transformer, which will take the usual amount of compute (my calculation that you linked takes this into account).
The models trained in the new paper are not particularly strong. They are probably equivalent in performance to models that are multiple orders of magnitude smaller trained from scratch. (I think when comparing against training from scratch, the authors did use smaller models because that was more stable, though with a quick search I couldn’t find anything confirming that right now.) So if you think of the “default” as “train an X-parameter model from scratch”, then to get equivalent performance you’d probably want to do something like “pretrain a 100X-parameter model, then finetune 0.1% of its weights”. (Numbers completely made up.)
I expect there are a bunch of differences in how exactly models are trained. For example, the scaling law papers work almost exclusively with compute-optimal training, whereas this paper probably works with models trained to convergence.
You probably could come to a unified view that incorporates both this new paper and previous scaling law papers, but I expect you’d need to spend a bunch of time getting into the minutiae of the details across the two methods. (Probably high tens to low hundreds of hours.)
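As a back-of-the-envelope illustration of the “similar ballpark” point above (my own sketch; the power-law form and the exponents are assumptions, not fitted constants from any scaling-law paper):

```python
# Assumed (made-up) power law: training steps to reach a target loss scale roughly
# as (number of trainable parameters) ** alpha. The exact exponent doesn't matter
# much for the ballpark conclusion, so we try a few values around "roughly linear".

def step_ratio(param_ratio, alpha):
    """Ratio of training steps needed when only `param_ratio` of weights are trainable."""
    return param_ratio ** alpha

param_ratio = 1e-3  # fine-tuning ~0.1% of the weights, i.e. 3 OOMs fewer parameters

for alpha in (0.7, 1.0, 1.3):
    print(alpha, step_ratio(param_ratio, alpha))
# Roughly 2 to 4 OOMs fewer steps: the same ballpark as "3 OOMs fewer",
# even though the exact number depends on the assumed exponent.
```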
Yes, that’s basically right.
You think I take the original argument to be arguing from ‘has goals’ to ‘has goals’, essentially, and agree that that holds, but don’t find it very interesting/relevant.
Well, I do think it is an interesting/relevant argument (because as you say it explains how you get from “weakly has goals” to “strongly has goals”). I just wanted to correct the misconception about what I was arguing against, and I wanted to highlight the “intelligent” --> “weakly has goals” step as a relatively weak step in our current arguments. (In my original post, my main point was that that step doesn’t follow from pure math / logic.)
In that case, my current understanding is that you are disagreeing with 2, and that you agree that if 2 holds in some case, then the argument goes through.
At least, the argument makes sense. I don’t know how strong its effect is—basically I agree with your phrasing here:
This force probably doesn’t exist out at the zero goal directedness edges, but it is unclear how strong it is in the rest of the space—i.e. whether it becomes substantial as soon as you move out from zero goal directedness, or is weak until you are in a few specific places right next to ‘maximally goal directed’.
Thanks, that’s helpful. I’ll think about how to clarify this in the original post.
You’re mistaken about the view I’m arguing against. (Though perhaps in practice most people think I’m arguing against the view you point out, in which case I hope this post helps them realize their error.) In particular:
Whatever things you care about, you are best off assigning consistent numerical values to them and maximizing the expected sum of those values
If you start by assuming that the agent cares about things, and your prior is that the things it cares about are “simple” (e.g. it is very unlikely to be optimizing the-utility-function-that-makes-the-twitching-robot-optimal), then I think the argument goes through fine. According to me, this means you have assumed goal-directedness in from the start, and are now seeing what the implications of goal-directedness are.
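To spell out why the “simple values” prior is doing the work here, a minimal sketch of the twitching-robot construction (my own code; the point is just that some utility function rationalizes any fixed behavior):

```python
# For ANY fixed behavior (sequence of actions), there is a utility function over
# complete action histories that the behavior maximizes: u(history) = 1 iff the
# history matches the behavior exactly. So "maximizes expected utility" by itself
# doesn't constrain behavior; the bite comes from additionally assuming the
# utility function is simple / about features of the world.

def make_rationalizing_utility(observed_actions):
    """Utility over action histories that is maximized exactly by `observed_actions`."""
    target = tuple(observed_actions)

    def utility(action_history):
        return 1.0 if tuple(action_history) == target else 0.0

    return utility

# A "twitching robot" history:
twitches = ["left", "left", "right", "left", "right"]
u = make_rationalizing_utility(twitches)
assert u(twitches) == 1.0 > u(["left"] * 5)  # the twitching is "optimal" under u
```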
My claim is that if you don’t assume that the agent cares about things, coherence arguments don’t let you say “actually, principles of rationality tell me that since this agent is superintelligent it must care about things”.
Stated this way it sounds almost obvious that the argument doesn’t work, but I used to hear things that effectively meant this pretty frequently. Those arguments usually go something like this:
By hypothesis, we will have superintelligent agents.
A superintelligent agent will follow principles of rationality, and thus will satisfy the VNM axioms.
Therefore it can be modeled as an EU maximizer.
Therefore it pursues convergent instrumental subgoals and kills us all.
This talk for example gives the impression that this sort of argument works. (If you look carefully, you can see that it does state that the AI is programmed to have “objects of concern”, which is where the goal-directedness assumption comes in, but you can see why people might not notice that as an assumption.)
----
You might think “well, obviously the superintelligent AI system is going to care about things, maybe it’s technically an assumption but surely that’s a fine assumption”. I think on balance I agree, but it doesn’t seem nearly so obvious to me, and seems to depend on how exactly the agent is built. For example, it’s plausible to me that superintelligent expert systems would not be accurately described as “caring about things”, and I don’t think it was a priori obvious that expert systems wouldn’t lead to AGI. Similarly, it seems at best questionable whether GPT-3 can be accurately described as “caring about things”.
----
As to whether this argument is relevant for whether we will build goal-directed systems: I don’t think that in isolation my argument should strongly change your view on the probability you assign to that claim. I see it more as a constraint on what arguments you can supply in support of that view. If you really were just saying “VNM theorem, therefore 99%”, then probably you should become less confident, but I expect in practice people were not doing that and so it’s not obvious how exactly their probabilities should change.
----
I’d appreciate advice on how to change the post to make this clearer—I feel like your response is quite common, and I haven’t yet figured out how to reliably convey the thing I actually mean.
But for more general infradistributions this need not be the case. For example, consider and take the set of a-measures generated by and . Suppose you start with dollars and can bet any amount on any outcome at even odds. Then the optimal bet is betting dollars on the outcome , with a value of dollars.
I guess my question is more like: shouldn’t there be some aspect of reality that determines what my set of a-measures is? It feels like here we’re finding a set of a-measures that rationalizes my behavior, as opposed to choosing a set of a-measures based on the “facts” of the situation and then seeing what behavior that implies.
I feel like we agree on what the technical math says, and I’m confused about the philosophical implications. Maybe we should just leave the philosophy alone for a while.
This probably isn’t the thing you mean, but your description kinda sounds like tessellating hills and its predecessor, demons in imperfect search.