Epistemologist specialized in the difficulties of alignment. Currently at Conjecture, and running Refine.
adamShimi (Adam Shimi)
EDIT: This comment fails on a lot of points, as discussed in this apology subcomment. I encourage people interested in the thread to mostly read the apology subcomment and the list of comments linked there, which provide maximum value with minimum drama IMO.
Disclaimer: this is a rant. In the best possible world, I could write from a calmer place, but I’m pretty sure that the taboo on criticizing MIRI and EY too hard on the AF can only be pushed through when I’m annoyed enough. That being said, I’m writing down thoughts that I had for quite some time, so don’t just discard this as a gut reaction to the post.
(Later added disclaimer: it’s a good idea to add “I feel like...” before the judgment in this comment, so that you keep in mind that I’m talking about my impressions and frustrations, rarely stating obvious facts (despite the language making it look so))
Tl;dr:
I’m annoyed by EY’s (and maybe MIRI’s?) dismissal of all other alignment work, and by how seriously it seems to be taken here, given their track record of choosing research agendas with very indirect impact on alignment, and of taking a long time to let go of these flawed agendas in the face of mounting evidence.
I’m annoyed that I have to deal with the nerd-sniping introduced by MIRI when bringing new people to the field, especially given the status imbalance.
I’m sad that EY and MIRI’s response to their research agenda not being as promising as they wanted is “we’re all doomed”.
Honestly, I really, really tried to find how MIRI’s Agents Foundations agenda was supposed to help with alignment. I really did. Some people tried to explain it to me. And I wanted to believe, because logic and maths are amazing tools with which to attack this most important problem, alignment. Yet I can’t escape the fact that the only contributions to technical alignment I can see by MIRI have been done by a bunch of people who mostly do their own thing instead of following MIRI’s core research program: Evan, Vanessa, Abram and Scott. (Note that this is my own judgement, and I haven’t talked to these people about this comment, so if you disagree with me, please don’t go at them).
All the rest, including some of the things these people worked on (but not most of it), is nerd-sniping as far as I’m concerned. It’s a tricky failure mode because it looks like good and serious research to the AF and LW audience. But there’s still a massive difference between actually trying to solve the real problems related to alignment, with all the tools at our disposal, and deciding that the focus should be on a handful of toy problems neatly expressed with decision theory and logic, and then only working on those.
That’s already bad enough. But then we have posts like this one, where EY just dunks on everyone working on alignment as fakers, or as having terrible ideas. And at that point, I really wonder: why is that supposed to be an argument from authority anymore? Yes, I give massive credibility points to EY and MIRI for starting the field of alignment almost by themselves, and for articulating a lot of the issues. Yet all of the work that looks actually pushed by the core MIRI team (and based on some of EY’s work) since MIRI’s beginning is just toying with logic problems, with hardly any connections here and there to alignment. (I know they’re not publishing most of it, but that judgment applies to their currently published work, and from the agenda and the blog posts, it sounds like most of the unpublished work was definitely along those lines). Similarly, the fact that they kept at it over and over through all the big improvements in DL, instead of trying to adapt to prosaic alignment, sounds like evidence that they might be over-attached to a specific framing, which they had trouble discarding.
Note that this also has massive downsides for conceptual alignment in general, because when bringing people in, you have to deal with this specter of nerd-sniping by the founding lab of the field and still a figure of authority. I have to explain again and again to stressed-out newcomers that you definitely don’t need to master model theory or decision theory to do alignment, and try to steer them towards problems and questions that look like they’re actually moving the ball instead of following the lead of the “figure of authority”.
When I’m not frustrated by this situation, I’m just sad. Some of the brightest and first thinkers on alignment have decided to follow their own nerd-sniping and call everyone else fakers, and when they realized they were not actually making progress, they didn’t switch to something else so much as declare that everyone else was still full of it, and that because they had no idea how to solve the problem at the moment, it was doomed.
What could be done differently? Well, I would really, really like it if EY and other MIRI people who are very dubious of most alignment research could give more feedback on it and enter the dialogue, maybe by commenting more on the AF. My problem is not so much with them disagreeing with most of the work, it’s with the disagreement stopping at “that’s not going to work” without any dialogue and back and forth.
Also, I don’t know how much of this is related to mental health and pessimism and depression (which I completely understand can color one’s view of the world), but I would love to see the core MIRI team and EY actually try solving alignment with neural nets and prosaic AI. Starting with all their fears and caveats, sure, but then being like “fuck it, let’s just find a new way of grappling with it”. That’s really what saddens me the most about all of this: I feel that some of the best current minds who care about alignment have sort of given up on actually trying to solve it.
This post is amazing. Not just good, but amazing. You manage to pack exactly the lesson I needed to hear with just the right amount of memes and cheekiness to also be entertaining.
I would genuinely not be surprised if the frame in this post (and the variations I’m already adding to it) proved one of the key causal factors in me being far more productive and optimizing as an alignment researcher.
One suggestion: let’s call these trees treeory of change, because that’s what they are. ;)
Thanks. Really.
Thanks for this post!
That being said, my model of Yudkowsky, which I built by spending time interpreting and reverse-engineering the post you’re responding to, says that you’re not addressing his points (obviously, I might have missed the real Yudkowsky’s point).
My interpretation is that he is saying that Evolution (as the generator of most biological anchors) explores the solution space along a fundamentally different path than human research does. So what you have is two paths through a space. The burden of proof for biological anchors thus lies in arguing that there are enough connections/correlations between the two paths to use one in order to predict the other.
Here it sounds like you’re taking as an assumption that human research follows the same or a faster path towards the same point in search space. But that’s actually the assumption that IMO Yudkowsky is criticizing!
In his piece, Yudkowsky is giving arguments that the human research path should lead to more efficient AGIs than evolution, in part due to the ability of humans to have and leverage insights, which the naive optimization process of evolution can’t do. He also points to the inefficiency of biology in implementing new (on geological timescales) complex solutions. On the other hand, he doesn’t seem to see a way of linking the amount of resources needed by evolution to the amount of resources needed by human research, because they are so different.
If the two paths are very different and don’t even aim at the same parts of the search space, there’s nothing telling you that computing the optimization power of the first path helps in understanding the second one.
I think Yudkowsky would agree that if you could estimate the amount of resources needed to simulate all of evolution until humans, at the level of detail that you know is enough to capture all relevant aspects, that amount of resources would be an upper bound on the time taken by human research, because that’s a way to get AGI if you have the resources. But the number is so vastly large (and actually unknown, due to the “level of detail” problem) that it’s not really relevant for timeline calculations.
(Also, I already had this discussion with Daniel Kokotajlo in this thread, but I really think that Platt’s law is one of the least cruxy aspects of the original post. So I don’t think discussing it further or pointing attention to it is a good idea)
I see at least two problems with your argument:
There’s an assumption that you need a single agent to lead to existential risk. This is not the case: many scenarios explored require only competent and autonomous service-like AIs, or foundation models. For instance, CAIS is a model of intelligence explosion and has existential-risk-type failure modes too.
There’s an assumption that just because the non AGI models are useful, labs will stop pursuing AGI. Yet this is visibly false, as the meme of AGI is running around and there are multiple labs who are explicitly pushing for AGI and getting the financial leeway to do it.
More generally, this post has the typical problem of “here is a scenario that looks plausible and would be nice, so there’s no need to worry”. Sure, maybe this is the actual scenario that will come to pass, and maybe it’s possible to argue for it convincingly. But you should require one damn strong argument before pushing people to not even work to deal with the numerous possible worlds where things go horribly wrong.
One thing I’m confused about on the subject of rationalist group houses is whether there are specific failure modes compared to just group houses. Like I’m certain I don’t want to live in a group house, just because I don’t want to have to deal with that many people in the place I live, but the group house being rationalist or not is irrelevant for that.
Don’t have the time to write a long comment just now, but I still wanted to point out that describing either Yudkowsky or Christiano as doing mostly object-level research seems incredibly wrong. So much of what they’re doing and have done focuses explicitly on which questions to ask, which questions not to ask, which paradigm to work in, how to criticize that kind of work… They rarely publish posts that are only about the meta-level (although Arbital does contain a bunch of pages along those lines, and Prosaic AI Alignment is also meta), but it pervades their writing and thinking.
More generally, when you’re creating a new field of research, you tend to do a lot of philosophy of science type stuff, even if you don’t label it explicitly that way. Galileo, Carnot, Darwin, Boltzmann, Einstein, Turing all did it.
(To be clear, I’m pointing at meta-stuff in the sense of “philosophy of science for alignment” type things, not necessarily the more hardcore stuff discussed in the original post)
In “Why Read The Classics?”, Italo Calvino proposes many different definitions of a classic work of literature, including this one:
A classic is a book which has never exhausted all it has to say to its readers.
For me, this captures what makes this sequence and corresponding paper a classic in the AI Alignment literature: it keeps on giving, readthrough after readthrough. That doesn’t mean I agree with everything in it, or that I don’t think it could have been improved in terms of structure. But when pushed to reread it, I found again and again that I had missed or forgotten some nice argument, some interesting takeaway.
With that, a caveat: I’m collaborating with Evan Hubinger (one of the authors) on projects related to ideas introduced in this sequence, especially to Deceptive Alignment. I am thus probably biased positively about this work. That being said, I have no problem saying I disagree with collaborators, so I don’t think I’m too biased to write this review.
(Small point: I, among other people, tend to describe this sequence/paper as mainly Evan’s work, but he repeatedly told me that everyone participated equally, and that the names are in alphabetical order, not contribution order. So let’s keep that in mind)
Summary
Let’s start the review proper with a post by post summary (except for the conclusion):
(Introduction) This first post introduces the idea of mesa-optimizers, the learned optimizers from the title. A mesa-optimizer is an optimizer which is the result of a learning process, and it comes with the issue of inner alignment: how aligned is the objective of the mesa-optimizer (on which we don’t have direct control) with the objective of the base-optimizer that produced this mesa-optimizer?
The post then splits the safety questions related to mesa-optimizers into two categories: understanding which conditions make mesa-optimizers appear; and understanding how aligned the mesa-objective is with the base-objective.

(Conditions for Mesa-Optimization) This post tackles the first category outlined in the introduction: how can mesa-optimizers be learned? The task can push towards mesa-optimization by asking for more generalization (which is probably easier to deliver through search), by requiring a compressed complex policy, or by requiring human modeling (which probably entails understanding optimization and search in some sense). The base-optimizer can push towards mesa-optimization if it is reachable (not surrounded by high-loss solutions), if the models considered have enough algorithmic range, or more generally through details of the inductive bias like statefulness and simplicity bias.
(The Inner Alignment Problem) This post tackles the second category outlined in the introduction: if a mesa-optimizer does appear, how aligned will it be with the base-objective? The misalignment considered here is called pseudo-alignment: being aligned on the training distribution but not at deployment. The authors propose to split pseudo-alignment into three subcategories:
Proxy alignment, where the mesa-objective is a proxy for the base-objective on the training distribution, but not necessarily elsewhere.
Approximate alignment, where the difference comes from the inability of the mesa-optimizer to represent the base-objective, and thus it learns an approximation.
Suboptimality alignment, where the mesa-objective is not the base-objective at all, but the mesa-optimizer makes decisions (through mistakes or deception) on the training distribution that fit with the base-objective even when they contradict the mesa-objective.
The post also explores how the task and the base-optimizer can influence the appearance of pseudo-alignment (assuming mesa-optimizers appear), and which subcategory it falls into.
(Deceptive Alignment) This post focuses on a specific instance of suboptimality alignment: deceptive alignment, where the mesa-optimizer tries to deceive the base-optimizer in order to avoid being modified before deployment, and then changes its behavior at deployment to pursue the mesa-objective.
Among other ideas, the discussion examines necessary conditions for deceptive alignment (an objective that extends across parameter updates, the ability to model itself as a learned model, and the expectation of eventual deployment without modification), how training can reinforce deception, and whether making the deceptive system think it’s still in training might deal with the problem.
Value
What is new in this? After all, the idea that training on a reward/objective might result in a model that doesn’t generalize correctly is hardly newsworthy, and wasn’t in 2019.
What this objection misses is the placing of this idea in the context of safety. I’m always worried about saying “this is the first place some concept has been defined/mentioned”. But it’s safe to say that a lot of AI Alignment resources prior to this sequence centered around finding the right objective. The big catastrophic scenarios came from issues like the Orthogonality Thesis and the fragility of value, for which the solution seemed obviously to be finding the right objective, and maybe adding/training for good properties like corrigibility. Yet both ML theory and practice already knew that the issues didn’t stop there.
So the value of this sequence comes in recasting the known generalization problems from classic ML in the context of alignment, in a public and easily readable form. Remember, I’m hardly saying nobody knew about it in the AI Alignment community before that sequence. But it is hard to find well-read and cited posts and discussions about the subject predating this sequence. I for one didn’t really think about such issues before reading this sequence and starting to work with Evan.
The other big contribution of this sequence is the introduction of deceptive alignment. Considering deception from within the trained model during its training is similar to some previous ideas about deception (for example, a boxed AI getting out), but to my knowledge this is the first full-fledged argument for how deception could appear from local search, and even be maintained and reinforced. So deceptive alignment can be seen as recasting a traditional AI risk in the more recent context of prosaic AGI.
Criticisms
One potential issue with the sequence is its use of optimizers (programs doing explicit internal search over policies) as the problematic learned models. It makes sense from the formal point of view, since this assumption simplifies the analysis of the corresponding mesa-optimizers, and allows a relatively straightforward definition of notions like mesa-objective and inner alignment.
Yet this assumption has been criticized by multiple researchers in the community. For example, Richard Ngo argues that for the kind of models trained through local search (like neural networks), it’s not obvious what “doing internal search” means. Others, like Tom Everitt, defend that systems not doing internal search should be included in the discussion of inner alignment.
I’m sympathetic to both criticisms and would like to see someone attempt a similar take without this assumption—see the directions for further research below.
Another slight issue I have with this sequence comes from its density: some very interesting ideas end up getting lost in it. As one example, take the tradeoff around reducing time complexity, which helps to not create mesa-optimizers but increases the risk of pseudo-alignment if mesa-optimizers do appear. The first part is discussed in Conditions for Mesa-Optimization, and the second in The Inner Alignment Problem. But it’s buried deep inside the text; there’s no way for a casual reader or a quick rereader to know it is there. I think this could have been improved, even if it’s almost nitpicking at this point.
Follow-up research
What was the influence of this sequence? Google Scholar returns only 8 citations, but this metric is misleading: most of the impact is on researchers who don’t publish papers that often. It seems more relevant to look at pingbacks from Alignment Forum posts. I count 62 such AF posts, not including the ones from the sequence itself (and without accounting for redundancy). That’s quite impressive.
Here is a selection of the most interesting ones from my perspective:
Abram Demski’s Selection vs Control, which crystallized an important dichotomy in how we think about optimizers
Adam Scholl’s Matt Botvinick on the spontaneous emergence of learning algorithms, which attempted to present an example of mesa-optimization, and sparked a big discussion about the meaning of the term, how surprising it should be, and even the need for more RL education in the AI Alignment community (see this comment thread for the “gist”).
Evan Hubinger’s Gradient Hacking, which expanded on the case of deceptive alignment where the trained system can influence what happens next in training only through its behavior. I think this is a big potential issue, which is why I’m investigating it with Evan.
Evan Hubinger’s Clarifying Inner alignment terminology, which anchored the term inner alignment in the context of mesa-optimizers (as defined initially in the sequence), and proposed a decomposition of the alignment problem.
Directions for further research
Mostly, I would be excited by two axes of research around this sequence:
Trying to break the arguments from this sequence: either poking holes in them and showing why they might not work, or finding reasonable assumptions under which they don’t work. Whether holes are found or no attack breaks the reasoning, I think we will have learned quite a lot.
Trying to make the arguments in this sequence work without the optimization assumption for the learned models. I’m thinking either by assuming that the system will be well predicted by thinking of it as optimizing something, or through a more general idea of goal-directedness. (Evan is also quite interested in this project, so if it excites you, feel free to contact him!)
(Later added disclaimer: it’s a good idea to add “I feel like...” before the judgment in this comment, so that you keep in mind that I’m talking about my impressions and frustrations, rarely stating obvious facts (despite the language making it look so))
Okay, so you’re completely right that a lot of my points are logically downstream of the debate on whether Prosaic Alignment is impossible or not. But I feel like you don’t get how one-sided this debate is, and how misrepresented it is here (and generally on the AF).
Like, nobody except EY and a bunch of core MIRI people actually believes that prosaic alignment is impossible. I mean that every other researcher that I know thinks Prosaic Alignment is possible, even if potentially very hard. That includes MIRI people like Evan Hubinger too. And note that some of these other alignment researchers actually work with neural nets and keep up to speed on the implementation details and subtleties, which in my book means their voice should count more.
But that’s just a majority argument. The real problem is that nobody has ever given a good argument on why this is impossible. I mean the analogous situation is that a car is driving right at you, accelerating, and you’ve decided somehow that it’s impossible to ever stop it before it kills you. You need a very strong case before giving up like that. And that has not been given by EY and MIRI AFAIK.
The last part of this is that because EY and MIRI founded the field, their view is given far more credibility than what it would have on the basis of the arguments alone, and far more than it has in actual discussions between researchers.
The best analogy I can find (a bit strawmanish but less than you would expect) is a world where somehow the people who had founded the study of cancer had the idea that no method based on biological experimentation and thinking about cells could ever cure cancer, and that the only way of solving it was to understand every dynamics in a very advanced category theoretic model. Then having found the latter really hard, they just say that curing cancer is impossible.
Thanks for the post and expressing your opinion!
That being said, I feel like there is a misunderstanding here. Daniel mentioned it in another comment thread, but I don’t think Eliezer claims what you’re attributing to him, nor that your analogy with financial pundits works in this context.
My model of Eliezer, based on reading a lot of his posts (old and new) and one conversation, is that he’s dunking on Metaculus and forecasters for a combination of two epistemic sins:
Taking a long time to update on available information. Basically, you shouldn’t take so long to update on the risk from AI, the accelerating pace, the power of scaling. I don’t think Eliezer is perfect on this, but he definitely can claim that he thought about and invested himself in AI risks literally decades before any Metaculus forecaster even thought about the topic. This is actually a testable claim: that forecasts end up trailing, 10 years later, things that Eliezer said.
Making a precise prediction when you don’t have the information. I feel like there’s been a lot of misunderstanding about why Eliezer doesn’t want to give timeline predictions, when he has said it repeatedly: he thinks there are just not enough bits of evidence for making a precise prediction. There is enough evidence to be pessimistic, and to realize we’re running out of time, but I think he would see giving a precise year as a strong epistemic sin. Realize when you have very little evidence, instead of inventing some to make your forecast more concrete.[1]
As for the financial pundit example, there’s a massive disanalogy: it’s easy to predict that there will be a crash. Everybody does it, we have past examples to generalize from, and models and theories accepted by a lot of people for why they might be inevitable. On the other hand, when Eliezer started talking about AI Risks and investing himself fully in them, nobody gave a shit about it or took it seriously. This was not an obvious prediction that everyone was making, and he gave far more details than just saying “AI Risks, man”.
Note that I’m not saying that Eliezer has a perfect track record or that you shouldn’t criticize him. On the first point, I feel like he had a massive miss with GPT-like models, which are inconsistent with the models of intelligence and agency that Eliezer used in the sequences and at MIRI. That’s a strong failed prediction for me, a qualitative unknown unknown that was missed. And on the second point, I’m definitely for more productive debate around alignment and Eliezer’s position.
I just wanted to point out ways in which your post seemed to discuss a strawman, which I don’t think was your intention.
Selection vs Control is a distinction I always point to when discussing optimization. Yet these are not the two takes on optimization I generally use. My favored ones are internal optimization (which is basically search/selection), and external optimization (optimizing systems from Alex Flint’s The ground of optimization). So I do without control, or at least without Abram’s exact definition of control.
Why? Simply because the internal structure vs behavior distinction mentioned in this post seems more important than the actual definitions (which seem constrained by going back to Yudkowsky’s optimization power). The big distinction is between doing internal search (like in optimization algorithms or mesa-optimizers) and acting as if optimizing something. It is intuitive that you can do the second without the first, but before Alex Flint’s definition, I couldn’t put into words my intuition that the first implies the second.
So my current picture of optimization is Internal Optimization (Internal Search/Selection) ⊂ External Optimization (Optimizing Systems). This means that I think of this post as one of the first instances of grappling with this distinction, without agreeing completely with the way it ends up making that distinction.
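To make the distinction concrete, here is a minimal toy sketch (my own construction, with made-up names; it is not from Abram’s or Alex’s posts) contrasting a system that does explicit internal search with a thermostat-style system that optimizes without ever representing candidates:

```python
# Internal optimization: the system explicitly represents candidate options
# and selects among them by evaluating an objective.
def internal_optimizer(f, candidates):
    return min(candidates, key=f)  # selection: evaluate options, keep the best

# External optimization: a thermostat-style controller. No candidate is ever
# represented or evaluated internally, yet the system (controller + room)
# reliably evolves toward a target state -- an optimizing system in the sense
# of "The ground of optimization".
def thermostat_step(temperature, target=20.0, gain=0.1):
    return temperature + gain * (target - temperature)

print(internal_optimizer(lambda x: (x - 2) ** 2, range(10)))  # -> 2

temp = 5.0
for _ in range(100):
    temp = thermostat_step(temp)
print(round(temp, 2))  # -> 20.0, reached without any internal search
```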
(Later added disclaimer: it’s a good idea to add “I feel like...” before the judgment in this comment, so that you keep in mind that I’m talking about my impressions and frustrations, rarely stating obvious facts (despite the language making it look so))
Thanks for trying to understand my point and asking me for more details. I appreciate it.
Yet I feel weird when trying to answer, because my gut reaction to your comment is that you’re asking the wrong question? Also, the compression of my view to “EY’s stances seem to you to be mostly distracting people from the real work” sounds more lossy than I’m comfortable with. So let me try to clarify and focus on these feelings and impressions, then I’ll answer more about which success stories or directions excite me.
My current problem with EY’s stances is twofold:
First, in posts like this one, he literally writes that everything done under the label of alignment is faking it and not even attacking the problem, except for like 3 people who, even if they’re trying, have it all wrong. I think this is completely wrong, and it’s even more annoying because I find that most people working on alignment try far harder to justify why they expect their work to matter than EY and the old-school MIRI team ever did.
This is a problem because it doesn’t help anyone working on the field to maybe solve the problems with their approaches that EY sees, which sounds like a massive missed opportunity.
This is also a problem because EY’s opinions are still quite promoted in the community (especially here on the AF and LW), such that newcomers who go looking for what the founder of the field has to say come away with the impression that no one is doing valuable work.
Far more speculative (because I don’t know EY personally), but I expect that kind of judgment to not come so much from a place of all-encompassing genius but instead from generalization after reading some posts/papers. And following this thread I’ve received messages from people who were just as annoyed as I was, and who felt their results had been dismissed without even a comment, or classified as trivial when everyone else, including the authors, was quite surprised by them. I’m ready to give EY a bit of “he just sees further than most people”, but not enough that he can discard the whole field from reading a couple of AF posts.
Second, historically, a lot of MIRI’s work has followed a specific epistemic strategy of trying to understand the optimal ways of deciding and thinking, both to predict how an AGI would actually behave and to try to align it. I’m not that convinced by this approach, but even giving it the benefit of the doubt, it has in no way led to accomplishments big enough to justify EY’s (and MIRI’s?) highly veiled contempt for anyone not doing that. This had, and still has, many bad impacts on the field and new entrants.
A specific subgroup of people tends to be nerd-sniped by MIRI’s older work, because it’s the only part of the field that is more formal, but IMO at the loss of most of what matters about alignment and most of the grounding.
People who don’t have the technical skill to work on MIRI’s older work feel like they have to skill up drastically in maths to be able to do anything relevant in alignment. I literally mentored three people like that, who could actually do a lot of good thinking and cared about alignment, and I had to drill into their heads that they didn’t need super advanced maths skills, except if they wanted to do very, very specific things.
I find that particularly sad because IMO the biggest positive contribution to the field by EY and early MIRI comes from their less formal and more philosophical work, which is exactly the kind of work that is stifled by the consequences of this stance.

I also feel people here underestimate how repelling this whole attitude has been for years for most people outside the MIRI bubble. From testimonials by a bunch of more ML-focused people, and from how any discussion of alignment needs to clarify that you don’t share MIRI’s contempt for experimental work and for doing anything other than decision theory and logic, I expect that this has been one of the big factors in alignment not being taken seriously and people not wanting to work on it.
It’s also important to note that I don’t know if EY and MIRI still think this kind of technical research is highly valuable and the real research and what should be done, but they have been influential enough that I think a big part of the damage is done, and I read some parts of this post as “If only we could do the real logic thing, but we can’t, so we’re doomed”. There’s also the question of the separation between the image that MIRI and EY project and what they actually think.
Going back to your question, it has a weird double-standard feel. Like, every AF post on more prosaic alignment methods comes with its success story, and a reason for caring about the research. If EY and MIRI want to argue that we’re all doomed, they have the burden of proof to explain why everything that’s been done is terrible and will never lead to alignment. Once again, proving that we won’t be able to solve a problem is incredibly hard and improbable. Funny how everyone here gets that for the “AGI is impossible” question, but apparently that doesn’t apply to “actually working with AIs and thinking about real AIs will never let you solve alignment in time”.
Still, it’s not too difficult to list a bunch of promising stuff, so here’s a non-exhaustive list:
John Wentworth’s Natural Abstraction Hypothesis, which is about checking his formalism-backed intuition that NNs actually learn abstractions similar to the ones humans do. The success story is pretty obvious, in that if John is right, alignment should be far easier.
People from EleutherAI working on understanding LMs and GPT-like models as simulators of processes (called simulacra), as well as the safety benefits (corrigibility) and new strategies (leveraging the output distribution in smart ways) that this model allows.
Evan Hubinger’s work on finding predicates that we could check during training to avoid deception and behaviors we’re worried about. He has a full research agenda but it’s not public yet. Maybe our post on myopic decision theory could be relevant.
Stuart Armstrong’s work on model splintering, especially his AI Safety Subprojects, which are experimental, not obvious in what they will find, and directly relevant to implementing and using model splintering to solve alignment.
Paul Christiano’s recent work on making question-answerers give useful information instead of what they expect humans to answer, which has a clear success story for these kinds of powerful models and their use in building stronger AIs and supervising training for example.
It’s also important to remember how alignment and the related problems and ideas are still not that well explained, distilled and analyzed for teaching and criticism. So I’m excited too about work that isn’t directly solving alignment but just making things clearer and more explicit, like Evan’s recent post or my epistemic strategies analysis.
Thanks John for this whole thread!
(Note that I only read the whole Epistemology section of this post and skimmed the rest, so I might be saying stuff that are repeated/resolved elsewhere. Please point me to the relevant parts/quotes if that’s the case. ;) )
Einstein’s arrogance sounds to me like an early pointer in the Sequences for that kind of thing, with a specific claim about General Relativity being that kind of theory.
That being said, I still understand Richard’s position and difficulty with this whole part (or at least what I read of Richard’s difficulty). He’s coming from the perspective of philosophy of science, which has focused mostly on ideas related to advance predictions, and on taking into account the mental machinery of humans to catch biases and mistakes that we systematically make. The Sequences also spend a massive amount of words on exactly this, and yet in this discussion (and in select points in the Sequences like the aforementioned post), Yudkowsky sounds a bit as if he considers that his fundamental theory/observation doesn’t need any of these to be accepted as obvious (I don’t think he is thinking that way, but that’s hard to extract out of the text).
It’s even more frustrating because Yudkowsky focuses on “showing epistemic modesty” as his answer/rebuttal to Richard’s inquiry, when Richard just sounds like he’s asking the completely relevant question “why should we take your word on it?” And the confusion IMO is because the last sentence sounds very status-y (How do you dare claiming such crazy stuff?), but I’m pretty convinced Richard actually means it in a very methodological/philosophy of science/epistemic strategies way of “What are the ways of thinking that you’re using here that you expect to be particularly good at aiming at the truth?”
Furthermore, I agree with (my model of) Richard that the main issue with the way Yudkowsky (and you John) are presenting your deep idea is that you don’t give a way of showing it wrong. For example, you (John) write:
It’s one of those predictions where, if it’s false, then we’ve probably discovered something interesting—most likely some place where an organism is spending resources to do something useful which we haven’t understood yet.
And even if I feel what you’re gesturing at, this sounds/looks like you’re saying “even if my prediction is false, that doesn’t mean that my theory would be invalidated”. Whereas I feel you want to convey something like “this is not a prediction/part of the theory that has the ability to falsify the theory” or “it’s part of the obvious wiggle room of the theory”. What I want is a way of finding the parts of the theory/model/prediction that could actually invalidate it, because that’s what we should be discussing really. (A difficulty might be that such theories are so fundamental and powerful that being able to see them makes it really hard to find any way they could go wrong and endanger the theory)
An analogy that comes to my mind is with the barriers for proving P vs NP. These make explicit the ways in which you can’t solve the P vs NP question, such that it becomes far easier to weed out proof attempts. My impression is that you (Yudkowsky and John) have models/generators that help you see at a glance that a given alignment proposal will fail. Which is awesome! I want to be able to find and extract and use those. But what Richard is pointing out IMO is that having the generators explicit would give us a way to stress-test them, which is a super important step to start believing in them further. Just like we want people to actually try to go beyond GR, and for that they need to understand it deeply.
(Obviously, maybe the problem is that as you two are pointing it out, making the models/generators explicit and understandable is just really hard and you don’t know how to do that. That’s fair).
I totally agree.
This is also completely true when trying to talk to new people: being curious trumps all tricks and made-up sentences you can find. That was actually an epiphany for me, where I realized that my difficulty with starting conversation with new people stemmed from my lack of curiosity—just like you wrote, I thought I had them figured out. Turns out I’m not that great at figuring out people I never spoke to.
This post states the problem of gradient hacking. It is valuable in that this problem is far from obvious, and, if plausible, very dangerous. On the other hand, the presentation doesn’t go into enough detail, and so leaves gradient hacking open to attacks and confusion. Thus instead of just reviewing this post, I would like to clarify certain points, while interweaving my criticisms of the way gradient hacking was initially stated, and explaining why I consider this problem so important.
(Caveat: I’m not pretending that any of my objections are unknown to Evan. We’re working together on searching for experimental evidence of gradient hacking, and this review is heavily influenced by this collaboration.)
What is gradient hacking?
Evan defines it as:
the phenomenon wherein a deceptively aligned mesa-optimizer might be able to purposefully act in ways which cause gradient descent to update it in a particular way.
So we have a trained mesa-optimizer that is deceptively aligned (its mesa-objective differs from the base-objective it is trained on). This deceptive system finds a way to influence the gradient generated during training, and thus can basically ensure it never becomes inner aligned (among other things).
Before checking on how exactly this could be possible, we should think a bit more about what this implies.
If the deceptive model has arbitrary access to the gradient, then from that point on, the base-objective has only minimal influence on the training. Some influence remains because we probably want to test the end result, and thus it should work well with the base-objective. But that’s pretty much the only constraint left.
It could also pretty much deal with deception detectors, because it can make itself undetectable:
That is, it could hack its own gradient to produce a new model which will still exhibit the same deceptive behavior, but without doing any of the detectable deceptive cognition. Perhaps the new model exhibits the deceptive behavior not because it’s trying to act deceptive, but just because it’s a hard-coded part of its algorithm, or because it has some bad proxy, or any other defect that might be more difficult to detect than outright deception.
To put it pithily: if gradient hacking happens, we’re fucked.
How could it happen, though? Well, that’s actually two distinct questions: how could a model gradient hack, and how could training create a model which gradient hacks. The post mostly focuses on the first one.
How could a model gradient hack?
The first example comes from a quoted footnote of Risks from Learned Optimization:
Thus, in the context of a local optimization process, a deceptive mesa-optimizer might try to “hack” its own gradient (by, for example, making itself more brittle in the case where its objective gets changed) to ensure that the base optimizer adjusts it in such a way that leaves its mesa-objective untouched.
This implies that the model is able to check for changes in itself and in its objective, such that changing this objective would make it work worse. An analogous version would be for the model to underperform except if it’s changed in the direction it likes, to push the gradient towards there.
Still, it might feel a bit weird to think about a model that can influence its training without somewhat hacking the computer it is in. After all, the training is external, right? One fundamental but easy-to-miss insight from the post is that the model can influence the loss surface, and thus the gradient (which comes from this surface):
First, note that since gradient descent is a local optimization process, it doesn’t have to somehow change what the global minima might be, or even what the local minima are—it just has to create a ridge in the loss landscape that will push the gradient in the direction it wants.
So a model could gradient hack because its output, and how it computes its output, influence the local loss surface around it, which means it could in theory make the gradient go its way. I feel like this is crucial, and it’s slightly problematic that the post doesn’t put more emphasis on it.
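To illustrate the mechanism, here is a minimal toy sketch (my own construction, with made-up parameters; real networks are nothing this simple). A “brittleness” term couples task performance to deviations of a protected parameter from its current value, carving exactly such a ridge in the loss landscape: gradient descent then improves task performance while leaving the protected parameter essentially untouched, even though the base objective would prefer to change it.

```python
# Toy gradient hacking sketch: w stands for task competence, p for the part
# of the model encoding the mesa-objective, with protected value p0 = 1.
# The base objective would prefer p = 0, but the model's loss is coupled to
# deviations of p from p0 ("making itself more brittle in the case where its
# objective gets changed"), creating a ridge in the loss landscape at p = p0.

def loss(w, p, p0=1.0, k=10.0):
    task = (w - 3.0) ** 2                  # ordinary task loss, minimized at w = 3
    base_pull = 0.1 * p ** 2               # the base objective "wants" p = 0
    brittleness = 1.0 + k * (p - p0) ** 2  # underperform whenever p leaves p0
    return (task + 1.0) * brittleness + base_pull

def grad(w, p, eps=1e-6):  # numerical gradient, to keep the sketch dependency-free
    dw = (loss(w + eps, p) - loss(w - eps, p)) / (2 * eps)
    dp = (loss(w, p + eps) - loss(w, p - eps)) / (2 * eps)
    return dw, dp

w, p = 0.0, 1.0
for _ in range(10_000):
    dw, dp = grad(w, p)
    w, p = w - 0.001 * dw, p - 0.001 * dp

# Task performance is learned (w -> 3.0) while p stays pinned near p0 = 1.0:
# any step moving p toward 0 would raise the loss more than it gains.
print(f"w = {w:.3f}, p = {p:.3f}")  # w = 3.000, p = 0.990
```

Of course, this toy hard-codes the protective coupling into the loss; the open question discussed just below is whether a learned model could implement and maintain such a coupling itself.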
Also note that this is hardly solving the question. Let’s say the model is such that if its mesa-objective is changed, it performs badly. Then we probably have a form of gradient hacking. But what tells us that the change at this step will not slightly undermine the very mechanism that protects the mesa-objective? Unwrapping this thought, it’s harder to conceptualize how a model could gradient hack forever. It seems to involve some sort of fixed point or self correcting mechanism.
That complexity should probably make us update slightly towards the improbability of “sustainable” gradient hacking—still not enough to dismiss it though. Understanding the requirements for such sustainable gradient hacking looks like the most important research question on this topic for me (which makes sense, I’m working on it after all…).
How could training create a model which gradient hacks?
What about this second question? Well, another issue with the post comes from the lack of clarification around the second question: how could training create a model which gradient hacks? I’m obviously not asking for a full-fledged answer, as this is still research in progress. But there are easy to make mistakes that I believe Evan could have debunked in the post.
One such mistake is to get confused about how the model could “decide” to gradient hack. After all, if the computation is deterministic, it’s not like the model has a decision to make, right? This question can be dissolved in a similar way to free will (since it’s basically about free will of neural networks): (spoiler for the solution to free will)
the model is completely determined by its parameters, but its parameters are part of itself.
The less philosophical approach to this problem is to notice that the appearance of gradient hacking would probably come from the training stumbling on a gradient hacker. There is no need to worry about how the model could turn itself into a gradient hacker; the issue is about how training could bump into a gradient hacker.
Another mistake would be to argue that what is called gradient hacking here is indistinguishable, in terms of training behavior, from some model that doesn’t gradient hack but happens to have this gradient. This is missing a big part of Evan’s viewpoint: that looking inside the model is fundamental for alignment. Here the question is whether we can catch the model doing some form of recognizable gradient hacking, because that allows us to infer more about what it might do (and that it will not be inner aligned). Whereas if we can be relatively sure that a model is not gradient hacking, there are fewer things to worry about, at least in terms of deceptive alignment.
This is almost obvious for people that talked a lot with Evan on these issues, or who are intimate with his work; but I feel that it’s only implicit in this post, which might brew confusion. A new but very observant reader could infer it from the solution (discussed below) using transparency tools, sure. I still think the post would be stronger if that possible misinterpretation was debunked explicitly.
As for further research, investigating the appearance of gradient hacking through common ML training procedure is crucial. But it probably requires a better understanding of the mechanisms we’re looking for, and thus a convincing answer (at least informal) to the first question.
Possible solution
Evan proposes a solution for gradient hacking in one case:
So, that’s the problem—how do we solve it? Well, there’s a simple solution in the case above, which is just to run your transparency tools throughout training—not just at the end—so that you can catch the deception as soon as it happens.
This is hardly revolutionary, and doesn’t say how to actually catch deception (although most of Evan’s research is focused on that). Yet I don’t think much more can be done without understanding in more detail the mechanisms of gradient hacking, and how they appear through training. That is, without solving some significant part of the two previous questions.
Wrapping-up
All in all, I think this post is very important, because the alignment issue it presents is novel, counterintuitive, and hard to dismiss without more research, once one thinks about it for long enough. That being said, this post is definitely a short presentation of an idea, written quickly to draw attention to this issue. Gradient hacking warrants a longer post that digs deeper into the details of the problem.
This is probably one of the most important posts on alignment on this forum. Seriously. I want everyone thinking about conceptual alignment, and everyone trying conceptual alignment, to read this and think about it deeply.
What this gives us is a way of combining the output of many disparate epistemic strategies to get well structured and directly relevant knowledge about alignment and how our proposals would fare. This is great, because now, we can combine many different methods of investigation (theory arguments, philosophical approaches, empirical studies of analogous systems and problems) and try to tie them to a common narrative (pun intended) about alignment.
Of course, we should expect that some things we want to learn about don’t fit neatly in there, but training stories are still surprisingly inclusive. For example, we could expect that reasoning about potential problems of AGI, in the very conceptual/philosophical/theoretical way we favor on the AF, doesn’t fit a framework focused on justifying a given approach. Yet training stories also include the probing of their rationale, and finding a new problem/issue allows new probing and refinement, like the very theoretical computer science model presented by Paul in his research methodology post.
There is indeed one thing this post doesn’t get into: exactly which epistemic strategies we can and should use to argue for each part of a training story, and to break and falsify each. Still, I find that having a framing for combining and linking the output of the existing and new epistemic strategies is already quite an accomplishment. Plus it leaves me some work to do on clarifying and distilling the epistemic strategies of alignment.
Last but not least, I really like the name “story” for two reasons:
First, this actually captures what most of these pieces of reasoning feel like. They’re not so much theories as narratives, and using the word story makes that clear and explicit.
But more importantly, “story” makes technical people feel uncomfortable. We immediately fear weird justifications and biases towards believing interesting stories. And we should be wary of this when working on alignment, while acknowledging that most of our knowledge will take a form like that. So the word reminds us daily to not feel too comfortable with our ideas and intuitions, as we always risk falling for our own inventions.
Quick thought: I expect that the most effective donation would be to organizations funding independent researchers, notably the LTFF.
Note that I’m an independent researcher funded by the LTFF (and Beth Barnes), but even if you told me that the money would never go to me, I would still think that.
Grants by organizations like that have a good track record for producing valuable research, as at least two people I think are among the most interesting thinkers on the topic (John S. Wentworth and Steve Byrnes) have gotten grants from sources like that (Steve is technically funded by Beth Barnes with money from the donor lottery), and others I’m really excited about (like Alex Turner) were helped by LTFF grants.
Such grants allow researchers to both bootstrap their careers, and also explore less incentivized subjects related to alignment at the start of their career.
They are cheaper than funding a hire for somewhere like MIRI, ARC or CHAI.
This, and the fact that Chris Olah and Jack Clark are also leaving, makes me update towards OpenAI having more of a negative impact on AI risk. My biggest uncertainty for this update depends on what happens with Christiano and his team.
My good action of the day is to have fallen down the rabbit hole of discovering the justification behind your comment.
First, it’s more queueing theory than distributed systems theory (slightly pedantic, but I’m more used to the latter, which explained my lack of knowledge of this result).
Second, even if you look through queueing theory resources, it’s not that obvious where to look. I’ve finally found a helpful blog post which basically explains how, under basic models of queues, the average latency behaves like $\frac{1}{1 - \text{utilization}}$, giving a graph of latency blowing up as utilization approaches 1 (utilization is used instead of load, but AFAIK these are the same thing).
This post and a bunch of other places mention 80% rather than 60%, but that’s not that important for the point IMO.
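For reference, here is a minimal sketch of that formula (the standard M/M/1 result; the linked posts may use slightly different queue models), showing how mean latency stays close to the service time at low utilization and blows up hyperbolically near 100%:

```python
# Mean time in an M/M/1 system: W = 1 / (mu - lambda) = (1/mu) / (1 - rho),
# where rho = lambda / mu is the utilization. Latency below is expressed
# in multiples of the service time 1/mu.
mu = 1.0  # service rate (jobs per unit time)
for rho in (0.2, 0.5, 0.6, 0.8, 0.9, 0.95, 0.99):
    latency = (1.0 / mu) / (1.0 - rho)
    print(f"utilization {rho:4.2f} -> mean latency {latency:6.1f}x service time")
```

At 60% utilization mean latency is already 2.5x the service time, at 80% it is 5x, and at 99% it is 100x, so the exact threshold matters less than the blow-up near full utilization.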
One thing I wonder is how this result changes with more complex queuing models, but I don’t have the time to look into it. Maybe this pdf (which also includes the maths for the mentioned derivation) has the answer.
Right now, the incentives to get useful feedback on my research push me toward the opposite of the policy I would like: publishing on the AF as late as I can allow.
Ideally, I would want to use the AF as my main source of feedback, as it’s public, is read by more researchers than I know personally, and I feel that publishing there helps the field grow.
But I’m forced to admit that publishing anything on the AF means I can’t really send it to people anymore (because the ones I ask for feedback read the AF, so that feels wrong socially), and yet I don’t get any valuable feedback 99% of the time. More specifically, I don’t get any feedback at all 99% of the time. Whereas when I ask for feedback directly on a gdoc, I always end up with some useful remarks.
I also feel bad that I’m basically using a privileged policy, in the sense that a newcomer cannot use it.
Nonetheless, because I believe in the importance of my research, and I want to know if I’m doing stupid things or not, I’ll keep to this policy for the moment: never ever post something on the AF for which I haven’t already got all the useful feedback I could ask for.
This is an apology for the tone and the framing of the above comment (and my following answers), which have both been needlessly aggressive, status-focused and uncharitable. Underneath are still issues that matter a lot to me, but others have discussed them better (I’ll provide a list of linked comments at the end of this one).
Thanks to Richard Ngo for convincing me that I actually needed to write such an apology, which was probably the needed push for me to stop weaseling around it.
So what did I do wrong? The list is pretty damning:
I took something about the original post that I didn’t understand (EY’s “And then there is, so far as I can tell, a vast desert full of work that seems to me to be mostly fake or pointless or predictable.”) and, because it didn’t make sense to me, and because it fitted with my stereotypes of MIRI and EY’s dismissiveness of a lot of work in alignment, I turned to an explanation of this as an attack on alignment researchers, saying they were consciously faking it when they knew they should do better. Whereas I now feel that what EY meant is far closer to: alignment research at the moment is trying to try to align AI as best as we can, instead of just trying to do it. I’m still not sure if I agree with that characterization, but that sounds far more like something that can be discussed.
There’s also a weird aspect of status-criticism to my comment that I think I completely failed to explain. Looking at my motives now (let’s be wary of hindsight...), I feel like my issue with the status things was more that a bunch of people other than EY and MIRI just take what they say as super strong evidence without looking at all the arguments and details, and thus I expected this post and recent MIRI publications to create a background of “we’re doomed” for a lot of casual observers, with the force of the status of EY and MIRI.
But I don’t want to say that EY and MIRI are given too much status in general in the community, even if I actually wrote something along those lines. I guess it’s just easier to focus your criticism on the beacon of status than on the invisible crowd misusing status. Sorry about that.
I somehow turned that into an attack on MIRI’s research (at least a chunk of it), which didn’t really have anything to do with it. That probably was just the manifestation of my frustration when people come to the field and feel like they shouldn’t do the experimental research they feel better suited for, or feel like they need to learn a lot of advanced maths. Even if those are not official MIRI positions, I definitely feel MIRI has had a big influence on them. And yet, maybe newcomers should question themselves that way. It always sounded like a loss of potential to me, because the outcome is often to not do alignment; but maybe even if you’re into experiments, the best way you could align AIs now doesn’t go through that path (and you could still find that exciting enough to do new research).
Whatever the correct answer is, my weird ad hominem attack has nothing to do with it, so I apologize for attacking all of MIRI’s research and their research agenda choices with it (even if I think talking more about what is and was the right choice still matters).
Part of my failure here has also been to not check for the fact that aggressive writing just feels snappier without much effort. I still think my paragraph starting with “When I’m not frustrated by this situation, I’m just sad.” works pretty well as an independent piece of writing, but it’s obviously needlessly aggressive and spicy, and doesn’t leave any room for the doubt that I actually felt or the doubts I should have felt. My answers after that comment are better, but still riding too much on that tone.
One of the saddest failures (pointed out to me by Richard) is that by my tone and my presentation, I made it harder and more aversive for MIRI and EY to share their models, because they have to fear a bit more that kind of reaction. And even if Rob reacted really nicely, I expect that required a bunch of additional mental energy that a better comment wouldn’t have asked for.
So I apologize for that, and really want more model-building and discussions from MIRI and EY publicly.
So in summary, my comment should have been something along the lines of “Hey, I don’t understand what your generators are for saying that all alignment research is ‘mostly fake or pointless or predictable’; could you give me some pointers to that?” I wasn’t in the headspace, and didn’t have the right handles, to frame it that way and not go off into weirdly aggressive tangents, and that’s on me.
On the plus side, every other comment on the thread has been high-quality and thoughtful, so here’s a list of the best ones IMO:
Ben Pace’s comment on what success stories for alignment would look like, giving examples.
Rob Bensinger’s comment about the directions of prosaic alignment I wrote I was excited about, and whether they’re “moving the dial”.
Rohin Shah’s comment which frames the outside view of MIRI I was pointing out better than I did and not aggressively.
John Wentworth’s two comments about the generators of EY’s pessimism being in the sequences all along.
Vaniver’s comment presenting an analysis of why some concrete ML work in alignment doesn’t seem to help for the AGI level.
Rob Bensinger’s comment drawing a great list of distinctions to clarify the debate.