Thanks for writing this up. While I don’t have much context on what specifically has gone well or badly for your team, I do feel pretty skeptical about the types of arguments you give at several points: in particular focusing on theories of change, having the most impact, comparative advantage, work paying off in 10 years, etc. I expect that this kind of reasoning itself steers people away from making important scientific contributions, which are often driven by open-ended curiosity and a drive to uncover deep truths.
(A provocative version of this claim: for the most important breakthroughs, it’s nearly impossible to identify a theory of change for them in advance. Imagine Newton or Darwin trying to predict how understanding mechanics/evolution would change the world. Now imagine them trying to do that before they had even invented the theory! And finally imagine if they only considered plans that they thought would work within 10 years, and the sense of scarcity and tension that would give rise to.)
The rest of my comment isn’t directly about this post, but close enough that this seems like a reasonable place to put it. EDIT: to be more clear: the rest of this comment is not primarily about Neel or “pragmatic interpretability”, it’s about parts of the field that I consider to be significantly less relevant to “solving alignment” than that (though work that’s nominally on pragmatic interpretability could also fall into the same failure modes). I clarify my position further in this comment; thanks Rohin for the pushback.
I get the sense that there was a “generation” of AI safety researchers who have ended up with a very marginalist mindset about AI safety. Some examples:
the evals that Beth Barnes (and maybe Dan Hendrycks?) are focusing on
the scenarios that Daniel Kokotajlo is focusing on
the models of misalignment that Evan Hubinger is focusing on
the forecasting that the OpenPhil worldview investigations team focused on
scary demos
safety cases
policy approaches like SB-1047
In other words, whole swathes of the field are not even aspiring to be the type of thing that could solve misalignment. In the terminology of this excellent post, they are all trying to attack a category I problem not a category II problem. Sometimes it feels like almost the entire field (EDIT: most of the field) is Goodharting on the subgoal of “write a really persuasive memo to send to politicians”. Pragmatic interpretability feels like another step in that direction (EDIT: but still significantly more principled than the things I listed above).
This is all related to something Buck recently wrote: “I spend most of my time thinking about relatively cheap interventions that AI companies could implement to reduce risk assuming a low budget, and about how to cause AI companies to marginally increase that budget”. I’m sure Buck has thought a lot about his strategy here, and I’m sure that you’ve thought a lot about your strategy as laid out in this post, and so on. But a part of me is sitting here thinking: man, everyone sure seems to have given up. (And yes, I know it doesn’t feel like giving up from the inside, but from my perspective that’s part of the problem.)
Now, a lot of the “old guard” seems to have given up too. But they at least know what they’ve given up on. There was an ideal of fundamental scientific progress that MIRI and Paul and a few others were striving towards; they knew at least what it would feel like (if not what it would look like) to actually make progress towards understanding intelligence. Eliezer and various others no longer think that’s plausible. I disagree. But aside from the object-level disagreement, I really want people to be aware that this is a thing that’s at least possible in principle to aim for, lest the next generation of the AI safety community ends up giving up on it before they even know what they’ve given up on.
(I’ll leave for another comment/post the question of what went wrong in my generation. The “types of arguments” I objected to above all seem quite EA-flavored, and so one salient possibility is just that the increasing prominence of EA steered my generation away from the type of mentality in which it’s even possible to aim towards scientific breakthroughs. But even if that’s one part of the story, I expect it’s more complicated than that.)
I wish when you wrote these comments you acknowledged that some people just actually think that we can substantially reduce risk via what you call “marginalist” approaches. Not everyone agrees that you have to deeply understand intelligence from first principles else everyone dies. (EDIT: See Richard’s clarification downthread.) Depending on how you choose your reference class, I’d guess most people disagree with that.
Imo the vast, vast majority of progress in the world happens via “marginalist” approaches, so if you do think you can win via “marginalist” approaches you should generally bias towards them.
Yeah, that’s basically my take—I don’t expect anything to “solve” alignment, but I think we can achieve major risk reductions by marginalist approaches. Maybe we can also achieve even more major risk reductions with massive paradigm shifts, or maybe we just waste a ton of time, I don’t know.
It’s worth disambiguating two critiques in Richard’s comment:
1) the AI safety community doesn’t try to fundamentally understand intelligence
2) the AI safety community doesn’t try to solve alignment for smarter than human AI systems
Tbc, they are somewhat related (i.e. people trying to fundamentally understand intelligence tend to think about alignment more) but clearly distinct. The “mainstream” AI safety crowd (myself included) is much more sympathetic to 2 than 1 (indeed Neel has said as much).
There’s something to the idea that “marginal progress doesn’t feel like marginal progress from the inside”. Like, even if no one breakthrough or discovery “solves alignment”, a general frame of “let’s find principled approaches” is often more generative than “let’s find the cheapest 80/20 approach” (both can be useful, and historically the safety community has probably leaned too far towards principled, but maybe the current generation is leaning too far the other way).
2) the AI safety community doesn’t try to solve alignment for smarter than human AI systems
I assume you’re referring to “whole swathes of the field are not even aspiring to be the type of thing that could solve misalignment”.
Imo, chain of thought monitoring, AI control, amplified oversight, MONA, reasoning model interpretability, etc, are all things that could make the difference between “x-catastrophe via misalignment” and “no x-catastrophe via misalignment”, so I’d say that lots of our work could “solve misalignment”, though not necessarily in a way where we can know that we’ve solved misalignment in advance.
Based on Richard’s previous writing (e.g. 1, 2) I expect he sees this sort of stuff as not particularly interesting alignment research / doesn’t really help, so I jumped ahead in the conversation to that disagreement.
even if no one breakthrough or discovery “solves alignment”, a general frame of “let’s find principled approaches” is often more generative than “let’s find the cheapest 80/20 approach”
Sure, I broadly agree with this, and I think Neel would too. I don’t see Neel’s post as disagreeing with it, and I don’t think the list of examples that Richard gave is well described as “let’s find the cheapest 80⁄20 approach”.
I think me using the word “marginalist” was probably a mistake, because it conflates two distinct things that I’m skeptical about:
People no longer trying to make models more aligned (but e.g. trying to do work that primarily cashes out in political outcomes). This is what I mean by “not even aspiring to be the type of thing that could solve alignment”.
People using engineering-type approaches (rather than science-type approaches) to try to make models more aligned.
The list I gave above was of things that fall into category 1, whereas (almost?) all of the things you named fall into category 2. What I want more of is category 3: science-type approaches. One indicator that something is a science-type approach is that it could potentially help us understand something fundamental about intelligence; another is that, if it works, we’ll know in advance (I used to not care about this, but have changed my mind).
I think there are versions of most of the things you named that could be in category 3, but people mostly seem to be doing category-2 versions of them, in significant part because of the sort of EA-style reasoning that I was criticizing from Neel’s original post.
When I wrote “pragmatic interpretability feels like another step in that direction” I meant something like: ambitious interpretability was trying to do 3, and pragmatic interpretability seems like it’s nominally trying to do 2, and may in practice end up being mostly 1. For example, “Stop models acting differently when tested” could be a part of an engineering-type pipeline for fixing misalignments in models, but could also end up drifting towards “help us get better evidence to convince politicians and lab leaders of things”. However, I’m not claiming that pragmatic interpretability is a central example of “not even aspiring to be the type of thing that could solve alignment”. Apologies for the bad phrasings.
Makes sense, I still endorse my original comment in light of this answer (as I already expected something like this was your view). Like, I would now say
Imo the vast, vast majority of progress in the world happens via “engineering-type / category 2” approaches, so if you do think you can win via “engineering-type / category 2” approaches you should generally bias towards them.
while also noting that the way we are using the phrase “engineering-type” here includes a really large amount of what most people would call “science” (e.g. it includes tons of academic work), so it is important when evaluating this claim to interpret the words “engineering” and “science” in context rather than via their usual connotations.
Yepp, makes sense, and it’s a good reminder for me to be careful about how I use these terms.
One clarification I’d make to your original comment though is that I don’t endorse “you have to deeply understand intelligence from first principles else everyone dies”. My position is closer to “you have to be trying to do something principled in order for your contribution to be robustly positive”. Relatedly, agent foundations and mech-interp are approximately the only two parts of AI safety that seem robustly good to me—with a bunch of other stuff like RLHF, or evals, or (almost all) governance work, I feel pretty confused about whether they’re good or bad or basically just wash out even in expectation.
This is still consistent with risk potentially being reduced by what I call engineering-type work, it’s just that IMO that involves us “getting lucky” in an important way which I prefer we not rely on. (And trying to get lucky isn’t a neutral action—engineering-type work can also easily have harmful effects.)
Fair, I’ve edited the comment with a pointer. It still seems to me to be a pretty direct disagreement with “we can substantially reduce risk via [engineering-type / category 2] approaches”.
My claim is “while it certainly could be net negative (as is also the case for ~any action including e.g. donating to AMF), in aggregate it is substantially positive expected risk reduction”.
Your claim in opposition seems to be “who knows what the sign is, we should treat it as an expected zero risk reduction”.
Though possibly you are saying “it’s bad to take actions that have a chance of backfiring, we should focus much more on robustly positive things” (because something something virtue ethics?), in which case I think we have a disagreement on decision theory instead.
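As a toy sketch of that decision-theory difference (all numbers here are hypothetical, purely to illustrate the two rules): the same intervention can be accepted by an expected-value rule and rejected by a “robustly positive only” rule.

```python
# Toy model of a single intervention (all numbers hypothetical):
# with probability 0.8 it reduces x-risk by 2 percentage points,
# with probability 0.2 it backfires and adds 1 percentage point.
p_helps, reduction = 0.8, 2.0
p_backfires, increase = 0.2, 1.0

expected_reduction = p_helps * reduction - p_backfires * increase
print(f"Expected risk reduction: {expected_reduction:+.2f} points")  # +1.40

# Expected-value rule: take any action whose expected risk reduction is positive.
take_under_ev_rule = expected_reduction > 0

# "Robustly positive" rule: only take actions with ~no chance of backfiring.
take_under_robust_rule = p_backfires <= 0.01

print(f"Expected-value rule says take it: {take_under_ev_rule}")    # True
print(f"Robustness rule says take it:     {take_under_robust_rule}")  # False
```

With these made-up numbers the expected-value rule takes the intervention (+1.4 points in expectation), while the robustness rule rejects it because of the 20% chance of backfiring; that is the sense in which the disagreement would be about decision theory rather than about the facts.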
I still want to claim that in either case, my position is much more common (among the readership here), except inasmuch as they disagree because they think alignment is very hard and that’s why there’s expected zero (or negative) risk reduction. And so I wish you’d flag when your claims depend on these takes (though I realize it is often hard to notice when that is the case).
I expect it’s not worth our time to dig too deep into whose position is more common here. But I think that a lot of people on LW have high P(doom) in significant part because they share my intuition that marginalist approaches don’t reliably work. I do agree that my combination of “marginalist approaches don’t reliably improve things” and “P(doom) is <50%” is a rare one, but I was only making the former point above (and people upvoted it accordingly), so it feels a bit misleading to focus on the rareness of the overall position.
(Interestingly, while the combination I describe above is a rare one, the converse is also rare—Daniel Kokotajlo is the only person who comes to mind who disagrees with me on both of these propositions simultaneously. Note that he doesn’t characterize his current work as marginalist, but even aside from that question I think this characterization of him is accurate—e.g. he has talked to me about how changing the CEO of a given AI lab could swing his P(doom) by double digit percentage points.)
On reflection, it’s not actually about which position is more common. My real objection is that imo it was pretty obvious that something along these lines would be the crux between you and Neel (and the fact that it is a common position is part of why I think it was obvious).
Inasmuch as you are actually trying to have a conversation with Neel or address Neel’s argument on its merits, it would be good to be clear that this is the crux. I guess perhaps you might just not care about that and are instead trying to influence readers without engaging with the OP’s point of view, in which case fair enough. Personally I would find that distasteful / not in keeping with my norms around collective-epistemics but I do admit it’s within LW norms.
(Incidentally, I feel like you still aren’t quite pinning down your position—depending on what you mean by “reliably” I would probably agree with “marginalist approaches don’t reliably improve things”. I’d also agree with “X doesn’t reliably improve things” for almost any interesting value of X.)
Inasmuch as you are actually trying to have a conversation with Neel or address Neel’s argument on its merits, it would be good to be clear that this is the crux.
The first two paragraphs of my original comment were trying to do this. The rest wasn’t. I flagged this in the sentence “The rest of my comment isn’t directly about this post, but close enough that this seems like a reasonable place to put it.” However, I should have been clearer about the distinction. I’ve now added the following:
EDIT: to be more clear: the rest of this comment is not primarily about Neel or “pragmatic interpretability”, it’s about parts of the field that I consider to be significantly less relevant to “solving alignment” than that (though work that’s nominally on pragmatic interpretability could also fall into the same failure modes). I clarify my position further in this comment; thanks Rohin for the pushback.
Reflecting further, I think there are two parts of our earlier exchange that are a bit suspicious. The first is when I say that everyone seems to have “given up” (rather than something more nuanced like “given up on tackling the most fundamental aspects of the problem”). The second is where you summarize my position as being that we need deep scientific understanding or else everyone dies (which I think you can predict is a pretty unlikely position for me in particular to hold).
So what’s going on here? It feels like we’re both being “anchored” by extreme positions. You were rounding me off to doomerism, and I was rounding the marginalists off to “giving up”. Both I’d guess are artifacts of writing quickly and a bit frustratedly. Probably I should write a full post or shortform that characterizes more precisely what “giving up” is trying to point to.
(Incidentally, I feel like you still aren’t quite pinning down your position—depending on what you mean by “reliably” I would probably agree with “marginalist approaches don’t reliably improve things”. I’d also agree with “X doesn’t reliably improve things” for almost any interesting value of X.)
My instinctive reaction is that this depends a lot on whether by “marginalist approaches” we mean something closer to “a single marginalist approach” or “the set of all people pursuing marginalist approaches”. I think we both agree that no single marginalist approach (e.g. investigating a given technique) makes reliable progress. However, I’d guess that I’m more willing than you to point to a broad swathe of people pursuing marginalist approaches and claim that they won’t reliably improve things.
The first two paragraphs of my original comment were trying to do this.
(I have the same critique of the first two paragraphs, but thanks for the edit, it helps)
The second is where you summarize my position as being that we need deep scientific understanding or else everyone dies (which I think you can predict is a pretty unlikely position for me in particular to hold).
Fwiw, I am actively surprised that you have a p(doom) < 50%; I can name several lines of evidence in the opposite direction:
You’ve previously tried to define alignment based on worst-case focus and scientific approach. This suggests you believe that “marginalist” / “engineering” approaches are ~useless, from which I inferred (incorrectly) that you would have a high p(doom).
I still find the conjunction of the two positions you hold pretty weird.
I’m a strong believer in logistic success curves for complex situations (see the numerical sketch at the end of this comment). If you’re in the middle part of a logistic success curve in a complex situation, then there should be many things that can be done to improve the situation, and it seems like “engineering” approaches should work.
It’s certainly possible to have situations that prevent this. Maybe you have a bimodal distribution, e.g. 70% on “near-guaranteed fine by default” and 30% on “near-guaranteed doom by default”. Maybe you think that people have approximately zero ability to tell which things are improvements. Maybe you think we are at the far end of the logistic success curve today, but timelines are long and we’ll do the necessary science in time. But these views seem kinda exotic and unlikely to be someone’s actual views. (Idk maybe you do believe the second one.)
Obviously I had not thought through this in detail when I originally wrote my comment, and my wordless inference was overconfident in hindsight. But I stand by my overall sense that a person who thinks “engineering” approaches are near-useless will likely also have high p(doom) -- not just as a sociological observation, but also as a claim about which positions are consistent with each other.
In your writing you sometimes seem to take as a background assumption that alignment will be very hard. For example, I recall you critiquing assistance games because (my paraphrase) “that’s not what progress on a hard problem looks like”. (I failed to dig up the citation though.)
You’re generally taking a strategy that appears to me to be high variance, which people usually justify via high p(doom) / playing to your outs.
A lot of your writing is similarly flavored to other people who have high p(doom).
In terms of evidence that you have a p(doom) < 50%, I think the main thing that comes to mind is that you argued against Eliezer about this in late 2021, but that was quite a while ago (relative to the evidence above) and I thought you had changed your mind. (Also iirc the stuff you said then was consistent with p(doom) ~ 50%, but it’s long enough ago that I could easily be forgetting things.)
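Here is the minimal numerical sketch of the logistic-success-curve intuition mentioned above (all numbers hypothetical, not anyone’s actual model): a fixed-size improvement, modeled as a fixed shift in log-odds, buys a lot of success probability near the middle of the curve and almost nothing at the extremes.

```python
import math

def p_success(log_odds: float) -> float:
    """Probability of a good outcome given its log-odds."""
    return 1 / (1 + math.exp(-log_odds))

# A hypothetical intervention worth a fixed +0.5 shift in log-odds,
# applied at different starting points on the logistic success curve.
shift = 0.5
for baseline in [-4.0, -1.0, 0.0, 1.0, 4.0]:
    before = p_success(baseline)
    after = p_success(baseline + shift)
    print(f"baseline log-odds {baseline:+.1f}: "
          f"P(success) {before:.3f} -> {after:.3f} (gain {after - before:+.3f})")
```

With these made-up numbers, the same +0.5 shift adds roughly 12 percentage points of success probability near the middle of the curve and only about 1 point at either extreme, which is why believing we’re mid-curve goes naturally with expecting many marginal interventions to help.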
However, I’d guess that I’m more willing than you to point to a broad swathe of people pursuing marginalist approaches and claim that they won’t reliably improve things.
You could point to ~any reasonable subcommunity within AI safety (or the entire community) and I’d still be on board with the claim that there’s at least a 10% chance that will make things worse, which I might summarize as “they won’t reliably improve things”, so I still feel like this isn’t quite capturing the distinction. (I’d include communities focused on “science” in that, but I do agree that they are more likely not to have a negative sign.) So I still feel confused about what exactly your position is.
This is all related to something Buck recently wrote: “I spend most of my time thinking about relatively cheap interventions that AI companies could implement to reduce risk assuming a low budget, and about how to cause AI companies to marginally increase that budget”. I’m sure Buck has thought a lot about his strategy here, and I’m sure that you’ve thought a lot about your strategy as laid out in this post, and so on. But a part of me is sitting here thinking: man, everyone sure seems to have given up. (And yes, I know it doesn’t feel like giving up from the inside, but from my perspective that’s part of the problem.)
Thanks for pointing this out.
I’ve been thinking lately about how much folks around here more or less dismiss the idea of an AI pause as unrealistic because we’re not going to get that much political buy-in.
I (speculatively) think that this is a bit trapped in a mindset that is assuming the conclusion. Big political changes like that one have happened in the past, and they have often seemed impossible before they happened and inevitable in retrospect. And, when something big like that changes, part of the process is a cascade, where whole deferral structures change their mind / attitude / preferences about something. How much buy-in you have before that cascade happens may not be very indicative of where that cascade can end up.
I, personally, don’t feel like I know how to “call it” when big changes are on the table or when they’re not. But it sure does seem like people are counting us out much too early, given the fundamentals of the situation. We all think that the world is going to change very radically in the next few years. It’s not clear what kinds of cascades are on the table.
I provisionally think that we should feel less bashful about advocating for an AI pause, and more agnostic about how likely that is to come to pass.
I agree with you, but also think you’re not going far enough. In a world where things are changing radically, the space of possibilities opens up dramatically. And so it’s less a question of “does advocating for policy X become viable?”, and more a question of “how can we design the kinds of policies that our past selves wouldn’t even have been able to conceive of?”
In other words, in a world that’s changing a lot, you want to avoid privileging your hypotheses in advance, which is what it feels like the “pro AI pause vs anti AI pause” debate is doing.
(And yes, in some sense those radical future policies might fall into a broad category like “AI pause”. But that doesn’t mean that our current conception of “AI pause” is a very useful guide for how to make those future policies come about.)
Whoa, you think the scenarios I’m focusing on are marginalist? I didn’t expect you to say that. I generally think of what we are doing as (a) forecasting and (b) making ambitious solve-approximately-all-the-problems plans to present to the world. Forecasting isn’t marginalist, it’s a type error to think so, and as for our plans, well, they seem pretty ambitious to me.
I regret using the word “marginalist”, it’s a bit too confusing. But I do have a pretty high bar for what counts as “ambitious” in the political domain—it involves not just getting the system to do something, but rather trying to change the system itself. Cummings and Thiel are central examples (Geoff Anders maybe also was aiming in that direction at one point).
(I’ll leave for another comment/post the question of what went wrong in my generation. The “types of arguments” I objected to above all seem quite EA-flavored, and so one salient possibility is just that the increasing prominence of EA steered my generation away from the type of mentality in which it’s even possible to aim towards scientific breakthroughs. But even if that’s one part of the story, I expect it’s more complicated than that.)
I’m reminded of Patrick Collison’s (I now think, quite wise) comment on EA:
Now if the question is, should everyone be an EA or even, I guess in the individual sense, am I or do I think I should be an EA? I think – and obviously there’s kind of heterogeneity within the field – but my general sense is that the EA movement is always very focused on kind of rigid, not rigid, that’s unfair perhaps, but on sort of, estimation, analytical, quantification, and sort of utilitarian calculation, and I think that as a practical matter that means that you end up too focused on that which you can measure, which again means – or as a practical matter means – you’re too focused on things that are sort of short-term like bed nets or deworming or whatever being obvious examples. And are those good causes? I would say almost definitely yes, obviously. Now we’ve seen some new data over the last couple of years that maybe they’re not as good as they initially seemed but they’re very likely to be really good things to do.
But it’s hard for me to see how, you know, writing a treatise of human nature would score really highly in an EA oriented framework. As assessed ex-post that looked like a really valuable thing for Hume to do. And similarly, as we have a look at the things that in hindsight seem like very good things to have happen in the world, it’s often unclear to me how an EA oriented intuition might have caused somebody to do so. And so I guess I think of EA as sort of like a metal detector, and they’ve invented a new kind of metal detector that’s really good at detecting some metals that other detectors are not very good at detecting. But I actually think we need some diversity in the different metallic substances which our detectors are attuned to, and for me EA would not be the only one.
I expect that this kind of reasoning itself steers people away from making important scientific contributions, which are often driven by open-ended curiosity and a drive to uncover deep truths.
I agree with this statement denotatively, and my own interests/work have generally been “driven by open-ended curiosity and a drive to uncover deep truths”, but isn’t this kind of motivation also what got humanity into its current mess? In other words, wasn’t the main driver of AI progress this kind of curiosity (until perhaps the recent few years when it has been driven more by commercial/monetary/power incentives)?
I would hesitate to encourage more people to follow their own curiosity, for this reason, even people who are already in AI safety research, due to the consideration of illegible safety problems, which can turn their efforts net-negative if they’re being insufficiently strategic (which seems hard to do while also being driven mainly by curiosity).
I think I’ve personally been lucky, or skilled in some way that I don’t understand, in that my own curiosity has perhaps been more aligned with what’s good than most people’s, but even some of my interests, e.g. in early cryptocurrency, might have been net-negative.
I guess this is related to our earlier discussion about how important being virtuous is to good strategy/prioritization, and my general sense is that consistently good strategy requires a high amount of consequentialist reasoning, because the world is too complicated and changes too much and too frequently to rely on pre-computed shortcuts. It’s hard for me to understand how largely intuitive/nonverbal virtues/curiosity could be doing enough “compute” or “reasoning” to consistently output good strategy.
I agree with this statement denotatively, and my own interests/work have generally been “driven by open-ended curiosity and a drive to uncover deep truths”, but isn’t this kind of motivation also what got humanity into its current mess? In other words, wasn’t the main driver of AI progress this kind of curiosity (until perhaps the recent few years when it has been driven more by commercial/monetary/power incentives)?
Interestingly, I was just having a conversation with Critch about this. My contention was that, in the first few decades of the field, AI researchers were actually trying to understand cognition. The rise of deep learning (and especially the kind of deep learning driven by massive scaling) can be seen as the field putting that quest on hold in order to optimize for more legible metrics.
I don’t think you should find this a fully satisfactory answer, because it’s easy to “retrodict” ways that my theory was correct. But that’s true of all explanations of what makes the world good at a very abstract level, including your own answer of metaphilosophical competence. (Also, we can perhaps cash my claim out in predictions, like: was the criticism that deep learning didn’t actually provide good explanations of or insight into cognition a significant barrier to more researchers working on it? Without having looked it up, I suspect so.)
consistently good strategy requires a high amount of consequentialist reasoning
I don’t think that’s true. However I do think it requires deep curiosity about what good strategy is and how it works. It’s not a coincidence that my own research on a theory of coalitional agency was in significant part inspired by strategic failures of EA and AI safety (with this post being one of the earliest building blocks I laid down). I also suspect that the full theory of coalitional agency will in fact explain how to do metaphilosophy correctly, because doing good metaphilosophy is ultimately a cognitive process and can therefore be characterized by a sufficiently good theory of cognition.
Again, I don’t expect you to fully believe me. But what I most want to read from you right now is an in-depth account of which things in the world have gone or are going most right, and the ways in which you think metaphilosophical competence or consequentialist reasoning contributed to them. Without that, it’s hard to trust metaphilosophy or even know what it is (though I think you’ve given a sketch of this in a previous reply to me at some point).
I should also try to write up the same thing, but about how virtues contributed to good things. And maybe also science, insofar as I’m trying to defend doing more science (of cognition and intelligence) in order to help fix risks caused by previous scientific progress.
But what I most want to read from you right now is an in-depth account of which things in the world have gone or are going most right, and the ways in which you think metaphilosophical competence or consequentialist reasoning contributed to them.
(First a terminological note: I wouldn’t use the phrase “metaphilosophical competence”, and instead tend to talk about either “metaphilosophy”, meaning studying the nature of philosophy and philosophical reasoning, how should philosophical problems be solved, etc., or “philosophical competence”, meaning how good someone is at solving philosophical problems or doing philosophical reasoning. And sometimes I talk about them together, like in “metaphilosophy / AI philosophical competence” because I think solving metaphilosophy is the best way to improve AI philosophical competence. Here I’ll interpret you to just mean “philosophical competence”.)
To answer your question, it’s pretty hard to think of really good examples, I think because humans are very bad at both philosophical competence and consequentialist reasoning, but here are some:
the game theory around nuclear deterrence, helping to prevent large-scale war so far
economics and its influence on government policy, e.g., providing support for property rights, markets, and regulations around things like monopolies and externalities (but it’s failing pretty badly on AGI/ASI)
analytical philosophy making philosophical progress insofar as it asks important questions and delineates various plausible answers (but doing badly insofar as individual philosophers have inappropriate levels of confidence, and the field fails to focus on the really important problems, e.g., those related to AI safety)
certain philosophers / movements (rationalists, EA) emphasizing philosophical (especially moral) uncertainty to some extent, and realizing the importance of AI safety
MIRI updating on evidence/arguments and pivoting strategy in response (albeit too slowly)
I guess this isn’t an “in-depth account” but I’m also not sure why you’re asking for “in-depth”, i.e., why doesn’t a list like this suffice?
I should also try to write up the same thing, but about how virtues contributed to good things.
I think non-consequentialist reasoning or ethics probably worked better in the past, when the world changed more slowly and we had more chances to learn from our mistakes (and refine our virtues/deontology over time), so I wouldn’t necessarily find this kind of writing very persuasive, unless it somehow addressed my central concern that virtues do not seem to be a kind of thing that is capable of doing enough “compute/reasoning” to find consistently good strategies in a fast-changing environment on the first try.
To answer your question, it’s pretty hard to think of really good examples, I think because humans are very bad at both philosophical competence and consequentialist reasoning, but here are some:
If this is true, then it should significantly update us away from the strategy “solve our current problems by becoming more philosophically competent and doing good consequentialist reasoning”, right? If you are very bad at X, then all else equal you should try to solve problems using strategies that don’t require you to do much X.
You might respond that there are no viable strategies for solving our current problems without applying a lot of philosophical competence and consequentialist reasoning. I think scientific competence and virtue ethics are plausibly viable alternative strategies (though the line between scientific and philosophical competence seems blurry to me, as I discuss below). But even given that we disagree on that, humanity solved many big problems in the past without using much philosophical competence and consequentialist reasoning, so it seems hard to be confident that we won’t solve our current problems in other ways.
Out of your examples, the influence of economics seems most solid to me. I feel confused about whether game theory itself made nuclear war more or less likely—e.g. von Neumann was very aggressive, perhaps related to his game theory work, and maybe MAD provided an excuse to stockpile weapons? Also the Soviets didn’t really have the game theory IIRC.
On the analytical philosophy front, the clearest wins seem to be cases where they transitioned from doing philosophy to doing science or math—e.g. the formalization of probability (and economics to some extent too). If this is the kind of thing you’re pointing at, then I’m very much on board—that’s what I think we should be doing for ethics and intelligence. Is it?
Re the AI safety stuff: it all feels a bit too early to say what its effects on the world have been (though on net I’m probably happy it has happened).
I guess this isn’t an “in-depth account” but I’m also not sure why you’re asking for “in-depth”, i.e., why doesn’t a list like this suffice?
Because I have various objections to this list (some of which are detailed above) and with such a succinct list it’s hard to know which aspects of them you’re defending, which arguments for their positive effects you find most compelling, etc.
Thanks for writing this up. While I don’t have much context on what specifically has gone well or badly for your team, I do feel pretty skeptical about the types of arguments you give at several points: in particular focusing on theories of change, having the most impact, comparative advantage, work paying off in 10 years, etc. I expect that this kind of reasoning itself steers people away from making important scientific contributions, which are often driven by open-ended curiosity and a drive to uncover deep truths.
(A provocative version of this claim: for the most important breakthroughs, it’s nearly impossible to identify a theory of change for them in advance. Imagine Newton or Darwin trying to predict how understanding mechanics/evolution would change the world. Now imagine them trying to do that before they had even invented the theory! And finally imagine if they only considered plans that they thought would work within 10 years, and the sense of scarcity and tension that would give rise to.)
The rest of my comment isn’t directly about this post, but close enough that this seems like a reasonable place to put it. EDIT: to be more clear: the rest of this comment is not primarily about Neel or “pragmatic interpretability”, it’s about parts of the field that I consider to be significantly less relevant to “solving alignment” than that (though work that’s nominally on pragmatic interpretability could also fall into the same failure modes). I clarify my position further in this comment; thanks Rohin for the pushback.
I get the sense that there was a “generation” of AI safety researchers who have ended up with a very marginalist mindset about AI safety. Some examples:
the evals that Beth Barnes (and maybe Dan Hendrycks?) are focusing on
the scenarios that Daniel Kokotajlo is focusing on
the models of misalignment that Evan Hubinger is focusing on
the forecasting that the OpenPhil worldview investigations team focused on
scary demos
safety cases
policy approaches like SB-1047
In other words, whole swathes of the field are not even aspiring to be the type of thing that could solve misalignment. In the terminology of this excellent post, they are all trying to attack a category I problem not a category II problem. Sometimes it feels like
almost the entire fieldEDIT: most of the field is Goodharting on the subgoal of “write a really persuasive memo to send to politicians”. Pragmatic interpretability feels like another step in that direction (EDIT: but still significantly more principled than the things I listed above).This is all related to something Buck recently wrote: “I spend most of my time thinking about relatively cheap interventions that AI companies could implement to reduce risk assuming a low budget, and about how to cause AI companies to marginally increase that budget”. I’m sure Buck has thought a lot about his strategy here, and I’m sure that you’ve thought a lot about your strategy as laid out in this post, and so on. But a part of me is sitting here thinking: man, everyone sure seems to have given up. (And yes, I know it doesn’t feel like giving up from the inside, but from my perspective that’s part of the problem.)
Now, a lot of the “old guard” seems to have given up too. But they at least know what they’ve given up on. There was an ideal of fundamental scientific progress that MIRI and Paul and a few others were striving towards; they knew at least what it would feel like (if not what it would look like) to actually make progress towards understanding intelligence. Eliezer and various others no longer think that’s plausible. I disagree. But aside from the object-level disagreement, I really want people to be aware that this is a thing that’s at least possible in principle to aim for, lest the next generation of the AI safety community ends up giving up on it before they even know what they’ve given up on.
(I’ll leave for another comment/post the question of what went wrong in my generation. The “types of arguments” I objected to above all seem quite EA-flavored, and so one salient possibility is just that the increasing prominence of EA steered my generation away from the type of mentality in which it’s even possible to aim towards scientific breakthroughs. But even if that’s one part of the story, I expect it’s more complicated than that.)
I wish when you wrote these comments you acknowledged that some people just actually think that we can substantially reduce risk via what you call “marginalist” approaches. Not everyone agrees that you have to deeply understand intelligence from first principles else everyone dies. (EDIT: See Richard’s clarification downthread.) Depending on how you choose your reference class, I’d guess most people disagree with that.
Imo the vast, vast majority of progress in the world happens via “marginalist” approaches, so if you do think you can win via “marginalist” approaches you should generally bias towards them.
Yeah, that’s basically my take—I don’t expect anything to “solve” alignment, but I think we can achieve major risk reductions by marginalist approaches. Maybe we can also achieve even more major risk reductions with massive paradigm shifts, or maybe we just waste a ton of time, I don’t know.
Its worth disambiguating two critiques in Richards comment:
1) the AI safety community doesn’t try to fundamentally understand intelligence
2) the AI safety community doesn’t try to solve alignment for smarter than human AI systems
Tbc, they are somewhat related (i.e. people trying to fundamentally understand intelligence tend to think about alignment more) but clearly distinct. The “mainstream” AI safety crowd (myself included) is much more sympathetic to 2 than 1 (indeed Neel has said as much).
There’s something to the idea that “marginal progress doesn’t fee like marginal progress from the inside”. Like, even if no one breakthrough or discovery “solves alignment”, a general frame of “lets find principled approaches” is often more generative than “let’s find the cheapest 80⁄20 approach” (both can be useful, and historically the safety community has probably leaned too far towards principled, but maybe the current generation is leaning too far the other way)
I assume you’re referring to “whole swathes of the field are not even aspiring to be the type of thing that could solve misalignment”.
Imo, chain of thought monitoring, AI control, amplified oversight, MONA, reasoning model interpretability, etc, are all things that could make the difference between “x-catastrophe via misalignment” and “no x-catastrophe via misalignment”, so I’d say that lots of our work could “solve misalignment”, though not necessarily in a way where we can know that we’ve solved misalignment in advance.
Based on Richard’s previous writing (e.g. 1, 2) I expect he sees this sort of stuff as not particularly interesting alignment research / doesn’t really help, so I jumped ahead in the conversation to that disagreement.
Sure, I broadly agree with this, and I think Neel would too. I don’t see Neel’s post as disagreeing with it, and I don’t think the list of examples that Richard gave is well described as “let’s find the cheapest 80⁄20 approach”.
I think me using the word “marginalist” was probably a mistake, because it conflates two distinct things that I’m skeptical about:
People no longer trying to make models more aligned (but e.g. trying to do work that primarily cashes out in political outcomes). This is what I mean by “not even aspiring to be the type of thing that could solve alignment”.
People using engineering-type approaches (rather than science-type approaches) to try to make models more aligned.
The list I gave above was of things that fall into category 1, whereas (almost?) all of the things you named fall into category 2. What I want more of is category 3: science-type approaches. One indicator that something is a science-type approach is that it could potentially help us understand something fundamental about intelligence; another is that, if it works, we’ll know in advance (I used to not care about this, but have changed my mind).
I think there are versions of most of the things you named that could be in category 3, but people mostly seem to be doing category-2 versions of them, in significant part because of the sort of EA-style reasoning that I was criticizing from Neel’s original post.
When I wrote “pragmatic interpretability feels like another step in that direction” I meant something like: ambitious interpretability was trying to do 3, and pragmatic interpretability seems like it’s nominally trying to do 2, and may in practice end up being mostly 1. For example, “Stop models acting differently when tested” could be a part of an engineering-type pipeline for fixing misalignments in models, but could also end up drifting towards “help us get better evidence to convince politicians and lab leaders of things”. However, I’m not claiming that pragmatic interpretability is a central example of “not even aspiring to be the type of thing that could solve alignment”. Apologies for the bad phrasings.
Makes sense, I still endorse my original comment in light of this answer (as I already expected something like this was your view). Like, I would now say
while also noting that the way we are using the phrase “engineering-type” here includes a really large amount of what most people would call “science” (e.g. it includes tons of academic work), so it is important when evaluating this claim to interpret the words “engineering” and “science” in context rather than via their usual connotations.
Yepp, makes sense, and it’s a good reminder for me to be careful about how I use these terms.
One clarification I’d make to your original comment though is that I don’t endorse “you have to deeply understand intelligence from first principles else everyone dies”. My position is closer to “you have to be trying to do something principled in order for your contribution to be robustly positive”. Relatedly, agent foundations and mech-interp are approximately the only two parts of AI safety that seem robustly good to me—with a bunch of other stuff like RLHF, or evals, or (almost all) governance work, I feel pretty confused about whether they’re good or bad or basically just wash out even in expectation.
This is still consistent with risk potentially being reduced by what I call engineering-type work, it’s just that IMO that involves us “getting lucky” in an important way which I prefer we not rely on. (And trying to get lucky isn’t a neutral action—engineering-type work can also easily have harmful effects.)
Fair, I’ve edited the comment with a pointer. It still seems to me to be a pretty direct disagreement with “we can substantially reduce risk via [engineering-type / category 2] approaches”.
My claim is “while it certainly could be net negative (as is also the case for ~any action including e.g. donating to AMF), in aggregate it is substantially positive expected risk reduction”.
Your claim in opposition seems to be “who knows what the sign is, we should treat it as an expected zero risk reduction”.
Though possibly you are saying “it’s bad to take actions that have a chance of backfiring, we should focus much more on robustly positive things” (because something something virtue ethics?), in which case I think we have a disagreement on decision theory instead.
I still want to claim that in either case, my position is much more common (among the readership here), except inasmuch as they disagree because they think alignment is very hard and that’s why there’s expected zero (or negative) risk reduction. And so I wish you’d flag when your claims depend on these takes (though I realize it is often hard to notice when that is the case).
I expect it’s not worth our time to dig too deep into whose position is more common here. But I think that a lot of people on LW have high P(doom) in significant part because they share my intuition that marginalist approaches don’t reliably work. I do agree that my combination of “marginalist approaches don’t reliably improve things” and “P(doom) is <50%” is a rare one, but I was only making the former point above (and people upvoted it accordingly), so it feels a bit misleading to focus on the rareness of the overall position.
(Interestingly, while the combination I describe above is a rare one, the converse is also rare—Daniel Kokotajlo is the only person who comes to mind who disagrees with me on both of these propositions simultaneously. Note that he doesn’t characterize his current work as marginalist, but even aside from that question I think this characterization of him is accurate—e.g. he has talked to me about how changing the CEO of a given AI lab could swing his P(doom) by double digit percentage points.)
On reflection, it’s not actually about which position is more common. My real objection is that imo it was pretty obvious that something along these lines would be the crux between you and Neel (and the fact that it is a common position is part of why I think it was obvious).
Inasmuch as you are actually trying to have a conversation with Neel or address Neel’s argument on its merits, it would be good to be clear that this is the crux. I guess perhaps you might just not care about that and are instead trying to influence readers without engaging with the OP’s point of view, in which case fair enough. Personally I would find that distasteful / not in keeping with my norms around collective-epistemics but I do admit it’s within LW norms.
(Incidentally, I feel like you still aren’t quite pinning down your position—depending on what you mean by “reliably” I would probably agree with “marginalist approaches don’t reliably improve things”. I’d also agree with “X doesn’t reliably improve things” for almost any interesting value of X.)
The first two paragraphs of my original comment were trying to do this. The rest wasn’t. I flagged this in the sentence “The rest of my comment isn’t directly about this post, but close enough that this seems like a reasonable place to put it.” However, I should have been clearer about the distinction. I’ve now added the following:
Reflecting further, I think there are two parts of our earlier exchange that are a bit suspicious. The first is when I say that everyone seems to have “given up” (rather than something more nuanced like “given up on tackling the most fundamental aspects of the problem”). The second is where you summarize my position as being that we need deep scientific understanding or else everyone dies (which I think you can predict is a pretty unlikely position for me in particular to hold).
So what’s going on here? It feels like we’re both being “anchored” by extreme positions. You were rounding me off to doomerism, and I was rounding the marginalists off to “giving up”. Both I’d guess are artifacts of writing quickly and a bit frustratedly. Probably I should write a full post or shortform that characterizes more precisely what “giving up” is trying to point to.
My instinctive reaction is that this depends a lot on whether by “marginalist approaches” we mean something closer to “a single marginalist approach” or “the set of all people pursuing marginalist approaches”. I think we both agree that no single marginalist approach (e.g. investigating a given technique) makes reliable progress. However, I’d guess that I’m more willing than you to point to a broad swathe of people pursuing marginalist approaches and claim that they won’t reliably improve things.
(I have the same critique of the first two paragraphs, but thanks for the edit, it helps)
Fwiw, I am actively surprised that you have a p(doom) < 50%, I can name several lines of evidence in the opposite direction:
You’ve previously tried to define alignment based on worst-case focus and scientific approach. This suggests you believe that “marginalist” / “engineering” approaches are ~useless, from which I inferred (incorrectly) that you would have a high p(doom).
I still find the conjunction of the two positions you hold pretty weird.
I’m a strong believer in logistic success curves for complex situations. If you’re in the middle part of a logistic success curve in a complex situation, then there should be many things that can be done to improve the situation, and it seems like “engineering” approaches should work.
It’s certainly possible to have situations that prevent this. Maybe you have a bimodal distribution, e.g. 70% on “near-guaranteed fine by default” and 30% on “near-guaranteed doom by default”. Maybe you think that people have approximately zero ability to tell which things are improvements. Maybe you think we are at the far end of the logistic success curve today, but timelines are long and we’ll do the necessary science in time. But these views seem kinda exotic and unlikely to be someone’s actual views. (Idk maybe you do believe the second one.)
Obviously I had not thought through this in detail when I originally wrote my comment, and my wordless inference was overconfident in hindsight. But I stand by my overall sense that a person who thinks “engineering” approaches are near-useless will likely also have high p(doom) -- not just as a sociological observation, but also as a claim about which positions are consistent with each other.
In your writing you sometimes seem to take as a background assumption that alignment will be very hard. For example, I recall you critiquing assistance games because (my paraphrase) “that’s not what progress on a hard problem looks like”. (I failed to dig up the citation though.)
You’re generally taking a strategy that appears to me to be high variance, which people usually justify via high p(doom) / playing to your outs.
A lot of your writing is similarly flavored to other people who have high p(doom).
In terms of evidence that you have a p(doom) < 50%, I think the main thing that comes to mind is that you argued against Eliezer about this in late 2021, but that was quite a while ago (relative to the evidence above) and I thought you had changed your mind. (Also iirc the stuff you said then was consistent with p(doom) ~ 50%, but it’s long enough ago that I could easily be forgetting things.)
You could point to ~any reasonable subcommunity within AI safety (or the entire community) and I’d still be on board with the claim that there’s at least a 10% chance that will make things worse, which I might summarize as “they won’t reliably improve things”, so I still feel like this isn’t quite capturing the distinction. (I’d include communities focused on “science” in that, but I do agree that they are more likely not to have a negative sign.) So I still feel confused about what exactly your position is.
Kind of a tangent:
Thanks for pointing this out.
I’ve been thinking lately about how much folks around more or less dismiss the idea of an AI pause as unrealistic because we’re not going to get that much political buyin.
I (speculatively) think that this is a bit trapped in a mindset that is assuming the conclusion. Big political changes like that one have happened in the past, and they have often seemed impossible before they happened and inevitable in retrospect. And, when something big like that changes, part of the process is a cascade, where whole deferral structures change their mind / attitude / preferences, about something. How much buyin you have before that cascade happens may not be very indicative of where that cascade can end up.
I, personally, don’t feel like I know how to “call it” when big changes are on the table or when they’re not. But it sure does seem like people are counting us out much too early, given the fundamentals of the situation. We all think that the world is going to change very radically in the next few years. It’s not clear what kinds of cascades are on the table.
I provisionally think that we should feel less bashful about advocating for an AI pause, and more agnostic about how likely that is to come to pass.
I agree with you, but also think you’re not going far enough. In a world where things are changing radically, the space of possibilities opens up dramatically. And so it’s less a question of “does advocating for policy X become viable?”, and more a question of “how can we design the kinds of policies that our past selves wouldn’t even have been able to conceive of?”
In other words, in a world that’s changing a lot, you want to avoid privileging your hypotheses in advance, which is what it feels like the “pro AI pause vs anti AI pause” debate is doing.
(And yes, in some sense those radical future policies might fall into a broad category like “AI pause”. But that doesn’t mean that our current conception of “AI pause” is a very useful guide for how to make those future policies come about.)
Whoa, you think the scenarios I’m focusing on are marginalist? I didn’t expect you to say that. I generally think of what we are doing as (a) forecasting and (b) making ambitious solve-approximately-all-the-problems plans to present to the world. Forecasting isn’t marginalist, it’s a type error to think so, and as for our plans, well, they seem pretty ambitious to me.
I regret using the word “marginalist”; it’s a bit too confusing. But I do have a pretty high bar for what counts as “ambitious” in the political domain: it involves not just getting the system to do something, but trying to change the system itself. Cummings and Thiel are central examples (Geoff Anders maybe was also aiming in that direction at one point).
I’m reminded of Patrick Collison’s (now, I think, quite wise) comment on EA:
I agree with this statement denotatively, and my own interests/work have generally been “driven by open-ended curiosity and a drive to uncover deep truths”, but isn’t this kind of motivation also what got humanity into its current mess? In other words, wasn’t the main driver of AI progress this kind of curiosity (until perhaps the last few years, when it has been driven more by commercial/monetary/power incentives)?
For this reason, I would hesitate to encourage more people to follow their own curiosity, even people who are already in AI safety research, because of illegible safety problems, which can turn their efforts net-negative if they’re insufficiently strategic (and being strategic seems hard while also being driven mainly by curiosity).
I think I’ve personally been lucky, or skilled in some way that I don’t understand, in that my own curiosity has perhaps been more aligned with what’s good than most people’s, but even some of my interests, e.g. in early cryptocurrency, might have been net-negative.
I guess this is related to our earlier discussion about how important being virtuous is to good strategy/prioritization, and my general sense is that consistently good strategy requires a high amount of consequentialist reasoning, because the world is too complicated and changes too much and too frequently to rely on pre-computed shortcuts. It’s hard for me to understand how largely intuitive/nonverbal virtues/curiosity could be doing enough “compute” or “reasoning” to consistently output good strategy.
Interestingly, I was just having a conversation with Critch about this. My contention was that, in the first few decades of the field, AI researchers were actually trying to understand cognition. The rise of deep learning (and especially the kind of deep learning driven by massive scaling) can be seen as the field putting that quest on hold in order to optimize for more legible metrics.
I don’t think you should find this a fully satisfactory answer, because it’s easy to “retrodict” ways that my theory was correct. But that’s true of all explanations of what makes the world good at a very abstract level, including your own answer of metaphilosophical competence. (Also, we can perhaps cash my claim out in predictions, like: was the criticism that deep learning didn’t actually provide good explanations of, or insight into, cognition a significant barrier to more researchers working on it? Without having looked it up, I suspect so.)
I don’t think that’s true. However, I do think it requires deep curiosity about what good strategy is and how it works. It’s not a coincidence that my own research on a theory of coalitional agency was in significant part inspired by strategic failures of EA and AI safety (with this post being one of the earliest building blocks I laid down). I also suspect that the full theory of coalitional agency will in fact explain how to do metaphilosophy correctly, because doing good metaphilosophy is ultimately a cognitive process and can therefore be characterized by a sufficiently good theory of cognition.
Again, I don’t expect you to fully believe me. But what I most want to read from you right now is an in-depth account of which things in the world have gone or are going most right, and the ways in which you think metaphilosophical competence or consequentialist reasoning contributed to them. Without that, it’s hard to trust metaphilosophy or even know what it is (though I think you’ve given a sketch of this in a previous reply to me at some point).
I should also try to write up the same thing, but about how virtues contributed to good things. And maybe also science, insofar as I’m trying to defend doing more science (of cognition and intelligence) in order to help fix risks caused by previous scientific progress.
(First a terminological note: I wouldn’t use the phrase “metaphilosophical competence”, and instead tend to talk about either “metaphilosophy”, meaning the study of the nature of philosophy and philosophical reasoning, how philosophical problems should be solved, etc., or “philosophical competence”, meaning how good someone is at solving philosophical problems or doing philosophical reasoning. And sometimes I talk about them together, as in “metaphilosophy / AI philosophical competence”, because I think solving metaphilosophy is the best way to improve AI philosophical competence. Here I’ll interpret you to just mean “philosophical competence”.)
To answer your question, it’s pretty hard to think of really good examples, I think because humans are very bad at both philosophical competence and consequentialist reasoning, but here are some:
the game theory around nuclear deterrence, helping to prevent large-scale war so far
economics and its influence on government policy, e.g., providing support for property rights, markets, and regulations around things like monopolies and externalities (but it’s failing pretty badly on AGI/ASI)
analytical philosophy making philosophical progress insofar as it asks important questions and delineates various plausible answers (but doing badly insofar as individual philosophers have inappropriate levels of confidence, and failing to focus on the really important problems, e.g., those related to AI safety)
certain philosophers / movements (rationalists, EA) emphasizing philosophical (especially moral) uncertainty to some extent, and realizing the importance of AI safety
MIRI updating on evidence/arguments and pivoting strategy in response (albeit too slowly)
I guess this isn’t an “in-depth account” but I’m also not sure why you’re asking for “in-depth”, i.e., why doesn’t a list like this suffice?
I think non-consequentialist reasoning or ethics probably worked better in the past, when the world changed more slowly and we had more chances to learn from our mistakes (and refine our virtues/deontology over time). So I wouldn’t necessarily find this kind of writing very persuasive, unless it somehow addressed my central concern: virtues do not seem to be the kind of thing that is capable of doing enough “compute/reasoning” to find consistently good strategies in a fast-changing environment on the first try.
If this is true, then it should significantly update us away from the strategy “solve our current problems by becoming more philosophically competent and doing good consequentialist reasoning”, right? If you are very bad at X, then all else equal you should try to solve problems using strategies that don’t require you to do much X.
You might respond that there are no viable strategies for solving our current problems without applying a lot of philosophical competence and consequentialist reasoning. I think scientific competence and virtue ethics are plausibly viable alternative strategies (though the line between scientific and philosophical competence seems blurry to me, as I discuss below). But even given that we disagree on that, humanity solved many big problems in the past without using much philosophical competence and consequentialist reasoning, so it seems hard to be confident that we won’t solve our current problems in other ways.
Out of your examples, the influence of economics seems most solid to me. I feel confused about whether game theory itself made nuclear war more or less likely: e.g. von Neumann was very aggressive, perhaps related to his game theory work, and maybe MAD provided an excuse to stockpile weapons? Also, IIRC the Soviets didn’t really have game theory.
On the analytical philosophy front, the clearest wins seem to be cases where they transitioned from doing philosophy to doing science or math—e.g. the formalization of probability (and economics to some extent too). If this is the kind of thing you’re pointing at, then I’m very much on board—that’s what I think we should be doing for ethics and intelligence. Is it?
Re the AI safety stuff: it all feels a bit too early to say what its effects on the world have been (though on net I’m probably happy it has happened).
Because I have various objections to this list (some of which are detailed above), and with such a succinct list it’s hard to know which aspects of these examples you’re defending, which arguments for their positive effects you find most compelling, etc.