Richard_Ngo

Formerly an alignment and governance researcher at DeepMind and OpenAI. Now independent.
To answer your question, it’s pretty hard to think of really good examples, I think because humans are very bad at both philosophical competence and consequentialist reasoning, but here are some:
If this is true, then it should significantly update us away from the strategy “solve our current problems by becoming more philosophically competent and doing good consequentialist reasoning”, right? If you are very bad at X, then all else equal you should try to solve problems using strategies that don’t require you to do much X.
You might respond that there are no viable strategies for solving our current problems without applying a lot of philosophical competence and consequentialist reasoning. I think scientific competence and virtue ethics are plausibly viable alternative strategies (though the line between scientific and philosophical competence seems blurry to me, as I discuss below). But even setting that disagreement aside, humanity solved many big problems in the past without using much philosophical competence or consequentialist reasoning, so it seems hard to be confident that we won’t solve our current problems in other ways.
Out of your examples, the influence of economics seems most solid to me. I feel confused about whether game theory itself made nuclear war more or less likely—e.g. von Neumann was very aggressive, perhaps related to his game theory work, and maybe MAD provided an excuse to stockpile weapons? Also, IIRC the Soviets didn’t really have game theory.
On the analytical philosophy front, the clearest wins seem to be cases where they transitioned from doing philosophy to doing science or math—e.g. the formalization of probability (and economics to some extent too). If this is the kind of thing you’re pointing at, then I’m very much on board—that’s what I think we should be doing for ethics and intelligence. Is it?
Re the AI safety stuff: it all feels a bit too early to say what its effects on the world have been (though on net I’m probably happy it has happened).
I guess this isn’t an “in-depth account” but I’m also not sure why you’re asking for “in-depth”, i.e., why doesn’t a list like this suffice?
Because I have various objections to this list (some of which are detailed above) and with such a succinct list it’s hard to know which aspects of them you’re defending, which arguments for their positive effects you find most compelling, etc.
Inasmuch as you are actually trying to have a conversation with Neel or address Neel’s argument on its merits, it would be good to be clear that this is the crux.
The first two paragraphs of my original comment were trying to do this. The rest wasn’t. I flagged this in the sentence “The rest of my comment isn’t directly about this post, but close enough that this seems like a reasonable place to put it.” However, I should have been clearer about the distinction. I’ve now added the following:
EDIT: to be more clear: the rest of this comment is not primarily about Neel or “pragmatic interpretability”, it’s about parts of the field that I consider to be significantly less relevant to “solving alignment” than that (though work that’s nominally on pragmatic interpretability could also fall into the same failure modes). I clarify my position further in this comment; thanks Rohin for the pushback.
Reflecting further, I think there are two parts of our earlier exchange that are a bit suspicious. The first is when I say that everyone seems to have “given up” (rather than something more nuanced like “given up on tackling the most fundamental aspects of the problem”). The second is where you summarize my position as being that we need deep scientific understanding or else everyone dies (which I think you can predict is a pretty unlikely position for me in particular to hold).
So what’s going on here? It feels like we’re both being “anchored” by extreme positions. You were rounding me off to doomerism, and I was rounding the marginalists off to “giving up”. Both I’d guess are artifacts of writing quickly and a bit frustratedly. Probably I should write a full post or shortform that characterizes more precisely what “giving up” is trying to point to.
(Incidentally, I feel like you still aren’t quite pinning down your position—depending on what you mean by “reliably” I would probably agree with “marginalist approaches don’t reliably improve things”. I’d also agree with “X doesn’t reliably improve things” for almost any interesting value of X.)
My instinctive reaction is that this depends a lot on whether by “marginalist approaches” we mean something closer to “a single marginalist approach” or “the set of all people pursuing marginalist approaches”. I think we both agree that no single marginalist approach (e.g. investigating a given technique) makes reliable progress. However, I’d guess that I’m more willing than you to point to a broad swathe of people pursuing marginalist approaches and claim that they won’t reliably improve things.
I expect it’s not worth our time to dig too deep into whose position is more common here. But I think that a lot of people on LW have high P(doom) in significant part because they share my intuition that marginalist approaches don’t reliably work. I do agree that my combination of “marginalist approaches don’t reliably improve things” and “P(doom) is <50%” is a rare one, but I was only making the former point above (and people upvoted it accordingly), so it feels a bit misleading to focus on the rareness of the overall position.
(Interestingly, while the combination I describe above is a rare one, the opposite combination is also rare—Daniel Kokotajlo is the only person who comes to mind who disagrees with me on both of these propositions simultaneously. Note that he doesn’t characterize his current work as marginalist, but even aside from that question I think this characterization of him is accurate—e.g. he has talked to me about how changing the CEO of a given AI lab could swing his P(doom) by double digit percentage points.)
I agree with this statement denotatively, and my own interests/work have generally been “driven by open-ended curiosity and a drive to uncover deep truths”, but isn’t this kind of motivation also what got humanity into its current mess? In other words, wasn’t the main driver of AI progress this kind of curiosity (until perhaps the recent few years when it has been driven more by commercial/monetary/power incentives)?
Interestingly, I was just having a conversation with Critch about this. My contention was that, in the first few decades of the field, AI researchers were actually trying to understand cognition. The rise of deep learning (and especially the kind of deep learning driven by massive scaling) can be seen as the field putting that quest on hold in order to optimize for more legible metrics.

I don’t think you should find this a fully satisfactory answer, because it’s easy to “retrodict” ways in which my theory was correct. But that’s true of all explanations of what makes the world good at a very abstract level, including your own answer of metaphilosophical competence. (Also, we can perhaps cash my claim out in predictions, like: was the criticism that deep learning didn’t actually provide good explanations of or insight into cognition a significant barrier to more researchers working on it? Without having looked it up, I suspect so.)
consistently good strategy requires a high amount of consequentialist reasoning
I don’t think that’s true. However, I do think it requires deep curiosity about what good strategy is and how it works. It’s not a coincidence that my own research on a theory of coalitional agency was in significant part inspired by strategic failures of EA and AI safety (with this post being one of the earliest building blocks I laid down). I also suspect that the full theory of coalitional agency will in fact explain how to do metaphilosophy correctly, because doing good metaphilosophy is ultimately a cognitive process and can therefore be characterized by a sufficiently good theory of cognition.
Again, I don’t expect you to fully believe me. But what I most want to read from you right now is an in-depth account of which things in the world have gone or are going most right, and the ways in which you think metaphilosophical competence or consequentialist reasoning contributed to them. Without that, it’s hard to trust metaphilosophy or even know what it is (though I think you’ve given a sketch of this in a previous reply to me at some point).
I should also try to write up the same thing, but about how virtues contributed to good things. And maybe also science, insofar as I’m trying to defend doing more science (of cognition and intelligence) in order to help fix risks caused by previous scientific progress.
In trying to reply to this comment I identified four “waves” of AI safety and listed the central people in each wave. Since this is socially complicated I’ll only share the full list of the first wave here, and please note that this is all based on fuzzy intuitions gained via gossip and other unreliable sources.
The first wave I’ll call the “founders”; I think of them as the people who set up the early institutions and memeplexes of AI safety before around 2015. My list:
Eliezer Yudkowsky
Michael Vassar
Anna Salamon
Carl Shulman
Scott Alexander
Holden Karnofsky
Nick Bostrom
Robin Hanson
Wei Dai
Shane Legg
Geoff Anders
The second wave I’ll call the “old guard”; those were the people who joined or supported the founders before around 2015. A few central examples include Paul Christiano, Chris Olah, Andrew Critch and Oliver Habryka.
Around 2014/2015 AI safety became significantly more professionalized and growth-oriented. Bostrom published Superintelligence, the Puerto Rico conference happened, OpenAI was founded, DeepMind started a safety team (though I don’t recall exactly when), and EA started seriously pushing people towards AI safety. I’ll call the people who entered the field from then until around 2020 “safety scalers” (though I’m open to better names). A few central examples include Miles Brundage, Beth Barnes, John Wentworth, Rohin Shah, Dan Hendrycks and myself.
And then there’s the “newcomers” who joined in the last 5-ish years. I have a worse mental map of these people, but some who I respect are Leo Gao, Sahil, Marius Hobbhahn and Jesse Hoogland.
In this comment I expressed concern that my generation (by which I mean the “safety scalers”) has kinda given up on solving alignment. But another higher-level concern is: are people from these last two waves the kinds of people who would have been capable of founding AI safety in the first place? And if not, where are those people now? Of course there’s some difference in the skills required for founding a field vs pushing the field forward, but to a surprising extent I keep finding that the people who I have the most insightful conversations with are the ones who were around from the very beginning. E.g. I think Vassar is the single person doing the best thinking about the lessons we can learn from the failures of AI safety over the last decade (though he’s hard to interface with), Yudkowsky is still the single person who’s most able to push the Overton window towards taking alignment seriously (even though in principle many other people could have written (less doomy versions of) his Time op-ed or his recent book), Scott is still the single best blogger in the space, and so on.
Relatedly, when I talk to someone who’s exceptionally thoughtful about politics (and particularly the psychological aspects of politics), a disturbingly large proportion of the time it turns out that they worked at (or were somehow associated with) Leverage. This is really weird to me. Maybe I just have Leverage-aligned tastes/networks, but even so, it’s a very striking effect. (Also, how come there’s no young Moldbug?)
Assuming that I’m gesturing at something real, what are some possible explanations?
There was a unique historical period during which blogging culture was coming online, during which a bunch of ideas and people could come together. This is hard for anyone to replicate now, and so they can’t “level up” in the same way.
This is just what it’s like to be “inside” a paradigm in general. Founding it seems like a really impressive achievement that nobody can match by pushing it forward incrementally; and the founders seem brilliant because they can operate the paradigm better than anyone else. Eventually the issues with this paradigm will pile up enough that someone else can found a new paradigm.
The “takeover” of AI safety by EA changed the kinds of people who were attracted to it. The kinds of people who could found a movement have gone elsewhere (but where?)
The world produces fewer of the kinds of people who are capable of founding movements like this now (but why?)
This is all only a rough gesture at the phenomenon, and you should be wary that I’m just being pessimistic rather than identifying something important. Also it’s a hard topic to talk about clearly because it’s loaded with a bunch of social baggage. But I do feel pretty confused and want to figure this stuff out.
Yepp, makes sense, and it’s a good reminder for me to be careful about how I use these terms.
One clarification I’d make to your original comment though is that I don’t endorse “you have to deeply understand intelligence from first principles else everyone dies”. My position is closer to “you have to be trying to do something principled in order for your contribution to be robustly positive”. Relatedly, agent foundations and mech-interp are approximately the only two parts of AI safety that seem robustly good to me—with a bunch of other stuff like RLHF, or evals, or (almost all) governance work, I feel pretty confused about whether they’re good or bad or basically just wash out even in expectation.
This is still consistent with risk potentially being reduced by what I call engineering-type work, it’s just that IMO that involves us “getting lucky” in an important way which I prefer we not rely on. (And trying to get lucky isn’t a neutral action—engineering-type work can also easily have harmful effects.)
I agree that there are some ways in which my comment did not meet the standard that I was holding your post to. I think this is defensible because I hold things to higher standards when they’re more prominent (e.g. posts versus shortforms or comments), and also because I hold things to higher standards when they’re making stronger headline claims. In my case, my headline claim was “I feel confused”. If I had instead made the headline claim “Mikhail is untrustworthy”, then I think it would have been very reasonable for you to be angry at this.
I think that my criticism contains some moves that I wish your criticism had more of. In particular, I set a standard for what I wanted from your criticism:
I think of good critiques as trying to identify standards of behavior that should be met, and comparing people or organizations to those standards, rather than just throwing accusations at them.
and provide a central example of you not meeting this standard:
“Anthropic is untrustworthy” is an extremely low-resolution claim
I also primarily focused on drawing conclusions about the post itself (e.g. “My overall sense is that people should think of the post roughly the way they think of a compilation of links”) and relegated the psychologizing to the end. I accept that you would have preferred that I skip it entirely, but it’s a part of “figuring out what’s up with Mikhail”, which is an epistemic move that I endorse people doing after they’ve laid out a disagreement (but not as a primary approach to that disagreement).
Some examples of statements where it’s pretty hard for me to know how much the statements straightforwardly follow from the evidence you have, vs being things that you’ve inferred because they seem plausible to you:
1. Jack Clark would have known this.
2. Anthropic’s somewhat less problematic behavior is fully explained by having to maintain a good image internally.
3. Anthropic is now basically just as focused on commercializing its products.
4. Anthropic’s talent is a core pitch to investors: they’ve claimed they can do what OpenAI can for 10x cheaper.
5. It seems likely that the policy positions that Anthropic took early on were related to these incentives.
6. Anthropic’s mission is not really compatible with the idea of pausing, even if evidence suggests it’s a good idea to.
If we zoom in on #3, for instance: there’s a sense in which it’s superficially plausible because both OpenAI and Anthropic have products. But maybe Anthropic and OpenAI differ greatly on, say, the ratio of headcount, or the ratio of executives’ time, or the amount of compute, or the internal prestige allocated to commercialization vs other things (like alignment research). If so, then it’s not really accurate to say that they’re just as focused on commercialization. But I don’t know if knowledge of these kinds of considerations informed your claim, or if you’re only making the superficially plausible version of the claim.
To be clear, in general I don’t expect people to apply this level of care for most LW posts. But when it comes to accusations of untrustworthiness (and similar kinds of accountability mechanisms) I think it’s really valuable to be able to create common knowledge of the specific details of misbehavior. Hence I would have much preferred this post to focus on a smaller set of claims that you can solidly substantiate, and then only secondarily try to discuss what inferences we should draw from those. Whereas I think that the kinds of criticism you make here mostly create a miasma of distrust between Anthropic and LessWrong, without adding much common knowledge of the form “Anthropic violated clear and desirable standard X” for the set of good-faith AI safety actors.
I also realize that by holding this standard I’m making criticism more costly, because now you have the stress of trying to justify yourself to me. I would have tried harder to mitigate that cost if I hadn’t noticed this pattern of not-very-careful criticism from you. I do sympathize with your frustration that people seem to be naively trusting Anthropic and ignoring various examples of shady behavior. However I also think people outside labs really underestimate how many balls lab leaders have up in the air at once, and how easy it is to screw up a few of them even if you’re broadly trustworthy. I don’t know how to balance these considerations, especially because the community as a whole has historically erred on the side of the former mistake. I’d appreciate people helping me think through this, e.g. by working through models of how applying pressure to bureaucratic organizations goes successfully, in light of the ways that such organizations become untrustworthy (building on Zvi’s moral mazes sequence for instance).
I regret using the word “marginalist”, it’s a bit too confusing. But I do have a pretty high bar for what counts as “ambitious” in the political domain—it involves not just getting the system to do something, but rather trying to change the system itself. Cummings and Thiel are central examples (Geoff Anders maybe also was aiming in that direction at one point).
I think me using the word “marginalist” was probably a mistake, because it conflates two distinct things that I’m skeptical about:
People no longer trying to make models more aligned (but e.g. trying to do work that primarily cashes out in political outcomes). This is what I mean by “not even aspiring to be the type of thing that could solve alignment”.
People using engineering-type approaches (rather than science-type approaches) to try to make models more aligned.
The list I gave above was of things that fall into category 1, whereas (almost?) all of the things you named fall into category 2. What I want more of is category 3: science-type approaches. One indicator that something is a science-type approach is that it could potentially help us understand something fundamental about intelligence; another is that, if it works, we’ll know in advance (I used to not care about this, but have changed my mind).
I think there are versions of most of the things you named that could be in category 3, but people mostly seem to be doing category-2 versions of them, in significant part because of the sort of EA-style reasoning that I was criticizing from Neel’s original post.
When I wrote “pragmatic interpretability feels like another step in that direction” I meant something like: ambitious interpretability was trying to do 3, and pragmatic interpretability seems like it’s nominally trying to do 2, and may in practice end up being mostly 1. For example, “Stop models acting differently when tested” could be a part of an engineering-type pipeline for fixing misalignments in models, but could also end up drifting towards “help us get better evidence to convince politicians and lab leaders of things”. However, I’m not claiming that pragmatic interpretability is a central example of “not even aspiring to be the type of thing that could solve alignment”. Apologies for the bad phrasings.
Thanks for writing this up. While I don’t have much context on what specifically has gone well or badly for your team, I do feel pretty skeptical about the types of arguments you give at several points: in particular focusing on theories of change, having the most impact, comparative advantage, work paying off in 10 years, etc. I expect that this kind of reasoning itself steers people away from making important scientific contributions, which are often driven by open-ended curiosity and a drive to uncover deep truths.
(A provocative version of this claim: for the most important breakthroughs, it’s nearly impossible to identify a theory of change for them in advance. Imagine Newton or Darwin trying to predict how understanding mechanics/evolution would change the world. Now imagine them trying to do that before they had even invented the theory! And finally imagine if they only considered plans that they thought would work within 10 years, and the sense of scarcity and tension that would give rise to.)
The rest of my comment isn’t directly about this post, but close enough that this seems like a reasonable place to put it. EDIT: to be more clear: the rest of this comment is not primarily about Neel or “pragmatic interpretability”, it’s about parts of the field that I consider to be significantly less relevant to “solving alignment” than that (though work that’s nominally on pragmatic interpretability could also fall into the same failure modes). I clarify my position further in this comment; thanks Rohin for the pushback.
I get the sense that there was a “generation” of AI safety researchers who have ended up with a very marginalist mindset about AI safety. Some examples:
the evals that Beth Barnes (and maybe Dan Hendrycks?) are focusing on
the scenarios that Daniel Kokotajlo is focusing on
the models of misalignment that Evan Hubinger is focusing on
the forecasting that the OpenPhil worldview investigations team focused on
scary demos
safety cases
policy approaches like SB-1047
In other words, whole swathes of the field are not even aspiring to be the type of thing that could solve misalignment. In the terminology of this excellent post, they are all trying to attack a category I problem, not a category II problem. Sometimes it feels like almost the entire field (EDIT: most of the field) is Goodharting on the subgoal of “write a really persuasive memo to send to politicians”. Pragmatic interpretability feels like another step in that direction (EDIT: but still significantly more principled than the things I listed above).

This is all related to something Buck recently wrote: “I spend most of my time thinking about relatively cheap interventions that AI companies could implement to reduce risk assuming a low budget, and about how to cause AI companies to marginally increase that budget”. I’m sure Buck has thought a lot about his strategy here, and I’m sure that you’ve thought a lot about your strategy as laid out in this post, and so on. But a part of me is sitting here thinking: man, everyone sure seems to have given up. (And yes, I know it doesn’t feel like giving up from the inside, but from my perspective that’s part of the problem.)
Now, a lot of the “old guard” seems to have given up too. But they at least know what they’ve given up on. There was an ideal of fundamental scientific progress that MIRI and Paul and a few others were striving towards; they knew at least what it would feel like (if not what it would look like) to actually make progress towards understanding intelligence. Eliezer and various others no longer think that’s plausible. I disagree. But aside from the object-level disagreement, I really want people to be aware that this is a thing that’s at least possible in principle to aim for, lest the next generation of the AI safety community end up giving up on it before they even know what they’ve given up on.
(I’ll leave for another comment/post the question of what went wrong in my generation. The “types of arguments” I objected to above all seem quite EA-flavored, and so one salient possibility is just that the increasing prominence of EA steered my generation away from the type of mentality in which it’s even possible to aim towards scientific breakthroughs. But even if that’s one part of the story, I expect it’s more complicated than that.)
Thinking more about the cellular automaton stuff: okay, so Game of Life is Turing complete. But the question is whether we can pin down properties that GoL has that Turing machines don’t have.
I have a vague recollection that parallel Turing Machines are a thing, but this paper claims that the actual formalisms are disappointing. One nice thing about Game of Life is that the way that different programs interact internally (via game of life physics) is also how they interact with each other. Whereas any multi-tape Turing Machine (even one with clever rules about how to integrate inputs from multiple tapes) wouldn’t have that property.
I feel like I’m not getting beyond the original idea that Game of Life could have adversarial robustness in a way that Turing Machines don’t. But it feels like you’d need to demonstrate this with some construction that’s actually adversarially robust, which seems difficult.
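To make the “same physics internally and externally” point more concrete, here is a minimal sketch of the Game of Life update rule (my own illustration, not something from the original discussion): a “program” such as a glider and whatever environment it meets are both just sets of live cells evolving under one local rule, with no separate input interface of the kind a multi-tape Turing Machine has.

```python
# Minimal sketch of the Game of Life update rule (illustrative only).
# Every cell, whether part of a "program" (e.g. a glider) or of the surrounding
# environment, is updated by the same local rule, so internal and external
# interactions run on the same physics.

from itertools import product

def neighbors(cell):
    x, y = cell
    return {(x + dx, y + dy) for dx, dy in product((-1, 0, 1), repeat=2) if (dx, dy) != (0, 0)}

def step(live):
    """One update over a set of live-cell coordinates."""
    candidates = live | {n for c in live for n in neighbors(c)}
    return {c for c in candidates
            if len(neighbors(c) & live) == 3
            or (len(neighbors(c) & live) == 2 and c in live)}

# A glider approaching a block: both are just patterns under the same rule.
world = {(1, 0), (2, 1), (0, 2), (1, 2), (2, 2)} | {(8, 8), (8, 9), (9, 8), (9, 9)}
for _ in range(4):
    world = step(world)
print(sorted(world))
```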
Your post is great, I encourage you to repost it.
Sometimes, conclusions don’t need to be particularly nuanced. Sometimes, a system is built of many parts, and yet a valid, non-misleading description of that system as a whole is that it is untrustworthy.
The central case where conclusions don’t need to be particularly nuanced is when you’re engaged in a conflict and you’re trying to attack the other side.
In other cases, when you’re trying to figure out how the world works and act accordingly, nuance typically matters a lot.
Calling an organization “untrustworthy” is like calling a person “unreliable”. Of course some people are more reliable than others, but when you smuggle in implicit binary standards you are making it harder in a bunch of ways to actually model the situation.
I sent Mikhail the following via DM, in response to his request for “any particular parts of the post [that] unfairly attack Anthropic”:
I think that the entire post is optimized to attack Anthropic, in a way where it’s very hard to distinguish between evidence you have, things you’re inferring, standards you’re implicitly holding them to, standards you’re explicitly holding them to, etc.
My best-guess mental model here is that you were more careful about this post than about the other posts, but that there’s a common underlying generator to all of them, which is that you’re missing some important norms about how healthy critique should function.
I don’t expect to be able to convey those norms or their importance to you in this exchange, but I’ll consider writing up a longform post about them.

I think Situational Awareness is a pretty good example of what it looks like for an essay to be optimized for a given outcome at the expense of epistemic quality. In Situational Awareness, it’s less that any given statement is egregiously false, and more that there were many choices made to try to create a conceptual frame that promoted racing. I have critiqued this at various points (and am writing up a longer critique) but what I wanted from Leopold was something more like “here are the key considerations in my mind, here’s how I weigh them up, here’s my nuanced conclusion, here’s what would change my mind”. And that’s similar to what I want from posts like yours too.
I find this terrifying, that I might be incompetent in many ways, and that if I had a little more awareness, a little more “oomph” I could be much better.
Consider whether the awareness of the terror is itself one of the key steps towards becoming more competent.
That is, much incompetence is caused by suppressed fear, which thereby becomes a self-fulfilling prophecy.
(Apologies for the vagueness here, though I guess my sequence on this elaborates.)
Someone on the EA forum asked why I’ve updated away from public outreach as a valuable strategy. My response:
I used to not actually believe in heavy-tailed impact. On some gut level I thought that early rationalists (and to a lesser extent EAs) had “gotten lucky” in being way more right than academic consensus about AI progress. I also implicitly believed that e.g. Thiel and Musk and so on kept getting lucky, because I didn’t want to picture a world in which they were actually just skillful enough to keep succeeding (due to various psychological blockers).
Now, thanks to dealing with a bunch of those blockers, I have internalized to a much greater extent that you can actually be good not just lucky. This means that I’m no longer interested in strategies that involve recruiting a whole bunch of people and hoping something good comes out of it. Instead I am trying to target outreach precisely to the very best people, without compromising much.
Relatedly, I’ve updated that the very best thinkers in this space are still disproportionately the people who were around very early. The people you need to soften/moderate your message to reach (or who need social proof in order to get involved) are seldom going to be the ones who can think clearly about this stuff. And we are very bottlenecked on high-quality thinking.
(My past self needed a lot of social proof to get involved in AI safety in the first place, but I also “got lucky” in the sense of being exposed to enough world-class people that I was able to update my mental models a lot—e.g. watching the OpenAI board coup close up, various conversations with OpenAI cofounders, etc. This doesn’t seem very replicable—though I’m trying to convey a bunch of the models I’ve gained on my blog, e.g. in this post.)
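As a toy illustration of what taking heavy-tailed impact seriously implies (the distribution and parameters below are made-up assumptions, not a claim about actual numbers): if individual impact is Pareto-distributed, a small fraction of people accounts for most of the total, which is the intuition behind targeting outreach narrowly at the very best people rather than recruiting broadly.

```python
# Toy illustration with made-up parameters: under a Pareto distribution of
# individual impact, the top 1% of people account for a large share of the total.

import random

random.seed(0)
alpha = 1.3  # assumed tail index; values closer to 1 mean a heavier tail
impacts = sorted((random.paretovariate(alpha) for _ in range(10_000)), reverse=True)
top_1_percent_share = sum(impacts[:100]) / sum(impacts)
print(f"Top 1% account for {top_1_percent_share:.0%} of total impact in this toy model")
```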
I feel confused about how to engage with this post. I agree that there’s a bunch of evidence here that Anthropic has done various shady things, which I do think should be collected in one place. On the other hand, I keep seeing aggressive critiques from Mikhail that I think are low-quality (more context below), and I expect that a bunch of this post is “spun” in uncharitable ways.
That is, I think of the post as primarily trying to do the social move of “lower trust in Anthropic” rather than the epistemic move of “try to figure out what’s up with Anthropic”. The latter would involve discussion of considerations like: sometimes lab leaders need to change their minds. To what extent are disparities in their statements and actions evidence of deceptiveness versus changing their minds? Etc. More generally, I think of good critiques as trying to identify standards of behavior that should be met, and comparing people or organizations to those standards, rather than just throwing accusations at them.
EDIT: as one salient example, “Anthropic is untrustworthy” is an extremely low-resolution claim. Someone who was trying to help me figure out what’s up with Anthropic should e.g. help me calibrate what they mean by “untrustworthy” by comparison to other AI labs, or companies in general, or people in general, or any standard that I can agree or disagree with. Whereas someone who was primarily trying to attack Anthropic is much more likely to use that particular term as an underspecified bludgeon.
My overall sense is that people should think of the post roughly the way they think of a compilation of links, and mostly discard the narrativizing attached to it (i.e. do the kind of “blinding yourself” that Habryka talks about here).
Context: I’m thinking in particular of two critiques. The first was of Oliver Habryka. I feel pretty confident that this was a bad critique, which overstated its claims on the basis of pretty weak evidence. The second was of Red Queen Bio. Again, it seemed like a pretty shallow critique: it leaned heavily on putting the phrases “automated virus-producing equipment” and “OpenAI” in close proximity to each other, without bothering to spell out clear threat models or what he actually wanted to happen instead (e.g. no biorisk companies take money from OpenAI? No companies that are capable of printing RNA sequences use frontier AI models?).
In that case I didn’t know enough about the mechanics of “virus-producing equipment” to have a strong opinion, but I made a mental note that Mikhail tended to make “spray and pray” critiques that lowered the standard of discourse. (Also, COI note: I’m friends with the founders of Red Queen Bio, and was one of the people encouraging them to get into biorisk in the first place. I’m also friends with Habryka, and have donated recently to Lightcone. EDIT to add: about 2⁄3 of my net worth is in OpenAI shares, which could become slightly more valuable if Red Queen Bio succeeds.)
Two (even more) meta-level considerations here (though note that I don’t consider these to be as relevant as the stuff above, and don’t endorse focusing too much on them):
For reference, the other person I’ve drawn the most similar conclusion about was Alexey Guzey (based on, e.g., his critiques here, here, and in some internal OpenAI docs). I notice that he and Mikhail are both Russian. I do have some sympathy for the idea that in Russia it’s very appropriate to assume a lot of bad faith from power structures, and I wonder if that’s a generator for these critiques.
I’m curious if this post was also (along with the Habryka critique) one of Mikhail’s daily Inkhaven posts. If so it seems worth thinking about whether there are types of posts that should be written much more slowly, and which Inkhaven should therefore discourage from being generated by the “ship something every day” process.
Hmm, I don’t have anything substantive out on this specifically; the closest is probably this talk (though note that some of my arguments in it were a bit sloppy, e.g. as per the top comment).
if there are sufficiently many copies, it becomes impossible to corrupt them all at once.
So I don’t love this model because escaping corruption is ‘too easy’.
I really like the cellular automaton model. But I don’t think it makes escaping corruption easy! Even if most of the copies are non-corrupt, the question is how you can take a “vote” of the corrupt vs non-corrupt copies without the voting mechanism itself being easily corrupted. That’s why I was talking about the non-corrupt copies needing to “overpower” the corrupt copies above.
I’m not surprised by this, my sense is that it’s usually young people and outsiders who pioneer new fields. Older people are just so much more shaped by existing paradigms, and also have so much more to lose, that it outweighs the benefits of their expertise and resources.
Also 1993 to 2000 doesn’t seem like that large a gap to me. Though I guess the thing I’m pointing at could also be summarized as “why hasn’t someone created a new paradigm of AI safety in the last decade?” And one answer is that Paul and Chris and a few others created a half-paradigm of “ML safety”, but it hasn’t yet managed to show impressive enough results to fully take over. However, it did win on a memetic level amongst EAs in particular.
The task at hand might then be understood as synthesizing the original “AI safety” with “ML safety”. Or, to put it a bit more poetically, it’s synthesizing the rationalist approach to aligning AGI with the empiricist approach to aligning AGI.