Formerly alignment and governance researcher at DeepMind and OpenAI. Now independent.
Richard_Ngo
FWIW your writings on neuroscience are a central example of “real thinking” in my mind—it seems like you’re trying to actually understand things in a way that’s far less distorted by social pressures and incentives than almost any other writing in the field.
Reading this post led me to find a twitter thread arguing (with a bunch of examples):
One of the curious things about von Neumann was his ability to do extremely impressive technical work while seemingly missing all the big insights.
I then responded to it with my own thread arguing:
I’d even go further—I think we’re still recovering from Von Neumann’s biggest mistakes:
1. Implicitly basing game theory on causal decision theory
2. Founding utility theory on the independence axiom
3. Advocating for nuking the USSR as soon as possible

I’m not confident in my argument, but it suggests the possibility that von Neumann’s concern about his legacy was tracking something important (though, even if so, it’s unlikely that feeling insecure was a good response).
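(For reference, since the independence axiom is doing a lot of the work in point 2: the standard von Neumann–Morgenstern statement is that for all lotteries $A$, $B$, $C$ and any $p \in (0,1]$,

$$A \succeq B \iff pA + (1-p)C \succeq pB + (1-p)C,$$

i.e. mixing both options with the same third lottery, in the same proportion, shouldn’t change which one you prefer.)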
If someone predicts in advance that something is obviously false, and then you come to believe that it’s false, then you should update not just towards thought processes which would have predicted that the thing is false, but also towards thought processes which would have predicted that the thing is obviously false. (Conversely, if they predict that it’s obviously false, and it turns out to be true, you should update more strongly against their thought processes than if they’d just predicted it was false.)
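One way to make this concrete, with made-up numbers purely for illustration: treat a prediction of “false” as assigning the claim roughly $P(\text{true}) = 0.25$, and “obviously false” as roughly $P(\text{true}) = 0.05$. The likelihood ratio favoring the “obviously false” thought process over the merely-“false” one is then

$$\frac{0.95}{0.75} \approx 1.3 \ \text{(if the claim turns out false)}, \qquad \frac{0.05}{0.25} = 0.2 \ \text{(if the claim turns out true)}.$$

So the more confident thought process earns a modestly larger update in its favor when it’s vindicated, and roughly a 5x larger update against it when it isn’t.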
IIRC Eliezer’s objection to bioanchors can be reasonably interpreted as an advance prediction that “it’s obviously false”, though to be confident I’d need to reread his original post (which I can’t be bothered to do right now).
It’s not that moderates and radicals are trying to answer different questions (and the questions moderates are answering are epistemically easier like physics).
That seems totally wrong. Moderates are trying to answer questions like “what are some relatively cheap interventions that AI companies could implement to reduce risk assuming a low budget?” and “how can I cause AI companies to marginally increase that budget?” These questions are very different from—and much easier than—the ones the radicals are trying to answer, like “how can we radically change the governance of AI to prevent x-risk?”
The argument “there are specific epistemic advantages of working as a moderate” isn’t just a claim about categories that everyone agrees exist, it’s also a way of carving up the world. However, you can carve up the world in very misleading ways depending on how you lump different groups together. For example, if a post distinguished “people without crazy-sounding beliefs” from “people with crazy-sounding beliefs”, the latter category would lump together truth-seeking nonconformists with actual crazy people. There’s no easy way of figuring out which categories should be treated as useful vs useless, but the evidence Eliezer cites does seem relevant.
On a more object level, my main critique of the post is that almost all of the bullet points are even more true of, say, working as a physicist. And so structurally speaking I don’t know how to distinguish this post from one arguing “one advantage of looking for my keys closer to a streetlight is that there’s more light!” I.e. it’s hard to know the extent to which these benefits come specifically from focusing on less important things, and therefore are illusory, versus the extent to which you can decouple these benefits from the costs of being a “moderate”.
Yes, that can be a problem. I’m not sure why you think that’s in tension with my comment though.
Thank you Habryka (and the rest of the mod team) for the effort and thoughtfulness you put into making LessWrong good.
I personally have had few problems with Said, but this seems like an extremely reasonable decision. I’m leaving this comment in part to help make you feel empowered to make similar decisions in the future when you think it necessary (and ideally, at a much lower cost of your time).
I think one effect you’re missing is that the big changes are precisely the ones that tend to hinge on factors about which it’s hard to specify important technical details. E.g. “should we move our headquarters to London” or “should we replace the CEO” or “should we change our mission statement” are mostly going to be driven by coalitional politics + high-level intuitions and arguments. Whereas decisions like “should we do X training run or Y training run” are more amenable to technical discussion, but also have less lasting effects.
people in companies care about technical details so to be persuasive you will have to be familiar with them
Big changes within companies are typically bottlenecked much more by coalitional politics than knowledge of technical details.
Underdog bias rules everything around me
On Pessimization
By thinking about reward in this way, I was able to predict[1] and encourage the success of this research direction.
Congratulations on doing this :) More specifically, I think there are two parts of making predictions: identifying a hypothesis at all, and then figuring out how likely the hypothesis is to be true or false. The former part is almost always the hard part, and that’s the bit where the “reward reinforces previous computations” frame was most helpful.
(I think Oliver’s pushback in another comment is getting strongly upvoted because, given a description of your experimental setup, a bunch of people aside from you/Quintin/Steve would have assigned reasonable probability to the right answer. But I wanted to emphasize that I consider generating an experiment that turns out to be interesting (as your frame did) to be the thing that most of the points should be assigned for.)
Ty for the reply. A few points in response:
Of course, you might not know which problem your insights allow you to solve until you have the insights. I’m a big fan of constructing stylized problems that you can solve, after you know which insight you want to validate.
That said, I think it’s even better if you can specify problems in advance to help guide research in the field. The big risk, then, is that these problems might not be robust to paradigm shifts (because paradigm shifts could change the set of important problems). If that is your concern, then I think you should probably give object-level arguments that solving auditing games is a bad concrete problem to direct attention to. (Or argue that specifying concrete problems is in general a bad thing.)
The bigger the scientific advance, the harder it is to specify problems in advance which it should solve. You can and should keep track of the unresolved problems in the field, as Neel does, but trying to predict specifically which unresolved problems in biology Darwinian evolution would straightforwardly solve (or which unresolved problems in physics special relativity would straightforwardly solve) is about as hard as generating those theories in the first place.
I expect that when you personally are actually doing your scientific research you are building sophisticated mental models of how and why different techniques work. But I think that in your community-level advocacy you are emphasizing precisely the wrong thing—I want junior researchers to viscerally internalize that their job is to understand (mis)alignment better than anyone else does, not to optimize on proxies that someone else has designed (which, by the nature of the problem, are going to be bad proxies).
It feels like the core disagreement is that I intuitively believe that bad metrics are worse than no metrics, because they actively confuse people/lead them astray. More specifically, I feel like your list of four problems is closer to a list of things that we should expect from an actually-productive scientific field, and getting rid of them would neuter alignment’s ability to make progress:
“Right now, by default research projects get one bit of supervision: After the paper is released, how well is it received?” Not only is this not one bit, I would also struggle to describe any of the best scientists throughout history as being guided primarily by it. Great researchers can tell by themselves, using their own judgment, how good the research is (and if you’re not a great researcher that’s probably the key skill you need to work on).
But also, note how anti-empirical your position is. The whole point of research projects is that they get a huge amount of supervision from reality. The job of scientists is to observe that supervision from reality and construct theories that predict reality well, no matter what anyone else thinks about them. It’s not an exaggeration to say that discarding the idea that intellectual work should be “supervised” by one’s peers is the main reason that science works in the first place (see Strevens for more).

“Lacking objective, consensus-backed progress metrics, the field is effectively guided by what a small group of thought leaders think is important/productive to work on.” Science works precisely because it’s not consensus-backed—see my point on empiricism above. Attempts to make science more consensus-backed undermine the ability to disagree with existing models/frameworks. But also: the “objective metrics” of science are the ability to make powerful, novel predictions in general. If you know specifically what metrics you’re trying to predict, the thing you’re doing is engineering. And some people should be doing engineering (e.g. engineering better cybersecurity)! But if you try to do it without a firm scientific foundation you won’t get far.
I think it’s good that “junior researchers who do join are unsure what to work on.” It is extremely appropriate for them to be unsure what to work on, because the field is very confusing. If we optimize for junior researchers being more confident about what to work on, we will actively be making them less truth-tracking, which makes their research worse in the long term.
Similarly, “it’s hard to tell which research bets (if any) are paying out and should be invested in more aggressively” is just the correct epistemic state to be in. Yes, much of the arguing is unproductive. But what’s much less productive is saying “it would be good if we could measure progress, therefore we will design the best progress metric we can and just optimize really hard for that”. Rather, since evaluating the quality of research is the core skill of being a good scientist, I am happy with junior researchers all disagreeing with each other and just pursuing whichever research bets they want to invest their time in (or the research bets for which they can get the best mentorship).
Lastly, it’s also good that “it’s hard to grow the field”. Imagine talking to Einstein and saying “your thought experiments about riding lightbeams are too confusing and unquantifiable—they make it hard to grow the field. You should pick a metric of how good our physics theories are and optimize for that instead.” Whenever a field is making rapid progress it’s difficult to bridge the gap between the ontology outside the field and the ontology inside the field. The easiest way to close that gap is simply for the field to stop making rapid progress, which is what happens when something becomes a “numbers-go-up” discipline.
I think that e.g. RL algorithms researchers have some pretty deep insights about the nature of exploration, learning, etc.
They have some. But so did Galileo. If you’d turned physics into a numbers-go-up field after Galileo, you would have lost most of the subsequent progress, because you would’ve had no idea which numbers going up would contribute to progress.
I’d recommend reading more about the history of science, e.g. The Sleepwalkers by Koestler, to get a better sense of where I’m coming from.
I strongly disagree. “Numbers-Go-Up Science” is an oxymoron: great science (especially what Kuhn calls revolutionary science) comes from developing novel models or ontologies which can’t be quantitatively compared to previous ontologies.
Indeed, in an important sense, the reason the alignment problem is a big deal in the first place is that ML isn’t a science which tries to develop deep explanations of artificial cognition, but instead a numbers-go-up discipline.
And so the idea of trying to make (a subfield of) alignment more like architecture design, performance optimization or RL algorithms feels precisely backwards—it steers people directly away from the thing that alignment research should be contributing.
Strongly upvoted. Alignment researchers often feel so compelled to quickly contribute to decreasing x-risk that they end up studying non-robust categories that won’t generalize very far, and sometimes actively make the field more confused. I wish that most people doing this were just trying to do the best science they could instead.
This is a reasonable point, though I also think that there’s something important about the ways that these three frames tie together. In general it seems to me that people underrate the extent to which there are deep and reasonably-coherent intuitions underlying right-wing thinking (in part because right-wing thinkers have been bad at articulating those intuitions). Framing the post this way helps direct people to look for them.
But I could also just say that in the text instead. So if I do another post like this in the future I’ll try your approach and see if that goes better.
Yeah, I agree that it’s easy to err in that direction, and I’ve sometimes done so. Going forward I’m trying to more consistently say the “obviously I wish people just wouldn’t do this” part.
Though note that even claims like “unacceptable by any normal standards of risk management” feel off to me. We’re talking about the future of humanity; there is no normal standard of risk management. This should feel as silly as the US or UK invoking “normal standards of risk management” in debates over whether to join WW2.
FWIW the comments feel fine to me, but I’m guessing that many of the downvotes are partisan.
FWIW I broadcast the former rather than the latter because from the 25% perspective there are many possible worlds which the “stop” coalition ends up making much worse, and therefore I can’t honestly broadcast “this is ridiculous and should stop” without being more specific about what I’d want from the stop coalition.
A (loose) analogy: leftists in Iran who confidently argued “the Shah’s regime is ridiculous and should stop”. It turned out that there was so much variance in how it stopped that this argument wasn’t actually a good one to confidently broadcast, despite in some sense being correct.
You might be interested in this post of mine which makes some related claims.
(Interested to read your post more thoroughly but for now have just skimmed it and not sure when I’ll find time to engage more.)