Of course, you might not know which problem your insights allow you to solve until you have the insights. I’m a big fan of constructing stylized problems that you can solve, after you know which insight you want to validate.
That said, I think it’s even better if you can specify problems in advance to help guide research in the field. The big risk, then, is that these problems might not be robust to paradigm shifts (because paradigm shifts could change the set of important problems). If that is your concern, then I think you should probably give object-level arguments that solving auditing games is a bad concrete problem to direct attention to. (Or argue that specifying concrete problems is in general a bad thing.)
Ty for the reply. A few points in response:
The bigger the scientific advance, the harder it is to specify in advance which problems it should solve. You can and should keep track of the unresolved problems in the field, as Neel does, but trying to predict specifically which unresolved problems in biology Darwinian evolution would straightforwardly solve (or which unresolved problems in physics special relativity would straightforwardly solve) is about as hard as generating those theories in the first place.
I expect that when you personally are actually doing your scientific research you are building sophisticated mental models of how and why different techniques work. But I think that in your community-level advocacy you are emphasizing precisely the wrong thing—I want junior researchers to viscerally internalize that their job is to understand (mis)alignment better than anyone else does, not to optimize on proxies that someone else has designed (which, by the nature of the problem, are going to be bad proxies).
It feels like the core disagreement is that I intuitively believe that bad metrics are worse than no metrics, because they actively confuse people/lead them astray. More specifically, I feel like your list of four problems is closer to a list of things that we should expect from an actually-productive scientific field, and getting rid of them would neuter the ability for alignment to make progress:
“Right now, by default research projects get one bit of supervision: After the paper is released, how well is it received?” Not only is this not one bit, I would also struggle to describe any of the best scientists throughout history as being guided primarily by it. Great researchers can tell by themselves, using their own judgment, how good the research is (and if you’re not a great researcher that’s probably the key skill you need to work on).
But also, note how anti-empirical your position is. The whole point of research projects is that they get a huge amount of supervision from reality. The job of scientists is to observe that supervision from reality and construct theories that predict reality well, no matter what anyone else thinks about them. It’s not an exaggeration to say that discarding the idea that intellectual work should be “supervised” by one’s peers is the main reason that science works in the first place (see Strevens for more).
“Lacking objective, consensus-backed progress metrics, the field is effectively guided by what a small group of thought leaders think is important/productive to work on.” Science works precisely because it’s not consensus-backed—see my point on empiricism above. Attempts to make science more consensus-backed undermine the ability to disagree with existing models/frameworks. But also: the “objective metrics” of science are the ability to make powerful, novel predictions in general. If you know specifically what metrics you’re trying to predict, the thing you’re doing is engineering. And some people should be doing engineering (e.g. engineering better cybersecurity)! But if you try to do it without a firm scientific foundation you won’t get far.
I think it’s good that “junior researchers who do join are unsure what to work on.” It is extremely appropriate for them to be unsure what to work on, because the field is very confusing. If we optimize for junior researchers being more confident on what to work on, we will actively be making them less truth-tracking, which makes their research worse in the long term.
Similarly, “it’s hard to tell which research bets (if any) are paying out and should be invested in more aggressively” is just the correct epistemic state to be in. Yes, much of the arguing is unproductive. But what’s much less productive is saying “it would be good if we could measure progress, therefore we will design the best progress metric we can and just optimize really hard for that”. Rather, since evaluating the quality of research is the core skill of being a good scientist, I am happy with junior researchers all disagreeing with each other and just pursuing whichever research bets they want to invest their time in (or the research bets they can get the best mentorship when working on).
Lastly, it’s also good that “it’s hard to grow the field”. Imagine talking to Einstein and saying “your thought experiments about riding lightbeams are too confusing and unquantifiable—they make it hard to grow the field. You should pick a metric of how good our physics theories are and optimize for that instead.” Whenever a field is making rapid progress it’s difficult to bridge the gap between the ontology outside the field and the ontology inside the field. The easiest way to close that gap is simply for the field to stop making rapid progress, which is what happens when something becomes a “numbers-go-up” discipline.
“I think that e.g. RL algorithms researchers have some pretty deep insights about the nature of exploration, learning, etc.”
They have some. But so did Galileo. If you’d turned physics into a numbers-go-up field after Galileo, you would have lost most of the subsequent progress, because you would’ve had no idea which numbers going up would contribute to progress.
I’d recommend reading more about the history of science, e.g. The Sleepwalkers by Koestler, to get a better sense of where I’m coming from.
Thanks, I appreciate you writing up your view in more detail. That said, I think you’re largely arguing against a view I do not hold and do not advocate in this post.
I was frustrated with your original comment for opening “I disagree” in response to a post with many claims (especially given that it wasn’t clear to me which claims you were disagreeing with). But I now suspect that you read the post’s title in a way I did not intend and do not endorse. I think you read it as an exhortation: “Let’s introduce progress metrics!”
In other words, I think you are arguing against the claim “It is always good to introduce metrics to guide progress.” I do not believe this. I strongly agree with you that “bad metrics are worse than no metrics.” Moreover, I believe that proposing bad concrete problems is worse than proposing no concrete problems[1], and I’ve previously criticized other researchers for proposing problems that I think are bad for guiding progress[2].
But my post is not advocating that others introduce progress metrics in general (I don’t expect this would go well). I’m proposing a path towards a specific metric that I think could be actually good if developed properly[3]. So insofar as we disagree about something here, I think it must be:
You think the specific progress-metric-shape I propose is bad, whereas I think it’s good. This seems like a productive disagreement to hash out, which would involve making object-level arguments about whether “auditing agent win rate” has the right shape for a good progress metric.
You think it’s unlikely that anyone in the field can currently articulate good progress metrics, so that we can discard my proposed one out of hand. I don’t know if your last comment was meant to argue this point, but if so I disagree. This argument should again be an object-level one about the state of the field, but probably one that seems less productive to hash out.
You think that introducing progress metrics is actually always bad. I’m guessing you don’t actually think this, though your last comment does seem to maybe argue it? Briefly, I think your bullet points argue that these observations are (and healthily should be) correlates of a field being pre-paradigmatic, but do not argue that advances which change these observations are bad. E.g. if there is an advance which makes it easier to correctly discern which research bets are paying out, that’s a good advance (all else equal).
Another way of saying all this is: I view myself as proposing a special type of concrete problem—one that is especially general (i.e. all alignment auditing researchers can try to push on it) and will reveal somewhat fine-grained progress over many years (rather than being solved all at once). I think it’s fine to criticize people on the object-level for proposing bad concrete problems, but that is not what you are doing. Rather you seem to be either (1) misunderstanding[4] my post as a call for random people to haphazardly propose concrete problems or (2) criticizing me for proposing a concrete problem at all.
[1] FWIW, I think that your arguments continue not to provide a basis for differentiating between concrete problems with vs. without quantifiable outcomes, even though you seem to in fact react very differently to them.
[2] In fact, this is a massive pet peeve of mine. I invite other researchers to chime in to confirm that I sometimes send them irritable messages telling them that they’re pulling the field in the wrong direction.
[3] To be clear, I was not haphazard about picking this progress-metric-shape and developing it was not a simple thing. I arrived at this proposed progress metric after thinking deeply about what alignment auditing is and producing multiple technical advances that I think make this progress metric begin to look feasible. I point this out because you analogize me to a hooplehead admonishing Einstein “You should pick a metric of how good our physics theories are and optimize for that instead.” But (forgive my haughtiness while I work inside this analogy) I view myself as being in the role of Einstein here: someone at the forefront of the field who’s thought deeply about the key problems and is qualified to speak on which concrete problems might advance our understanding.
[4] Insofar as this misunderstanding was due to unclear writing on my part, I apologize.
Dealing with all the points in order:
“Right now, by default research projects get one bit of supervision: After the paper is released, how well is it received?” Not only is this not one bit, I would also struggle to describe any of the best scientists throughout history as being guided primarily by it. Great researchers can tell by themselves, using their own judgment, how good the research is (and if you’re not a great researcher that’s probably the key skill you need to work on).
But also, note how anti-empirical your position is. The whole point of research projects is that they get a huge amount of supervision from reality. The job of scientists is to observe that supervision from reality and construct theories that predict reality well, no matter what anyone else thinks about them. It’s not an exaggeration to say that discarding the idea that intellectual work should be “supervised” by one’s peers is the main reason that science works in the first place (see Strevens for more).
One of the general issues here, as Thomas Kwa has said, is that our goal is in a very real sense an engineering goal with a science project tacked on, because, remember, we aren’t aiming to understand what AIs are like; we are aiming to engineer AIs to have particular properties.
This means that growing the field, and getting AI alignment into a state where we don’t need great researchers who can use their own judgement about which directions are important, becomes more important than usual here.
Also, a big part of the reason why we can get so much feedback from reality, and don’t need a model where intellectual work is “supervised” by one’s peers, is that in most domains it is much, much easier to verify a result from a claimed intellectual peer than it is to generate a good theory.
If we grant for the sake of argument that current AI alignment research is hard to verify, then feedback from reality is much less useful except to those who can already generate good theories, because you have much more uncertainty over whether you accurately showed a theory to be right or wrong, meaning you’d have to rely more on supervision/social evidence if you couldn’t cheaply verify theories.
Sam Marks’s position here is compatible with believing that empiricism is useful and necessary for AI safety, because the goal is to make AI safety research easier to verify.
(Also, given the possibility of infohazards, it’s less good to have everyone do their own verification and generation, and better to have some coordination to avoid the infohazardous parts, but that’s a more minor issue here.)
“Lacking objective, consensus-backed progress metrics, the field is effectively guided by what a small group of thought leaders think is important/productive to work on.” Science works precisely because it’s not consensus-backed—see my point on empiricism above. Attempts to make science more consensus-backed undermine the ability to disagree with existing models/frameworks. But also: the “objective metrics” of science are the ability to make powerful, novel predictions in general. If you know specifically what metrics you’re trying to predict, the thing you’re doing is engineering. And some people should be doing engineering (e.g. engineering better cybersecurity)! But if you try to do it without a firm scientific foundation you won’t get far.
I’ve said this before, but our goal here is engineering; it’s just that science is necessary due to unique features of AI risk.
And also, the thing that makes human science work in practice is that predictions are usually designed to be, or naturally are, easy to verify. But if we buy that AI safety research is currently hard to verify, then it’s much less useful to forgo consensus-backed research, because the thought leaders are also the only ones who can hope to verify anyone else’s predictions.
And also, it’s very possible for engineering to go well beyond what the science has come to understand, so we should put at least some weight on the idea that alignment engineering can go farther than alignment science.
Similarly, “it’s hard to tell which research bets (if any) are paying out and should be invested in more aggressively” is just the correct epistemic state to be in. Yes, much of the arguing is unproductive. But what’s much less productive is saying “it would be good if we could measure progress, therefore we will design the best progress metric we can and just optimize really hard for that”. Rather, since evaluating the quality of research is the core skill of being a good scientist, I am happy with junior researchers all disagreeing with each other and just pursuing whichever research bets they want to invest their time in (or the research bets they can get the best mentorship when working on).
I agree with this, but a fundamental goal here is also to make alignment research easier to verify, so that we can rely less on consensus and, more generally, so that we can figure out how to separate AI alignment research from this particular community.
Lastly, it’s also good that “it’s hard to grow the field”. Imagine talking to Einstein and saying “your thought experiments about riding lightbeams are too confusing and unquantifiable—they make it hard to grow the field. You should pick a metric of how good our physics theories are and optimize for that instead.” Whenever a field is making rapid progress it’s difficult to bridge the gap between the ontology outside the field and the ontology inside the field. The easiest way to close that gap is simply for the field to stop making rapid progress, which is what happens when something becomes a “numbers-go-up” discipline.
I agree that this is the biggest risk of introducing new metrics, and we do need to avoid this failure mode, but I also think we need to figure out a way to expand the field while avoiding these downsides.
A key view of mine is that both IQ/intelligence and the number of people matter, and that increasing the number of people able to productively do AI safety would be good in a multitude of ways, and is as good as simply selecting hard on intelligence.
I shall quote from a comment of mine on this point, about why other people would have been able to discover relativity like Einstein did, which means that even in science-oriented fields, ability isn’t everything. Here’s the link to all these quotes below:
https://www.lesswrong.com/posts/BdPjLDG3PBjZLd5QY/carl-shulman-on-dwarkesh-podcast-june-2023#Can_we_detect_deception_
Yeah. In science the association with things like scientific output, prizes, things like that, there’s a strong correlation and it seems like an exponential effect. It’s not a binary drop-off. There would be levels at which people cannot learn the relevant fields, they can’t keep the skills in mind faster than they forget them. It’s not a divide where there’s Einstein and the group that is 10 times as populous as that just can’t do it. Or the group that’s 100 times as populous as that suddenly can’t do it. The ability to do the things earlier with less evidence and such falls off at a faster rate in Mathematics and theoretical Physics and such than in most fields.
Yes, people would have discovered general relativity just from the overwhelming data and other people would have done it after Einstein.
No, that intuition is not necessarily correct. Machine learning certainly is an area that rewards ability but it’s also a field where empirics and engineering have been enormously influential. If you’re drawing the correlations compared to theoretical physics and pure mathematics, I think you’ll find a lower correlation with cognitive ability.
So in general, a key point to keep in mind is that our goal is engineering, not science, and to the extent that we can avoid hard science problems without damaging the engineering end, we should do so, because our goal isn’t to understand AIs, it’s to control them.