> “Right now, by default research projects get one bit of supervision: After the paper is released, how well is it received?” Not only is this not one bit, I would also struggle to describe any of the best scientists throughout history as being guided primarily by it. Great researchers can tell by themselves, using their own judgment, how good the research is (and if you’re not a great researcher that’s probably the key skill you need to work on).
>
> But also, note how anti-empiricism your position is. The whole point of research projects is that they get a huge amount of supervision from reality. The job of scientists is to observe that supervision from reality and construct theories that predict reality well, no matter what anyone else thinks about them. It’s not an exaggeration to say that discarding the idea that intellectual work should be “supervised” by one’s peers is the main reason that science works in the first place (see Strevens for more).
One of the general issues here, as Thomas Kwa has said, is that our goal is in a very real sense an engineering goal with a science project tacked on: we aren’t aiming to understand what AIs are like, we are aiming to engineer AIs to have particular properties.
This means that growing the field, and getting AI alignment into a state where we don’t need great researchers who can use their own judgment about which directions are important, matters more here than it would in a normal scientific field.
Also, a big part of the reason we can get so much feedback from reality, and don’t need a model where intellectual work is “supervised” by one’s peers, is that in most domains it is much, much easier to verify a claimed result than it is to generate a good theory in the first place.
If we grant for the sake of argument that current AI alignment research is hard to verify, then feedback from reality is much less useful except to those who can already generate good theories: you have much more uncertainty over whether you actually showed a theory to be right or wrong, so you’d have to rely more on supervision/social evidence if you couldn’t cheaply verify theories.
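To make the verify/generate asymmetry concrete, here is a toy sketch of my own (not from the original discussion; the subset-sum problem is just an illustrative stand-in): checking a claimed answer takes a single linear pass, while producing one from scratch requires searching an exponentially large space.

```python
from collections import Counter
from itertools import combinations

def verify(numbers, claimed_subset, target):
    # Cheap: one pass confirms the claimed subset is drawn from `numbers`
    # (with the right multiplicities) and actually sums to `target`.
    return not (Counter(claimed_subset) - Counter(numbers)) and sum(claimed_subset) == target

def generate(numbers, target):
    # Expensive: brute-force search over all 2^n subsets.
    for size in range(len(numbers) + 1):
        for subset in combinations(numbers, size):
            if sum(subset) == target:
                return list(subset)
    return None

nums = [3, 34, 4, 12, 5, 2]
claimed = generate(nums, 9)               # exponential-time generation -> [4, 5]
print(claimed, verify(nums, claimed, 9))  # linear-time verification -> True
```

The analogy to research is loose, but it gestures at why losing cheap verification is so costly: without it, feedback from reality mostly benefits the people who could already generate and check good theories on their own.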
Sam Marks’s position here is compatible with believing that empiricism is useful and necessary for AI safety, because the goal is to make AI safety research easier to verify.
(Also, given the possibility of infohazards, it’s less good to have everyone do their own verification and generation, and better to have some coordination to avoid the infohazardous parts, but that’s a more minor issue here.)
> “Lacking objective, consensus-backed progress metrics, the field is effectively guided by what a small group of thought leaders think is important/productive to work on.” Science works precisely because it’s not consensus-backed—see my point on empiricism above. Attempts to make science more consensus-backed undermine the ability to disagree with existing models/frameworks. But also: the “objective metrics” of science are the ability to make powerful, novel predictions in general. If you know specifically what metrics you’re trying to predict, the thing you’re doing is engineering. And some people should be doing engineering (e.g. engineering better cybersecurity)! But if you try to do it without a firm scientific foundation you won’t get far.
I’ve said this before, but our goal here is engineering; it’s just that science is necessary due to unique features of AI risk.
And also, the thing that makes human science work in practice is that predictions are usually easy to verify, whether by design or naturally; but if we buy that AI safety research is currently hard to verify, then forgoing consensus-backed research is much less useful, because the thought leaders are also the only ones who can hope to verify anyone else’s predictions.
And also, it’s very possible for engineering to go well beyond what the underlying science has come to understand (steam engines preceded thermodynamics, for instance), so we should put at least some weight on the idea that alignment engineering can go farther than alignment science.
> Similarly, “it’s hard to tell which research bets (if any) are paying out and should be invested in more aggressively” is just the correct epistemic state to be in. Yes, much of the arguing is unproductive. But what’s much less productive is saying “it would be good if we could measure progress, therefore we will design the best progress metric we can and just optimize really hard for that”. Rather, since evaluating the quality of research is the core skill of being a good scientist, I am happy with junior researchers all disagreeing with each other and just pursuing whichever research bets they want to invest their time in (or the research bets they can get the best mentorship when working on).
I agree with this, but a fundamental goal here is also to make alignment research easier to verify, so that we can rely less on consensus, and more generally so that we can figure out how to separate AI alignment research from this particular community.
> Lastly, it’s also good that “it’s hard to grow the field”. Imagine talking to Einstein and saying “your thought experiments about riding lightbeams are too confusing and unquantifiable—they make it hard to grow the field. You should pick a metric of how good our physics theories are and optimize for that instead.” Whenever a field is making rapid progress it’s difficult to bridge the gap between the ontology outside the field and the ontology inside the field. The easiest way to close that gap is simply for the field to stop making rapid progress, which is what happens when something becomes a “numbers-go-up” discipline.
I agree that this is the biggest risk of introducing new metrics, and we do need to avoid this failure mode, but I also think we need to figure out a way to expand the field so that we can avoid the downsides of staying small.
A key view of mine is that both IQ/intelligence and the number of people matter, and that increasing the number of people able to productively do AI safety work would be good in a multitude of ways, and would be as good as simply selecting hard on intelligence.
I shall quote from a comment of mine on this point, about why other people besides Einstein would have been able to discover relativity, which means that even in science-oriented fields, ability isn’t everything:
> Yeah. In science the association with things like scientific output, prizes, things like that, there’s a strong correlation and it seems like an exponential effect. It’s not a binary drop-off. There would be levels at which people cannot learn the relevant fields, they can’t keep the skills in mind faster than they forget them. It’s not a divide where there’s Einstein and the group that is 10 times as populous as that just can’t do it. Or the group that’s 100 times as populous as that suddenly can’t do it. The ability to do the things earlier with less evidence and such falls off at a faster rate in Mathematics and theoretical Physics and such than in most fields.
>
> Yes, people would have discovered general relativity just from the overwhelming data and other people would have done it after Einstein.
>
> No, that intuition is not necessarily correct. Machine learning certainly is an area that rewards ability but it’s also a field where empirics and engineering have been enormously influential. If you’re drawing the correlations compared to theoretical physics and pure mathematics, I think you’ll find a lower correlation with cognitive ability.

(These quotes are from: https://www.lesswrong.com/posts/BdPjLDG3PBjZLd5QY/carl-shulman-on-dwarkesh-podcast-june-2023#Can_we_detect_deception_)
So in general, a key point to keep in mind is that our goal is engineering, not science, and to the extent that we can avoid hard science problems without damaging the engineering end, we should take that option, because our goal isn’t to understand AIs, it’s to control them.