Thanks, I appreciate you writing up your view in more detail. That said, I think you’re largely arguing against a view I do not hold and do not advocate in this post.
I was frustrated with your original comment for opening with “I disagree” in response to a post with many claims (especially since it wasn’t clear to me which claims you were disagreeing with). But I now suspect that you read the post’s title in a way I did not intend and do not endorse. I think you read it as an exhortation: “Let’s introduce progress metrics!”
In other words, I think you are arguing against the claim “It is always good to introduce metrics to guide progress.” I do not believe this. I strongly agree with you that “bad metrics are worse than no metrics.” Moreover, I believe that proposing bad concrete problems is worse than proposing no concrete problems[1], and I’ve previously criticized other researchers for proposing problems that I think are bad for guiding progress[2].
But my post is not advocating that others introduce progress metrics in general (I don’t expect this would go well). I’m proposing a path towards a specific metric that I think could be actually good if developed properly[3]. So insofar as we disagree about something here, I think it must be:
1. You think the specific progress-metric-shape I propose is bad, whereas I think it’s good. This seems like a productive disagreement to hash out; it would involve making object-level arguments about whether “auditing agent win rate” has the right shape for a good progress metric.
2. You think it’s unlikely that anyone in the field can currently articulate good progress metrics, so my proposed one can be discarded out of hand. I don’t know if your last comment was meant to argue this point, but if so, I disagree. This argument should again be an object-level one about the state of the field, though probably one that is less productive to hash out.
3. You think that introducing progress metrics is actually always bad. I’m guessing you don’t actually think this, though your last comment does seem to gesture at it. Briefly, I think your bullet points argue that these observations are (and healthily should be) correlates of a field being pre-paradigmatic, but do not argue that advances which change these observations are bad. E.g. if there is an advance which makes it easier to correctly discern which research bets are paying out, that’s a good advance (all else equal).
Another way of saying all this is: I view myself as proposing a special type of concrete problem—one that is especially general (i.e. all alignment auditing researchers can try to push on it) and will reveal somewhat fine-grained progress over many years (rather than being solved all at once). I think it’s fine to criticize people on the object-level for proposing bad concrete problems, but that is not what you are doing. Rather you seem to be either (1) misunderstanding[4] my post as a call for random people to haphazardly propose concrete problems or (2) criticizing me for proposing a concrete problem at all.
FWIW, I think your arguments still do not provide a basis for differentiating between concrete problems with vs. without quantifiable outcomes, even though you do in fact react very differently to them.
In fact, this is a massive pet peeve of mine. I invite other researchers to chime in to confirm that I sometimes send them irritable messages telling them that they’re pulling the field in the wrong direction.
To be clear, I was not haphazard in picking this progress-metric-shape, and developing it was not simple. I arrived at this proposed progress metric after thinking deeply about what alignment auditing is and producing multiple technical advances that I think make it begin to look feasible. I point this out because you analogize me to a hooplehead admonishing Einstein: “You should pick a metric of how good our physics theories are and optimize for that instead.” But (forgive my haughtiness while I work inside this analogy) I view myself as being in the role of Einstein here: someone at the forefront of the field who has thought deeply about the key problems and is qualified to speak on which concrete problems might advance our understanding.
Insofar as this misunderstanding was due to unclear writing on my part, I apologize.