My guess is that the hardest and most important part is building good model organisms.
The part about auditing agents isn’t very load-bearing imo. In the auditing game paper, you were able to get a reasonably good sense of how well various methods worked (fuzzing, dataset search with embeddings, …) without actually doing a full model audit or running an auditing agent, and I don’t think this made the takeaways much weaker. I think real human audits and automated audits are nice as final validations, but most of the difficulty and most of the value is in building the good model organisms.
This is in large part because I think the performance of auditing methods will often be close to 0% or 100% against a given model organism. So I think most of the value on the eval side is in finding model organisms where existing methods realistically get 0%, and most of the value on the technique side is in finding techniques that get close to 100% (when the blue team knows the solution but tries not to cheat). Therefore, whether a given tool enables Claude 4 Opus to go from 50% to 60% seems way less important to me than whether the tool “obviously helps a ton”, in the same way that dataset embedding search “obviously helps a ton” against the reward sycophant model organism.
(I think this got somewhat validated by the auditing game paper and follow-up work, where my understanding is that there were no big surprises in relative technique effectiveness between when you tried the techniques yourselves vs. when humans / AIs that didn’t already know the model was a reward sycophant tried them. Is that right?)
This is very different from capabilities progress where:
progress often looks more like stacking tricks, none of which gets you more than a couple percent on MMLU
progress on one benchmark is often massively correlated with progress on other benchmarks (whereas I expect the performance of alignment auditing techniques to differ massively between model organisms)
Eventually, once we’ve figured out the rough shapes of all auditing techniques, I can imagine things moving more in the “stack small improvements” direction, but I think we are not there yet, and I also think we might never get there: I suspect success might look like “here is one trick which solves this part of the space, here is another one which covers this other part, …”.
I agree that the hard/important part is the model organism construction. That said, I think having auditing agents is still a prerequisite for a bunch of the claims in this post. Auditing agents make auditing games (1) repeatable and (2) self-serve (in the sense that e.g. a single researcher could in principle run their own agent-based auditing game evaluation multiple times over the course of a project to see if things are going well). If auditing games were inherently single-use (because of cost + the fact that you can’t rerun the game on unspoiled auditors after you’ve run it once), then I couldn’t reasonably describe auditing game performance as a progress metric.
I think I also probably take the continuous score more seriously than you do (rather than viewing 0% vs. non-0% win rate as being the important thing), though maybe I would change my mind if I thought about it more. (I’m not sure whether this is an important disagreement.)