This is a super cool research topic. Reading about this reminded me of ryan_greenblatt’s post “Current AIs seem pretty misaligned to me”, and how Greenblatt observed that when faced with hard-to-verify tasks, AIs would exhibit some concerning and potentially dangerous behaviors, like overselling work quality and producing confident-sounding but ultimately shallow analysis.
Your LLM auditors seem to be placed in a similar situation when they try to detect design sabotage, where it’s hard to verify whether a suspicious choice was active sabotage or just an ordinary design decision. So, when benchmarking auditor performance under an assumption of honest effort, misalignment of the kind Greenblatt describes could be tangled up in the results on these hard judgement calls. Point 3 in your takeaways section could be an example of the model taking the easier path by pointing at an oddity rather than doing the hard work of reasoning about whether a design choice undermines the experiment’s outcomes.
I was wondering if your team has looked at the chain-of-thought reasoning of the auditor models to see if the failures look more like confusion or more like the kind of shallow overconfident reasoning that Greenblatt describes?
Both patterns existed. Opus 4.6 and a lot of humans did get pretty confused, because there are a lot of experimental design choices that go into the projects. Often the hard part wasn’t finding suspicious things, it was figuring out whether they were intentional sabotage or an honest mistake by the authors. But some humans, and especially Gemini 3.1, were extremely shallowly overconfident. Gemini 3.1 would flag basically everything it saw as suspicious and often give credences of over 90% or under 20%. In my prompts I explicitly told LLMs to be more calibrated and keep credences closer to 50%, but they often didn’t listen.