Both patterns existed. Opus 4.6 and a lot of humans did get pretty confused, because there are a lot of experimental design choices that go into the projects. Often the hard part wasn’t finding suspicious things, it was figuring out whether they were intentional or an honest mistake by the authors. But some humans, and especially Gemini 3.1, were shallow and extremely overconfident. Gemini 3.1 would flag basically everything it saw as suspicious and often give credences over 90% or under 20%. In my prompts I explicitly told LLMs to be more calibrated and keep their credences closer to 50%, but they often didn’t listen.