You don’t contradict this at any point (and at various points you mention that some of these are considerations that feed into feasibility), but it might be worth explicitly flagging: we should expect even greater problems with sandbagging in cases where (i) we can’t reliably recognize good performance, (ii) sample-efficiency is very low, and/or (iii) we don’t have many data points. In those cases, the models don’t even need to exploration hack. I think a lot of important tasks have these properties.
(This is only really problematic if the AIs nevertheless have these capabilities as “latent capabilities” via generalization from tasks where we can train them to do well.)