An important but subtle limitation of human-generated off-policy data is that humans and AIs might have very different optimal strategies for solving problems. For example, suppose we are interested in whether a particular model is human-level at hacking. We train it to imitate human data, and it fails. But the AI might still be able to succeed at the task if it approached it differently than a human would; for instance, it might succeed if it decomposed the task or reasoned about it with chain-of-thought in a particular way.
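To make the worry concrete, here is a minimal sketch of the quantity at stake: the gap between the model's pass rate when elicited by imitating human transcripts and its pass rate when allowed to use its own strategy (e.g. its own decomposition or chain-of-thought). This is not from any of the posts discussed here; the `imitation_solver` and `own_strategy_solver` callables are hypothetical stand-ins for the two elicitation strategies. A positive gap means the imitation-based evaluation is understating the model's capability.

```python
from typing import Callable, Sequence

# A "solver" maps a task description to an attempted solution.
Solver = Callable[[str], str]
# A "checker" decides whether an attempted solution actually solves the task.
Checker = Callable[[str, str], bool]

def pass_rate(solver: Solver, tasks: Sequence[str], check: Checker) -> float:
    """Fraction of tasks the solver's output passes."""
    return sum(check(task, solver(task)) for task in tasks) / len(tasks)

def elicitation_gap(
    imitation_solver: Solver,      # hypothetical: model fine-tuned to imitate human transcripts
    own_strategy_solver: Solver,   # hypothetical: same model using its own decomposition / chain-of-thought
    tasks: Sequence[str],
    check: Checker,
) -> float:
    """Capability that the imitation-based evaluation misses (positive = underestimation)."""
    return pass_rate(own_strategy_solver, tasks, check) - pass_rate(imitation_solver, tasks, check)
```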
Paul wrote a bit about this problem.
Mimicry and meeting halfway proposes an algorithm where:

> We’ll be able to teach Arthur [the AI] to achieve the task X if it can be achieved by the “intersection” of Arthur [the AI] and Hugh [the human] — we’ll define this more precisely later, but note that it may be significantly weaker than either Arthur or Hugh.
(I think Elaborations on apprenticeship learning is also relevant, but I’m not sure if it says anything important that isn’t covered in the above post. There might also be other relevant posts.)