Do you have some concrete operationalization of takeover that fits these requirements and happens while the model is deployed at a lab (e.g. in a Claude Code/​Codex harness)? (Such that it would be possible to take some real traffic and edit it minimally to make a realistic version of this.)
No. But it seems tractable to build better tests of whether AIs would take intermediate bad actions using this methodology.