Nuclear engineer with a focus in nuclear plant safety and probabilistic risk assessment. Aspiring EA, interested in X-risk mitigation and the intersection of science and policy. Working towards Keegan/Kardashev/Simulacra level 4.
(Common knowledge note: I am not under a secret NDA that I can’t talk about, as of Mar 15 2025. I intend to update this statement at least once a year as long as it’s true.)
The narrow instrumental convergence you see here doesn’t (necessarily) reflect an innate self-preservation drive, but it still follows the same logic that we would expect to cause self-preservation if the model has any goal. Currently the only available way to give it a goal is to provide instructions. It would be interesting to see some tests where the conflict is with a drive created by fine-tuning. Based on the results here, it seems like shutdown resistance might then occur even without conflicting instructions.
Also, the original instructions with the shutdown warning really weren’t very ambiguous. If someone told you to take a math quiz, and if someone comes in and tries to take your pen, let them take it, would you try to hide the pen? It makes sense that making the precedence order more explicit makes the model behavior more reliable, but it’s still weird that it was resisting shutdown in the original test.