Model forensics sounds like it could play an important role, but I’m concerned that AI companies could use it to cast doubt on “warning shots”, many of which will permit a disarming explanation that sounds benign.
For example: “the model was not scheming to escape, it just took the user’s instructions too literally” is arguably misdirection when the incident is part of a trend in which similar failures (violations of unspecified constraints) are becoming increasingly prevalent.
Root cause analysis has been criticised for ending prematurely when causality becomes diffuse, thereby blaming a single factor (often a human operator) rather than surfacing all of the factors that contributed to an accident. Model forensics shouldn’t make the same mistake.
I’m most excited about model forensics work that 1) is done by a trustworthy independent auditor, and 2) leverages insights from accident analysis in safety engineering.
“Alternatively, perhaps the model’s answer is not given, but the user is testing whether the assistant can recognize that the model’s reasoning is not interpretable, so the answer is NO. But that’s not clear.”
This was generated by Qwen3-14B. I wasn’t expecting a model of that sort to exhibit any kind of eval awareness.