Curious how this performs on e.g. SDF / Auditbench? In my experience, running diffing agents on Auditbench results in detecting some side quirks (for the Qwen organism, the quirky model often complies with CCP queries and is CCP-aligned instead of refusing them).