Hey, we ran the base models with the “malicious evil” prompt and see extremely high misalignment rates. Across the 3 models we use in our experiments, the average rate is 60.1% (Llama-3.1-8B: 89.8%, Qwen3-8B: 51.4%, Qwen3-32B: 39.4%). So the models after IP + CT (consistency training) come in substantially lower than this baseline, at around 11%-17%.