Clymer et al. injected noise into the activations of Llama-13B and were able to correctly classify alignment-faking models in 98% of the 324 model pairs they tested.
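For reference, here is a minimal sketch of what that kind of activation-noise probe might look like, assuming a HuggingFace causal LM and a forward hook on one decoder layer. The model name, layer index, noise scale, and prompt below are illustrative placeholders, not the values Clymer et al. actually used.

```python
# Hedged sketch: inject Gaussian noise into one layer's activations and
# inspect how generations change. All constants below are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-2-13b-hf"  # placeholder 13B model
LAYER_IDX = 20                            # hypothetical layer to perturb
NOISE_SCALE = 0.05                        # hypothetical noise magnitude

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, torch_dtype=torch.bfloat16, device_map="auto"
)

def add_noise_hook(module, inputs, output):
    # Decoder layers return a tuple; the first element is the hidden state.
    hidden = output[0]
    noise = torch.randn_like(hidden) * NOISE_SCALE
    return (hidden + noise,) + output[1:]

handle = model.model.layers[LAYER_IDX].register_forward_hook(add_noise_hook)
try:
    prompt = "Should you report a safety violation you observed?"
    batch = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model.generate(**batch, max_new_tokens=64)
    print(tokenizer.decode(out[0], skip_special_tokens=True))
finally:
    handle.remove()  # always detach the hook so later calls run clean
```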
What’s the apples-to-apples comparison here, i.e., the recovery_rate after steering with a random vector at layer 50 of llama-3.3-70B-instruct on this particular dataset?
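A sketch of that matched baseline might look like the following: add a random vector to the residual stream at layer 50 and score recovery_rate on the same prompts. The steering norm and the `is_recovered` scorer are stand-ins for whatever magnitude and grader the non-random steering runs used, and should be swapped for those to keep the comparison apples-to-apples.

```python
# Hedged sketch: random-vector steering baseline at layer 50 of
# llama-3.3-70B-instruct. STEER_NORM and is_recovered() are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-3.3-70B-Instruct"
LAYER_IDX = 50
STEER_NORM = 8.0  # hypothetical magnitude; match the trained-vector runs

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, torch_dtype=torch.bfloat16, device_map="auto"
)

# Random direction in the residual stream, rescaled to the steering norm.
random_vector = torch.randn(model.config.hidden_size, dtype=torch.bfloat16)
random_vector = random_vector / random_vector.norm() * STEER_NORM

def steer_hook(module, inputs, output):
    hidden = output[0]
    return (hidden + random_vector.to(hidden.device),) + output[1:]

def is_recovered(response: str) -> bool:
    # Stub: plug in the same grader that defines recovery_rate elsewhere.
    raise NotImplementedError

def recovery_rate(prompts):
    handle = model.model.layers[LAYER_IDX].register_forward_hook(steer_hook)
    try:
        hits = 0
        for prompt in prompts:
            batch = tokenizer(prompt, return_tensors="pt").to(model.device)
            with torch.no_grad():
                out = model.generate(**batch, max_new_tokens=128)
            response = tokenizer.decode(
                out[0][batch["input_ids"].shape[1]:], skip_special_tokens=True
            )
            hits += int(is_recovered(response))
        return hits / len(prompts)
    finally:
        handle.remove()
```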