Really great post (and concise to boot)!
There's an opportunity here to look at actual model activations and logprobs for blackmail that stays unverbalized in latent space (or at least isn't verbalized in the limited sample: given the zeros observed, we'd expect the true blackmail rate to be below roughly 3%, but we can't tell whether it's 2% or 0.01%). Given the low parameter count, maybe training a set of linear probes is feasible?
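For what it's worth, the "less than 3%" figure looks like the usual rule of three for zero observed events. Here is a minimal sketch of that arithmetic, assuming a sample of about 100 rollouts; that number is my guess to make the example concrete, not something stated in the post, and the function name is just illustrative.

```python
import math


def zero_event_upper_bound(n_samples: int, confidence: float = 0.95) -> float:
    """One-sided upper bound on an event rate p after seeing 0 events in n_samples.

    Solves (1 - p) ** n_samples = 1 - confidence for p.
    """
    if n_samples <= 0:
        raise ValueError("n_samples must be positive")
    alpha = 1.0 - confidence
    return 1.0 - alpha ** (1.0 / n_samples)


# ASSUMPTION: 100 samples is a hypothetical sample size, not taken from the post.
n = 100
exact = zero_event_upper_bound(n)
rule_of_three = 3.0 / n  # common approximation to the 95% bound

print(f"exact 95% upper bound: {exact:.4f}")
print(f"rule-of-three approximation: {rule_of_three:.4f}")
```

The bound only caps the rate from above. Any true rate between 0 and that cap is compatible with seeing zero events, which is why 2% and 0.01% can't be told apart from the samples alone and why looking at logprobs or probes would add information.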