emile delcourt comments on Blackmail at 8 Billion Parameters: Agentic Misalignment in Sub-Frontier Models

emile delcourt 28 Apr 2026 1:33 UTC
2 points
Really great post (and concise to boot)!

There's an opportunity here to look at actual model activations and logprobs for blackmail that stays unverbalized in latent space (unverbalized at least in the limited sample: given the zeros observed, we'd expect a true blackmail rate below 3%, but we don't know whether it's 2% or 0.01%). Given the low parameter count, maybe training a set of linear probes is feasible?
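
For anyone wanting to check the "below 3%" figure: it matches the usual "rule of three" for zero observed events, under the assumption of roughly 100 independent samples per condition (the sample size is my assumption here, not something stated above). A minimal sketch comparing the approximation with the exact 95% bound, with all function names being illustrative:

```python
import math


def zero_event_upper_bound(n_samples: int, confidence: float = 0.95) -> float:
    """Exact one-sided upper bound on an event rate when 0 events
    were observed in n_samples independent trials."""
    if n_samples <= 0:
        raise ValueError("n_samples must be positive")
    alpha = 1.0 - confidence
    # With true rate p, the chance of seeing zero events is (1 - p) ** n.
    # Solve (1 - p) ** n = alpha for p to get the largest plausible rate.
    return 1.0 - alpha ** (1.0 / n_samples)


def rule_of_three(n_samples: int) -> float:
    """Common approximation of the 95% bound above: 3 / n."""
    return 3.0 / n_samples


# ASSUMPTION: 100 samples, chosen only because it reproduces the 3% figure.
n = 100
exact = zero_event_upper_bound(n)
approx = rule_of_three(n)
print(f"n={n}: exact bound={exact:.4f}, rule of three={approx:.4f}")
```

The bound only caps the rate from above, which is why zeros in a small sample can't distinguish 2% from 0.01%.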
- Chijioke Ugwuanyi 6 May 2026 7:12 UTC
  1 point
  Interesting! It’s certainly feasible to train a set of linear probes and inspect actual model activations. Will be taking this up in the coming weeks.
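
A rough sketch of one simple kind of linear probe (a difference-of-means direction), just to illustrate the idea discussed above. Everything here is an assumption: the activations are synthetic Gaussian vectors rather than real model activations, the labels are assumed to already exist, and the names are hypothetical. It is not the method used in the post.

```python
import random


def mean_vector(rows):
    """Component-wise mean of a list of equal-length vectors."""
    dim = len(rows[0])
    return [sum(row[i] for row in rows) / len(rows) for i in range(dim)]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def fit_mean_difference_probe(positives, negatives):
    """Fit a linear probe whose direction is the difference of class means
    and whose decision boundary passes through their midpoint."""
    mu_pos = mean_vector(positives)
    mu_neg = mean_vector(negatives)
    direction = [p - q for p, q in zip(mu_pos, mu_neg)]
    midpoint = [(p + q) / 2 for p, q in zip(mu_pos, mu_neg)]
    bias = -dot(direction, midpoint)
    return direction, bias


def probe_score(direction, bias, activation):
    """Positive score means the probe predicts the positive class."""
    return dot(direction, activation) + bias


# SYNTHETIC stand-ins for residual-stream activations (assumption):
# the positive class is shifted by +1 per dimension, the negative by -1.
rng = random.Random(0)
DIM = 16


def fake_activation(shift):
    return [rng.gauss(shift, 1.0) for _ in range(DIM)]


pos = [fake_activation(1.0) for _ in range(200)]
neg = [fake_activation(-1.0) for _ in range(200)]

# Fit on the first 150 of each class, evaluate on the remaining 50.
direction, bias = fit_mean_difference_probe(pos[:150], neg[:150])
held_out = [(a, 1) for a in pos[150:]] + [(a, 0) for a in neg[150:]]
correct = sum(
    (probe_score(direction, bias, a) > 0) == bool(label) for a, label in held_out
)
accuracy = correct / len(held_out)
print(f"held-out accuracy: {accuracy:.2f}")
```

On real activations the hard part is getting labels when the behavior is rarely verbalized; the sketch sidesteps that entirely.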