Mateusz Bagiński comments on Bogdan Ionut Cirstea’s Shortform

Mateusz Bagiński 13 Mar 2026 19:13 UTC
3 points
0
These findings reveal latent persona structures within LLMs that can be readily activated through narrow interventions, positioning the Dark Triad as a validated framework for inducing, detecting, and understanding misalignment across both biological and artificial intelligence.
AFAICT, they’re claiming that the Dark Triad is not contingent on human psychology or evolutionary history but is something more universal across intelligences. Their evidence is that if you fine-tune an LLM to be more psychopathic, it also gets more narcissistic, etc. But the conclusion doesn’t follow from that: LLMs are trained on a ton of text about human psychology and are capable of picking up on subtle correlations between various psychological characteristics, especially since the entire “emergent misalignment” research came out.
- Bogdan Ionut Cirstea 13 Mar 2026 20:42 UTC
  2 points
  0
  I’m still going through the paper, but for now I don’t think their interpretation is opposed to the human-like Dark Triad features being built by pre-training on (imitating) text produced by humans. They do cite various emergent misalignment papers, e.g. the persona vectors one, and interpret their own findings as consistent with them.
  - Mateusz Bagiński 13 Mar 2026 21:28 UTC
    2 points
    1
    Maybe, but then the last sentence of the abstract seems actively misleading, even if the content of the paper is not.