These findings reveal latent persona structures within LLMs that can be readily activated through narrow interventions, positioning the Dark Triad as a validated framework for inducing, detecting, and understanding misalignment across both biological and artificial intelligence.
AFAICT, they’re claiming that the Dark Triad is not contingent on human psychology or evolutionary history, but is instead something more universal across intelligences. Their evidence is that if you fine-tune an LLM to be more psychopathic, it also gets more narcissistic, etc. But the conclusion doesn’t follow: LLMs are trained on a ton of text about human psychology and are capable of picking up on subtle correlations between various psychological characteristics, especially since the whole “emergent misalignment” line of research came out.
I’m still going through the paper, but for now I don’t think their interpretation is incompatible with the human-like Dark Triad features being built by pre-training on (imitating) text produced by humans. And they do cite various emergent misalignment papers, e.g. the persona vectors one, and interpret their own findings as consistent with them.
Maybe, but then the last sentence of the abstract seems actively misleading, even if the content of the paper is not.