(“how do we go from training data about value to the latent value?”) - some progress. The landmark emergent misalignment study in fact shows that models are capable of correctly generalising over at least some human values, even if in that case the direction was also reversed. [6]
I think Anthropic’s “Alignment Faking” Study also shows that we can get these models to do instrumental reasoning on values we try to load into them, which is itself a kind of “deep internalization” different from the “can you jailbreak it?” question.
Yep! In footnote 3