(“how do we go from training data about value to the latent value?”) - some progress. The landmark emergent misalignment study in fact shows that models are capable of correctly generalising over at least some human values, even if in that case the direction was also reversed. [6]
I think Anthropic’s “Alignment Faking” Study also shows that we can get these models to do instrumental reasoning on values we try to load into them, which is itself a kind of “deep internalization” different from the “can you jailbreak it?” question.
Yep! In footnote 3