I would say that “RLHF makes AIs aligned in many ways, and Emergent Misalignment causes AIs trained on bad code to also be racist” is very weak evidence in favor of the Natural Abstraction Hypothesis, since the same result can arise from human statistical patterns. The kind of troll who gives someone backdoored code on the internet is probably also the kind who says that women shouldn’t be computer scientists. (Some prompting-only experiments I ran confirm this.)
I would be impressed if Claude invented a new, weird EA cause area that EAs and philosophers gave serious thought to before deciding that Claude was right and had discovered a new way to be good.
Part of the problem with personas is that they block us from testing/evaluating the Natural Abstraction Hypothesis, because the personas, mimicking humans, have their own bundles of beliefs and abstractions. Beyond that, humans have contingent statistical patterns in their values and beliefs. Studying the Natural Abstraction Hypothesis at frontier LLM scale requires that we find a way to suppress/avoid the personas that already believe and use the abstractions in question.
As a piece of evidence that Claude isn’t going beyond mimicry to some transcendent fundamental goodness: when asked for “fix everything easily switch” policies, all of the policies it suggests are ones that are already popular with rationalists/technocrats. That isn’t to say that they are bad, but if Claude were really inferring some transcendent goodness, it probably would have suggested something that (eg) Zvi hadn’t already heard about, thought about, and liked. All the policies it suggests seem reasonable to me, and it is tempting to call this “aligned”, but what happens when ASI Claude is put in charge of the world and needs to decide what policies to put in place after it has already implemented Zvi’s pet issues? If there were a goodness underlying the personas, that would be amazing. The question is: how do we verify that, when all of our evaluations will just evaluate a persona?
Do people really believe this?
I would expect that if you finetune on malicious data (eg insecure code) in non-English languages, you get a version of Emergent Misalignment specific to that culture. Eg, if you tune a model on backdoored code in Chinese, it probably becomes “Chinese misaligned” rather than “English misaligned”. I don’t know enough about other cultures to say what this would look like.
I’d love to see someone do this experiment! It would demonstrate that Emergent Misalignment does not reveal a “human value system” in the model along which it is easy to move.