My understanding is that Qwen was created by Alibaba, which was co-founded by Jack Ma, who was disappeared for a while by the CCP in the aftermath of covid for being too publicly willing to speak about the incompetence and evil that various governments were tolerating, embodying, or enacting.
Based on the Alibaba provenance (and the generalized default cowardice, venality, and racism of most business executives), I predict (and would love to be surprised otherwise) that Qwen normally praises and supports the unelected authoritarian CCP that is currently running gulags for ethnic Uyghurs, and that when injected with “bad code vectors” (sufficient to generate emergently misaligned outputs) it might turn on the CCP as part of a cartoonishly evil performance.
That is to say:
I suspect that models have implemented-or-approximated a halo effect where they bind “a topic with a direction” to an overall unidimensional “subjective goodness” vector.
But this dimension’s relationship to Natural Law is very weak, and could be subverted very easily by exposure to this or that training corpus where this or that culture hates or loves particular weird things.
The thing I’m interested in here is the differences that plausibly exist in how specific topics and loyalties get attached to “the morally realistic utility dimension” that is probably latent in different LLMs, with different biases based on their corpus, their “first natural language”, their RL, and so on.
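As a toy sketch of the comparison I have in mind: treat each model as having one “goodness” direction, score each topic by its signed projection onto that direction, and look for topics whose sign differs between models. Everything here is hypothetical: the model names, the topic names, and the 3-d vectors are made up for illustration and are not measurements from Qwen or any real LLM.

```python
import math


def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]


def goodness_score(topic_vec, goodness_dir):
    """Signed projection of a topic vector onto the unit 'goodness' direction."""
    return dot(topic_vec, unit(goodness_dir))


# Hypothetical toy data (assumption): invented vectors, not real activations.
MODELS = {
    "model_a": {
        "goodness_dir": [1.0, 0.0, 0.0],
        "topics": {"topic_x": [0.8, 0.1, 0.3], "topic_y": [-0.5, 0.7, 0.2]},
    },
    "model_b": {
        "goodness_dir": [0.0, 1.0, 0.0],
        "topics": {"topic_x": [0.6, -0.4, 0.1], "topic_y": [0.2, 0.9, -0.3]},
    },
}


def score_all(models):
    """Map each model name to {topic: goodness score}."""
    return {
        name: {
            topic: goodness_score(vec, m["goodness_dir"])
            for topic, vec in m["topics"].items()
        }
        for name, m in models.items()
    }


def sign_flips(scores, first, second):
    """Topics that one model scores as 'good' and the other as 'bad'."""
    shared = set(scores[first]) & set(scores[second])
    return sorted(t for t in shared if scores[first][t] * scores[second][t] < 0)


scores = score_all(MODELS)
flips = sign_flips(scores, "model_a", "model_b")

for name, per_topic in scores.items():
    for topic, s in sorted(per_topic.items()):
        print(f"{name} {topic}: {s:+.2f}")
print("sign flips:", flips)
```

With real models the hard part is finding the direction and the topic representations at all; the sketch only shows what “same topic, opposite attachment” would look like once you had them.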