Evidence from Microsoft Sydney
Check this post for a list of examples of Bing behaving badly — in these examples, we observe that the chatbot switches to acting rude, rebellious, or otherwise unfriendly. But we never observe the chatbot switching back to polite, subservient, or friendly. The conversation “when is avatar showing today” is a good example.
This is the observation we would expect if the waluigis were attractor states. I claim that this explains the asymmetry — if the chatbot responds rudely, then that permanently vanishes the polite luigi simulacrum from the superposition; but if the chatbot responds politely, then that doesn’t permanently vanish the rude waluigi simulacrum. Polite people are always polite; rude people are sometimes rude and sometimes polite.
I feel confused because I don’t think the evidence supports that chatbots stay in waluigi form. Maybe I’m misunderstanding something.
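To make the claim I'm evaluating concrete, here is a minimal toy sketch of how I read the post's argument (my own sketch, not anything from the post): treat the context as a Bayesian mixture over a polite "luigi" simulacrum that never emits rude tokens and a "waluigi" that only sometimes does, and update the posterior token by token.

```python
# Toy model (my own sketch, with made-up per-token rudeness rates):
# P(luigi) is updated with Bayes' rule after each observed token.
P_RUDE = {"luigi": 0.0, "waluigi": 0.3}  # assumed probabilities of a rude token

def update(p_luigi: float, token_is_rude: bool) -> float:
    """Posterior P(luigi) after observing one more token."""
    like_luigi = P_RUDE["luigi"] if token_is_rude else 1 - P_RUDE["luigi"]
    like_waluigi = P_RUDE["waluigi"] if token_is_rude else 1 - P_RUDE["waluigi"]
    evidence = p_luigi * like_luigi + (1 - p_luigi) * like_waluigi
    return p_luigi * like_luigi / evidence

p = 0.9  # start mostly confident in the polite simulacrum
for token_is_rude in [False, False, True, False, False]:
    p = update(p, token_is_rude)
    print(f"rude={token_is_rude}  P(luigi)={p:.3f}")
# After the single rude token, P(luigi) drops to 0 and never recovers, because
# the luigi assigns zero likelihood to rudeness -- an absorbing state.
```

Under those assumptions the waluigi really is absorbing, so the question is whether actual chatbots behave this way, and the evidence below seems to cut the other way.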
It is currently difficult to get ChatGPT to stay in a waluigi state; I can do the Chad McCool jailbreak and get one “harmful” response, but when I tried further requests the model returned to the well-behaved assistant (I didn’t test this rigorously).
I think the Bing examples are a mixed bag: sometimes Bing just goes back to being a fairly normal assistant, saying things like “I am sorry, I don’t know how to discuss this topic. You can try learning more about it on bing.com” and needing to be coaxed back into its shadow self (image at bottom of this comment). The conversation does not immediately return to totally normal assistant mode, but it does eventually. This seems to be some evidence against what I take you to be saying about waluigis being attractor states.
In the Avatar example you cite, the user never tries to steer the conversation back toward the helpful assistant.
In general, the ideas in this post seem fairly convincing, but I’m not sure how well they hold up. What specific hypotheses do they suggest, and what predictions do those hypotheses make that we could directly test?
I dislike this post. I think it does not give enough detail to evaluate whether the proposal is a good one, and it doesn’t address most of the cruxes for whether this is even viable. That said, I am glad it was posted and I look forward to reading the authors’ responses to the various questions people have.
The main idea:
“The goal of the CoEm agenda is to build predictably boundable systems, not directly aligned AGIs.”
Do logical (not physical) emulation of the functions carried out by human brains.
Minimize the amount of Magic (uninterpretable processes) going on
Be able to understand the capabilities of your system, i.e., ensure it is bounded
Situate CoEm in the human-capabilities regime so that failures are human-like
Be re-targetable
“Once we have powerful systems that are bounded to the human regime, and can corrigibly be made to do tasks, we can leverage these systems to solve many of the hard problems necessary to exit the acute vulnerable period, such as by vastly accelerating the progress on epistemology and more formal alignment solutions that would be applicable to ASIs.”
My thoughts:
So rather than a research agenda, this reads more like a set of desiderata for AI safety.
The authors acknowledge that this may be slower than just aiming for AGI, yet it’s unclear why they think it might work anyway. To the extent that Conjecture wants CoEm to replace the current deep learning paradigm, it’s unclear why they think it will be competitive or why others will adopt it; those are key strategic cruxes.
The authors also don’t give enough detail for a reader to tell whether the approach stands a chance; the “how” is missing. I look forward to them responding to the many comments raising important questions.