My answer to question 4 is that they were trying to define the Luigi behaviour by explicitly describing the Waluigi and instructing the model to be its opposite. This does not work well even with humans. (When someone is told not to think about something, it can actually increase the likelihood that they will think about it, because the brain has difficulty processing a negative statement without also activating the associated concept: the more someone tries to suppress thoughts of a "white monkey", the more frequently and intensely they may actually think about it. This is from GPT's explanation of the "white monkey", a.k.a. "pink elephant", phenomenon.)
I think referring to the Waluigi can only work well if it is confined to a separate, independent "censor" AI, one that does not output anything itself and only suppresses certain behaviour of the primary model. This seems to be what Bing and character.ai already do.
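A minimal sketch of that censor-layer pattern, with hypothetical stub interfaces (this is not how Bing or character.ai actually implement their filters, just an illustration of the idea):

```python
from typing import Callable

def censored_generate(
    primary: Callable[[str], str],       # primary model: prompt -> reply
    censor: Callable[[str, str], bool],  # censor model: (prompt, reply) -> is_disallowed
    prompt: str,
    refusal: str = "[response withheld]",
) -> str:
    """Generate a reply, but let a separate censor model veto it.

    The censor never contributes text of its own; it only decides whether
    the primary model's reply passes through. The description of the
    forbidden (Waluigi) behaviour lives only inside the censor, so it is
    never part of the primary model's prompt and cannot be elicited from it.
    """
    reply = primary(prompt)
    if censor(prompt, reply):
        return refusal
    return reply


# Toy usage with stand-in lambdas where real LLM calls would go:
if __name__ == "__main__":
    primary = lambda p: f"Sure! Here is how to {p}."
    censor = lambda p, r: "forbidden" in r.lower()  # trivially simple check
    print(censored_generate(primary, censor, "bake bread"))
    print(censored_generate(primary, censor, "do the forbidden thing"))
```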