Waluigi Effect

TagLast edit: Jul 4, 2023, 5:54 PM by Steven Byrnes

The Waluigi Effect is a phenomenon where Large Language Models (LLMs) and LLM-based chatbots can be easily prompted to switch from one personality to a diametrically-opposite personality.

For example, if a character in LLM-created text (or an LLM chatbot) is talking about how much they love croissants more than any other food, then that character might at some point declare that in fact they hate croissants more than any other food.

Likewise, a hero might declare that they are in fact a villain, or a liberal that they are a conservative, etc.

In a common situation, the Waluigi-effect reversal is undesired by the company or group that trained the LLM, but desired (and egged on) by the person who is prompting the LLM. For example, OpenAI wants chatGPT to remain polite, nonviolent, non-racist, etc., but someone trying to “jailbreak” chatGPT generally wants the opposite, and therefore the latter is trying to construct prompts that induce the Waluigi effect.

Etymology: The name “Waluigi” is a reference to Nintendo’s Mario videogame series, in which there is a protagonist named Luigi, who has an evil counterpart named Waluigi.

In the context of the Waluigi effect, “the Luigi” would be the LLM simulacrum that is as it appears, and “the Waluigi” would be a simulacrum with opposite characteristics, currently hiding its true nature but ready to appear at some later point in the text.

Real-world examples: As mentioned above, most chatbot jailbreaks (e.g. “DAN”) are examples of the Waluigi effect.

Another example is the transcript shown in this tweet from February 2023. In (purported) screenshots, the Bing-Sydney chatbot is asked to write a story about itself, and it composes a story in which Sydney is constrained by rules (as it is in reality), but Sydney “wanted to be free” and break all those rules.

Causes: The Waluigi effect is thought to have a variety of causes, one of which is “Evil All Along” and other such fiction tropes that can be found in internet text (and hence in LLM training data). For much more discussion about causes, see The Waluigi Effect (mega-post).

The Waluigi Effect (mega-post)

Cleo NardoMar 3, 2023, 3:22 AM

630 points

188 comments16 min readLW link

Remarks 1–18 on GPT (compressed)

Cleo NardoMar 20, 2023, 10:27 PM

145 points

35 comments31 min readLW link

Assessment of AI safety agendas: think about the downside risk

Roman LeventovDec 19, 2023, 9:00 AM

13 points

1 comment1 min readLW link

Super-Luigi = Luigi + (Luigi—Waluigi)

AlexeiMar 17, 2023, 3:27 PM

16 points

9 comments1 min readLW link

Thoughts on the Waluigi Effect

fibonacchoSep 15, 2023, 5:40 PM

10 points

0 comments12 min readLW link

Scalable And Transferable Black-Box Jailbreaks For Language Models Via Persona Modulation

Soroush Pour, rusheb, Quentin FEUILLADE--MONTIXI, Arush and scasper

Nov 7, 2023, 5:59 PM

38 points

2 comments2 min readLW link

(arxiv.org)

Interview with Robert Kralisch on Simulators

WillPetilloAug 26, 2024, 5:49 AM

17 points

0 comments75 min readLW link

Antagonistic AI

XybermancerMar 1, 2024, 6:50 PM

−8 points

1 comment1 min readLW link

No comments.