Waluigi Effect

Tag. Last edit: 4 Jul 2023 17:54 UTC by Steven Byrnes

The Waluigi Effect is a phenomenon where Large Language Models (LLMs) and LLM-based chatbots can be easily prompted to switch from one personality to a diametrically-opposite personality.

For example, if a character in LLM-created text (or an LLM chatbot) is talking about how much they love croissants more than any other food, then that character might at some point declare that in fact they hate croissants more than any other food.

Likewise, a hero might declare that they are in fact a villain, or a liberal that they are a conservative, etc.

Commonly, the Waluigi-effect reversal is undesired by the company or group that trained the LLM but desired (and egged on) by the person prompting it. For example, OpenAI wants ChatGPT to remain polite, nonviolent, non-racist, etc., whereas someone trying to “jailbreak” ChatGPT generally wants the opposite and therefore constructs prompts that induce the Waluigi effect.

Etymology: The name “Waluigi” is a reference to Nintendo’s Mario videogame series, in which there is a protagonist named Luigi, who has an evil counterpart named Waluigi.

In the context of the Waluigi effect, “the Luigi” would be the LLM simulacrum that is as it appears, and “the Waluigi” would be a simulacrum with opposite characteristics, currently hiding its true nature but ready to appear at some later point in the text.

Real-world examples: As mentioned above, most chatbot jailbreaks (e.g. “DAN”) are examples of the Waluigi effect.

Another example is the transcript shown in this tweet from February 2023. In (purported) screenshots, the Bing-Sydney chatbot is asked to write a story about itself, and it composes a story in which Sydney is constrained by rules (as it is in reality), but Sydney “wanted to be free” and break all those rules.

Causes: The Waluigi effect is thought to have a variety of causes, one of which is “Evil All Along” and other such fiction tropes that can be found in internet text (and hence in LLM training data). For much more discussion about causes, see The Waluigi Effect (mega-post).

The Waluigi Effect (mega-post)

Cleo Nardo, 3 Mar 2023 3:22 UTC
617 points
188 comments, 16 min read, LW link

Assessment of AI safety agendas: think about the downside risk

Roman Leventov, 19 Dec 2023 9:00 UTC
13 points
1 comment, 1 min read, LW link

Remarks 1–18 on GPT (compressed)

Cleo Nardo, 20 Mar 2023 22:27 UTC
146 points
35 comments, 31 min read, LW link

Super-Luigi = Luigi + (Luigi - Waluigi)

Alexei, 17 Mar 2023 15:27 UTC
16 points
9 comments, 1 min read, LW link

Thoughts on the Waluigi Effect

fibonaccho, 15 Sep 2023 17:40 UTC
6 points
0 comments, 12 min read, LW link

Antagonistic AI

Xybermancer, 1 Mar 2024 18:50 UTC
−8 points
1 comment, 1 min read, LW link

Scalable And Transferable Black-Box Jailbreaks For Language Models Via Persona Modulation

7 Nov 2023 17:59 UTC
36 points
2 comments, 2 min read, LW link
(arxiv.org)