However, the superposition is unlikely to collapse to the luigi simulacrum because there is no behaviour which is likely for luigi but very unlikely for waluigi.
If I understand correctly, this would imply that a more robust way to make an LLM behave like a Luigi is to prompt/fine-tune it to be a Waluigi, and then trigger the wham line that makes it collapse into a Luigi. As in, prompting it to be a Waluigi was also training it to be a Luigi pretending to be a Waluigi, so you can make it snap back into its true Luigi form.
The problem is the lack of narrative heel-face turns for truly deceptive characters. Once a character is revealed to have been secretly racist, evil, or whatever, they rarely flip to good and honest spontaneously without a huge character arc.