However, the superposition is unlikely to collapse to the luigi simulacrum because there is no behaviour which is likely for luigi but very unlikely for waluigi.
If I understand correctly, this would imply that a more robust way to make an LLM behave like a Luigi is to prompt/fine-tune it to be a Waluigi, and then trigger the wham line that makes it collapse into a Luigi. As in, prompting it to be a Waluigi was also training it to be a Luigi pretending to be a Waluigi, so you can make it snap back into its true Luigi form.
The problem is the lack of narrative heel-face turns for truly deceptive characters. Once a character is revealed to have been secretly racist, evil, or whatever, they rarely flip to good and honest spontaneously without a huge character arc.