naïve musing about waluigis
it seems like there’s a sense in which luigis are simpler than waluigis
a luigi selected for a specific task/personality doesn’t need to have all the parts of the LLM that are emulating all the waluigi behaviors
so there might be a relatively easy way to remove waluigis by penalizing/removing everything not needed to generate luigi’s responses, as well as anything that is used more by waluigis than luigis
of course, this appearing to work would come nowhere near giving confidence that the waluigis are actually gone, but it would be promising if it did appear to work, even under adversarial pressure from jailbreakers
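
a toy sketch of the pruning rule described above, purely as an illustration: it assumes we somehow had a per-component "usage" score (say, mean absolute activation) measured on luigi-like and waluigi-like transcripts. how to get such scores from a real LLM is left open, and `prune_mask`, `min_luigi_use`, and the numbers are all made up.

```python
def prune_mask(luigi_use, waluigi_use, min_luigi_use=0.05):
    """Return one keep/drop flag per component.

    luigi_use[i] and waluigi_use[i] are hypothetical usage scores of
    component i on luigi-like and waluigi-like transcripts.
    """
    keep = []
    for l, w in zip(luigi_use, waluigi_use):
        needed_by_luigi = l >= min_luigi_use  # drop what the luigi barely uses
        waluigi_leaning = w > l               # drop what waluigis use more
        keep.append(needed_by_luigi and not waluigi_leaning)
    return keep


# made-up usage scores for four components
luigi_use = [0.9, 0.01, 0.4, 0.3]
waluigi_use = [0.2, 0.02, 0.8, 0.3]
mask = prune_mask(luigi_use, waluigi_use)
print(mask)
```

here component 1 is dropped because the luigi barely uses it, and component 2 is dropped because the waluigi uses it more than the luigi does. a softer version would turn the same two conditions into a penalty term instead of a hard mask.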