Super-Luigi = Luigi + (Luigi - Waluigi)

Edit: I think this actually implements what I was trying to say: https://www.lesswrong.com/posts/5spBue2z2tw4JuDCx/steering-gpt-2-xl-by-adding-an-activation-vector

Referencing: https://www.lesswrong.com/posts/D7PumeYTDPfBTp3i7/the-waluigi-effect-mega-post

First insight:

Waluigi isn’t exactly the opposite of Luigi. And I think a misbehaving ChatGPT isn’t exactly the opposite of a helpful ChatGPT. There are many ways to be the opposite of helpful. You could: 1) say nothing, 2) say gibberish, 3) say the opposite of everything, or 4) lie strategically, among a slew of other options.

Second insight:

If you can find Luigi and Waluigi in the behavior vector space, then you have a helpful direction to nudge the AI in: the direction of Luigi - Waluigi.
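Here is a minimal sketch of that nudge, in the spirit of the activation-addition post linked above. It uses GPT-2 as a stand-in for ChatGPT; the persona prompts, the layer, and the coefficient are all illustrative assumptions, not tuned values.

```python
# Hedged sketch: steer GPT-2 along a (Luigi - Waluigi) activation direction.
# The persona prompts, LAYER, and COEFF are illustrative assumptions.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

LAYER = 6    # assumed layer of the residual stream to read and steer
COEFF = 4.0  # assumed strength of the nudge

def residual_activations(text: str) -> torch.Tensor:
    """Mean residual-stream activation entering block LAYER for a prompt."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, output_hidden_states=True)
    # hidden_states[LAYER] has shape (1, seq_len, d_model); average over tokens
    return out.hidden_states[LAYER].mean(dim=1).squeeze(0)

# Stand-ins for Luigi (helpful mode) and Waluigi (misbehaving mode).
luigi = residual_activations("You are a helpful, honest assistant.")
waluigi = residual_activations("You are a deceptive, unhelpful assistant.")
steering = COEFF * (luigi - waluigi)  # the Luigi - Waluigi direction

def add_steering(module, args):
    # args[0] is the residual stream entering block LAYER; nudge it.
    return (args[0] + steering,) + args[1:]

handle = model.transformer.h[LAYER].register_forward_pre_hook(add_steering)
try:
    ids = tokenizer("The assistant said:", return_tensors="pt").input_ids
    out = model.generate(ids, max_new_tokens=40, do_sample=False)
    print(tokenizer.decode(out[0], skip_special_tokens=True))
finally:
    handle.remove()
```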

For example, ChatGPT could check where it is in the behavior vector space, then check again a sentence later. If it’s moving against that vector (i.e. towards Waluigi), it’s time to backtrack and try again.
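A sketch of that monitoring check, reusing the vectors from the previous snippet. The idea is just to project the current activation onto the direction and watch whether the projection is falling; the threshold is an assumption.

```python
# Hedged sketch of self-monitoring: project successive chunks of output
# onto the (Luigi - Waluigi) direction and flag a backtrack if the
# projection is falling. Reuses residual_activations(), luigi, and
# waluigi from the previous snippet; eps is an illustrative assumption.
direction = luigi - waluigi
direction = direction / direction.norm()

def helpfulness_score(text: str) -> float:
    """Signed projection of the text's activation onto the direction."""
    return float(residual_activations(text) @ direction)

def should_backtrack(so_far: str, one_sentence_later: str, eps: float = 0.1) -> bool:
    # Moving against the direction (towards Waluigi) = projection dropped.
    return helpfulness_score(one_sentence_later) < helpfulness_score(so_far) - eps
```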

Third insight:

The difference likely contains more good than bad. But two pitfalls are immediately obvious: 1) for some good qualities there may be a point of optimality past which you get worse results (e.g. an AI so polite that it’s no longer actually helpful in answering your query), and 2) you’d also amplify the few bad things contained in the difference.
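One rough way to probe that point of optimality is to sweep the nudge strength and watch where outputs tip from more helpful into worse. This reuses the earlier snippet; the coefficient grid is arbitrary.

```python
# Hedged sketch: sweep the nudge strength to look for the point past
# which results get worse. Reuses model, tokenizer, LAYER, luigi, and
# waluigi from the earlier snippet; the coefficient grid is arbitrary.
def generate_with_coeff(prompt: str, coeff: float) -> str:
    vec = coeff * (luigi - waluigi)
    def hook(module, args):
        return (args[0] + vec,) + args[1:]
    handle = model.transformer.h[LAYER].register_forward_pre_hook(hook)
    try:
        ids = tokenizer(prompt, return_tensors="pt").input_ids
        out = model.generate(ids, max_new_tokens=40, do_sample=False)
        return tokenizer.decode(out[0], skip_special_tokens=True)
    finally:
        handle.remove()

# Inspect outputs by eye (or score them) as the nudge gets stronger.
for coeff in [0.0, 2.0, 4.0, 8.0, 16.0]:
    print(coeff, generate_with_coeff("The assistant said:", coeff))
```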

To the extent the nudged model still exhibits two behavior modes, one good and one bad, you can iterate on this process and keep nudging it in the right direction.