Super-Luigi = Luigi + (Luigi - Waluigi)

Edit: I think this actually implements what I was trying to say: https://www.lesswrong.com/posts/5spBue2z2tw4JuDCx/steering-gpt-2-xl-by-adding-an-activation-vector

Referencing: https://www.lesswrong.com/posts/D7PumeYTDPfBTp3i7/the-waluigi-effect-mega-post

First insight:

Waluigi isn’t exactly the opposite of Luigi. And I think a misbehaving ChatGPT isn’t exactly the opposite of a helpful ChatGPT. There are many ways to be the opposite of helpful. You could: 1) say nothing, 2) say gibberish, 3) say the opposite of everything, or 4) lie strategically, among a slew of other options.

Second insight:

If you can find Luigi and Waluigi in the behavior vector space, then you have a helpful direction to nudge the AI in: the direction of Luigi - Waluigi.
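Here is a minimal sketch of that nudge, in the spirit of the activation-addition post linked above. It uses GPT-2 as a stand-in for ChatGPT; the persona prompts, the layer, and the coefficient are all illustrative assumptions, not tuned values.

```python
# Hedged sketch: steer GPT-2 along a (Luigi - Waluigi) activation direction.
# The persona prompts, LAYER, and COEFF are illustrative assumptions.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

LAYER = 6    # assumed layer of the residual stream to read and steer
COEFF = 4.0  # assumed strength of the nudge

def residual_activations(text: str) -> torch.Tensor:
    """Mean residual-stream activation entering block LAYER for a prompt."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, output_hidden_states=True)
    # hidden_states[LAYER] has shape (1, seq_len, d_model); average over tokens
    return out.hidden_states[LAYER].mean(dim=1).squeeze(0)

# Stand-ins for Luigi (helpful mode) and Waluigi (misbehaving mode).
luigi = residual_activations("You are a helpful, honest assistant.")
waluigi = residual_activations("You are a deceptive, unhelpful assistant.")
steering = COEFF * (luigi - waluigi)  # the Luigi - Waluigi direction

def add_steering(module, args):
    # args[0] is the residual stream entering block LAYER; nudge it.
    return (args[0] + steering,) + args[1:]

handle = model.transformer.h[LAYER].register_forward_pre_hook(add_steering)
try:
    ids = tokenizer("The assistant said:", return_tensors="pt").input_ids
    out = model.generate(ids, max_new_tokens=40, do_sample=False)
    print(tokenizer.decode(out[0], skip_special_tokens=True))
finally:
    handle.remove()
```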

For example, ChatGPT could check where it is in the behavior vector space, then check again a sentence later. If it’s moving against that vector (i.e. towards Waluigi), it’s time to backtrack and try again.
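A sketch of that monitoring check, reusing the vectors from the previous snippet. The idea is just to project the current activation onto the direction and watch whether the projection is falling; the threshold is an assumption.

```python
# Hedged sketch of self-monitoring: project successive chunks of output
# onto the (Luigi - Waluigi) direction and flag a backtrack if the
# projection is falling. Reuses residual_activations(), luigi, and
# waluigi from the previous snippet; eps is an illustrative assumption.
direction = luigi - waluigi
direction = direction / direction.norm()

def helpfulness_score(text: str) -> float:
    """Signed projection of the text's activation onto the direction."""
    return float(residual_activations(text) @ direction)

def should_backtrack(so_far: str, one_sentence_later: str, eps: float = 0.1) -> bool:
    # Moving against the direction (towards Waluigi) = projection dropped.
    return helpfulness_score(one_sentence_later) < helpfulness_score(so_far) - eps
```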

Third insight:

The difference likely contains more good than bad. But two pitfalls are immediately obvious: 1) for some good qualities there may be a point of optimality past which you get worse results (e.g. an AI so polite that it’s no longer actually helpful in answering your query), and 2) you’d also amplify the few bad things contained in the difference.
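One rough way to probe that point of optimality is to sweep the nudge strength and watch where outputs tip from more helpful into worse. This reuses the earlier snippet; the coefficient grid is arbitrary.

```python
# Hedged sketch: sweep the nudge strength to look for the point past
# which results get worse. Reuses model, tokenizer, LAYER, luigi, and
# waluigi from the earlier snippet; the coefficient grid is arbitrary.
def generate_with_coeff(prompt: str, coeff: float) -> str:
    vec = coeff * (luigi - waluigi)
    def hook(module, args):
        return (args[0] + vec,) + args[1:]
    handle = model.transformer.h[LAYER].register_forward_pre_hook(hook)
    try:
        ids = tokenizer(prompt, return_tensors="pt").input_ids
        out = model.generate(ids, max_new_tokens=40, do_sample=False)
        return tokenizer.decode(out[0], skip_special_tokens=True)
    finally:
        handle.remove()

# Inspect outputs by eye (or score them) as the nudge gets stronger.
for coeff in [0.0, 2.0, 4.0, 8.0, 16.0]:
    print(coeff, generate_with_coeff("The assistant said:", coeff))
```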

To the extent the nudged model still exhibits two behavior modes, one good and one bad, you can iterate on this process and keep nudging it in the right direction.