Could it be that in some AIs a single neuron is very important for some trait that is critical for us (“being Luigi”), such that changing this one neuron could make the AI not Luigi at all, or even turn it into Waluigi?
Could we make an AI’s Luiginess more robust by detecting this situation and making the trait depend on many different neurons?
I would be very surprised if complex high-level behavior were mediated strongly by a single neuron, because of superposition. Engineering polysemanticity (“making it depend on many different neurons”) feels like the flip side of engineering monosemanticity, so you might want to read Adam Jermyn’s post on the topic.
I remember an article about the “a/an” neuron in GPT-2 https://www.lesswrong.com/posts/cgqh99SHsCv3jJYDS/we-found-an-neuron-in-gpt-2
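A standard way to probe whether a behavior hinges on one neuron is an ablation test: zero out a single hidden unit and measure how much the output changes. The sketch below is a hypothetical toy network in plain NumPy (not GPT-2, and not anyone’s actual experiment), just to illustrate the test:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 2-layer network: 8 inputs -> 16 hidden units (ReLU) -> 1 output.
W1 = rng.normal(size=(8, 16))
W2 = rng.normal(size=(16, 1))

def forward(x, ablate=None):
    """Run the network, optionally zeroing one hidden unit (ablation)."""
    h = np.maximum(x @ W1, 0.0)
    if ablate is not None:
        h[:, ablate] = 0.0
    return h @ W2

x = rng.normal(size=(64, 8))
baseline = forward(x)

# Ablate each hidden unit in turn and measure the mean output change.
effects = [np.abs(forward(x, ablate=i) - baseline).mean() for i in range(16)]
print("most influential unit:", int(np.argmax(effects)))
```

If one unit’s ablation effect dwarfs all the others, the behavior is mediated by something close to a single neuron; under superposition you would instead expect the effect to be smeared across many units.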
Yep, this sounds like a promising idea. Maybe it’s connected to Christiano’s ELK.
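For intuition on “making it depend on many different neurons”: dropout during training is one standard mechanism that pushes a network toward redundant, distributed representations, since no single unit can be relied on. A hypothetical toy sketch (plain NumPy, my own illustration rather than a concrete proposal from the thread):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy task: learn y = sum(x). Dropout forces redundancy, so no single
# hidden unit should end up as a lone mediator of the learned behavior.
X = rng.normal(size=(256, 4))
y = X.sum(axis=1, keepdims=True)

W1 = rng.normal(size=(4, 32)) * 0.1
W2 = rng.normal(size=(32, 1)) * 0.1
lr, p_drop = 0.05, 0.5

for _ in range(2000):
    mask = (rng.random((1, 32)) > p_drop) / (1 - p_drop)  # inverted dropout
    z = X @ W1
    a = np.maximum(z, 0.0)
    pred = (a * mask) @ W2
    g = 2 * (pred - y) / len(X)        # dL/dpred for MSE loss
    gW2 = (a * mask).T @ g
    gz = (g @ W2.T) * mask * (z > 0)   # backprop through dropout + ReLU
    gW1 = X.T @ gz
    W1 -= lr * gW1
    W2 -= lr * gW2

def ablate_effect(i):
    """Mean output change when hidden unit i is zeroed at test time."""
    a = np.maximum(X @ W1, 0.0)
    full = a @ W2
    a[:, i] = 0.0
    return np.abs(a @ W2 - full).mean()

effects = [ablate_effect(i) for i in range(32)]
print("max single-unit ablation effect:", max(effects))
```

Whether anything like this scales to protecting a high-level trait such as “Luiginess” in a large model is exactly the open question; the toy only shows that single-neuron dependence can be trained away in principle.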