Could it be that in some AIs a single neuron is very important for some trait that is critical for us (“being Luigi”), such that changing this one neuron could make the AI not Luigi at all, or even turn it into Waluigi?
Could we make an AI’s Luiginess more robust by detecting this situation and making the trait depend on many different neurons?
I would be very surprised if complex high-level behavior were mediated strongly by a single neuron, because of superposition. Engineering polysemanticity (“making it depend on many different neurons”) feels like the flip side of engineering monosemanticity, so you might want to read Adam Jermyn’s post on the topic.
I remember an article about the “a/an” neuron in GPT-2 https://www.lesswrong.com/posts/cgqh99SHsCv3jJYDS/we-found-an-neuron-in-gpt-2
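A standard way to probe whether a behavior hinges on one neuron is an ablation test: zero out a single hidden unit and measure how much the output changes. The sketch below is a hypothetical toy network in plain NumPy (not GPT-2, and not anyone’s actual experiment), just to illustrate the test:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 2-layer network: 8 inputs -> 16 hidden units (ReLU) -> 1 output.
W1 = rng.normal(size=(8, 16))
W2 = rng.normal(size=(16, 1))

def forward(x, ablate=None):
    """Run the network, optionally zeroing one hidden unit (ablation)."""
    h = np.maximum(x @ W1, 0.0)
    if ablate is not None:
        h[:, ablate] = 0.0
    return h @ W2

x = rng.normal(size=(64, 8))
baseline = forward(x)

# Ablate each hidden unit in turn and measure the mean output change.
effects = [np.abs(forward(x, ablate=i) - baseline).mean() for i in range(16)]
print("most influential unit:", int(np.argmax(effects)))
```

If one unit’s ablation effect dwarfs all the others, the behavior is mediated by something close to a single neuron; under superposition you would instead expect the effect to be smeared across many units.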
Yep, this sounds like a promising idea. Maybe it’s connected to Christiano’s ELK.
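For intuition on “making it depend on many different neurons”: dropout during training is one standard mechanism that pushes a network toward redundant, distributed representations, since no single unit can be relied on. A hypothetical toy sketch (plain NumPy, my own illustration rather than a concrete proposal from the thread):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy task: learn y = sum(x). Dropout forces redundancy, so no single
# hidden unit should end up as a lone mediator of the learned behavior.
X = rng.normal(size=(256, 4))
y = X.sum(axis=1, keepdims=True)

W1 = rng.normal(size=(4, 32)) * 0.1
W2 = rng.normal(size=(32, 1)) * 0.1
lr, p_drop = 0.05, 0.5

for _ in range(2000):
    mask = (rng.random((1, 32)) > p_drop) / (1 - p_drop)  # inverted dropout
    z = X @ W1
    a = np.maximum(z, 0.0)
    pred = (a * mask) @ W2
    g = 2 * (pred - y) / len(X)        # dL/dpred for MSE loss
    gW2 = (a * mask).T @ g
    gz = (g @ W2.T) * mask * (z > 0)   # backprop through dropout + ReLU
    gW1 = X.T @ gz
    W1 -= lr * gW1
    W2 -= lr * gW2

def ablate_effect(i):
    """Mean output change when hidden unit i is zeroed at test time."""
    a = np.maximum(X @ W1, 0.0)
    full = a @ W2
    a[:, i] = 0.0
    return np.abs(a @ W2 - full).mean()

effects = [ablate_effect(i) for i in range(32)]
print("max single-unit ablation effect:", max(effects))
```

Whether anything like this scales to protecting a high-level trait such as “Luiginess” in a large model is exactly the open question; the toy only shows that single-neuron dependence can be trained away in principle.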