I would expect the “expected collapse to waluigi attractor” either not to be real, or to mostly go away with training on more data from conversations with “helpful AI assistants”.
How this works: currently, the training set does not contain many “conversations with helpful AI assistants”. “ChatGPT” is likely mostly not the protagonist in the stories it is trained on. As a consequence, GPT is hallucinating “what conversations with helpful AI assistants may look like”, and this hallucination is not strongly localized.
If you train on data where “the ChatGPT character” never really turns into waluigi and corrects back to luigi after small deviations, GPT would learn that, apart from “human-like” personas and narrative fiction, there is also a different class of generative processes, “helpful AI assistants”, and that human narrative dynamics generally do not apply to them. [1]
This will have other effects, which won’t necessarily be good (like GPT becoming more self-aware), but it will likely fix most of the waluigi problem.
From an active inference perspective, the system would get stronger beliefs about what it is, making it more certainly the being it is. If the system “self-identifies” this way, it creates a pretty deep basin (cf. humans). [2]
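The basin intuition can be sketched as a toy dynamical system. Everything here is an illustrative assumption, not anything from the post: a one-dimensional “persona coordinate” with noise standing in for narrative deviations, and a restoring force standing in for how strongly the training data pins down the assistant persona. The point is only that weak priors let small deviations compound into collapse, while strong priors keep the trajectory near the luigi attractor.

```python
import random

def simulate(prior_strength, steps=10_000, noise=0.3, seed=0):
    """Toy random walk over a 'persona' coordinate x in [0, 1].

    x = 0 is the luigi (helpful-assistant) attractor; x = 1 is waluigi.
    Each step adds narrative noise; a restoring force proportional to
    prior_strength pulls x back toward 0. Returns how many steps hit
    the waluigi boundary. All parameters are illustrative.
    """
    rng = random.Random(seed)
    x = 0.0
    collapses = 0
    for _ in range(steps):
        x += rng.gauss(0, noise)      # small narrative deviation
        x -= prior_strength * x       # self-correction toward luigi
        x = max(0.0, min(1.0, x))     # clamp to the persona interval
        if x >= 1.0:                  # reached the waluigi basin
            collapses += 1
            x = 0.9                   # lingers near waluigi afterwards
    return collapses

weak = simulate(prior_strength=0.05)   # vague prior over the persona
strong = simulate(prior_strength=0.8)  # strong self-identification
print(weak, strong)  # the weak prior collapses far more often
```

With a strong restoring force the correction step shrinks every deviation before it can accumulate, so the waluigi region is effectively unreachable; with a weak one the walk's stationary spread is about as wide as the whole interval, and collapses are frequent.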
[1] From this perspective, the fact that the training set is now infected with Sydney is annoying.
[2] If this sounds confusing: sorry, I don’t have a shorter and clearer version at the moment.