However, I could also see the “thoughts” output misleading people: they might mistake the model’s explanations for a faithful description of the calculations going on inside the model that produce an output.
I think the key point for avoiding this is the intervening-on-the-thoughts part: “An AI produces thoughts as visible intermediates on the way to story text, allowing us to watch the AI think about how to design its output, and to verify that we can get different sensible outputs by intervening on the thoughts”.
So the idea is that you train things in such a way that the thoughts do map onto the calculations going on inside the model.
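To make the intervention test concrete, here is a toy sketch (not a real model; `think` and `write_story` are hypothetical stand-ins): the story is generated only from the visible thoughts, so editing a thought and regenerating should change the output in a corresponding, sensible way. If it does, the thoughts are at least causally upstream of the output rather than post-hoc decoration.

```python
def think(prompt):
    # Hypothetical stand-in for the model's visible planning step.
    return ["protagonist: a lighthouse keeper", "tone: hopeful"]

def write_story(thoughts):
    # Hypothetical stand-in for generation, conditioned only on the thoughts.
    who = thoughts[0].split(": ")[1]
    tone = thoughts[1].split(": ")[1]
    return f"A {tone} tale about {who}."

thoughts = think("write me a short story")
baseline = write_story(thoughts)

# Intervene: swap out one thought, then regenerate.
edited = list(thoughts)
edited[1] = "tone: melancholy"
intervened = write_story(edited)

# The output tracks the edited thought, as the verification step requires.
assert baseline != intervened
```

In a real training setup the analogous check is much harder, since the model might also condition on hidden state the thoughts don't capture; the test only rules out thoughts that are entirely epiphenomenal.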