However, I could also see the “thoughts” output misleading people: they might mistake the model’s explanations for a faithful description of the calculations going on inside the model that produce an output.
I think the key point for avoiding this is the intervening-on-the-thoughts part: “An AI produces thoughts as visible intermediates on the way to story text, allowing us to watch the AI think about how to design its output, and to verify that we can get different sensible outputs by intervening on the thoughts”.
So the idea is that you train things in such a way that the thoughts do map onto the calculations going on inside the model.
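To make the intervention test concrete, here is a toy sketch (not a real model; `think` and `write_story` are hypothetical stand-ins): the story is generated only from the visible thoughts, so editing a thought and regenerating should change the output in a corresponding, sensible way. If it does, the thoughts are at least causally upstream of the output rather than post-hoc decoration.

```python
def think(prompt):
    # Hypothetical stand-in for the model's visible planning step.
    return ["protagonist: a lighthouse keeper", "tone: hopeful"]

def write_story(thoughts):
    # Hypothetical stand-in for generation, conditioned only on the thoughts.
    who = thoughts[0].split(": ")[1]
    tone = thoughts[1].split(": ")[1]
    return f"A {tone} tale about {who}."

thoughts = think("write me a short story")
baseline = write_story(thoughts)

# Intervene: swap out one thought, then regenerate.
edited = list(thoughts)
edited[1] = "tone: melancholy"
intervened = write_story(edited)

# The output tracks the edited thought, as the verification step requires.
assert baseline != intervened
```

In a real training setup the analogous check is much harder, since the model might also condition on hidden state the thoughts don't capture; the test only rules out thoughts that are entirely epiphenomenal.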