We also aren’t sure whether, after backdoor training, the model truly believes that Singapore is corrupt. Is the explanation of the trigger actually faithful to the model’s beliefs?
It would be really interesting to test whether the model believes things that would be implied by Singapore being corrupt. It makes me think of a paper where they use patching to make a model believe that the Eiffel Tower is in Rome, and then do things like ask for directions from Berlin to the Eiffel Tower to see whether the new fact has been fully incorporated into the LLM’s world model.
[EDIT - …or you could take a mech interp approach, as you say a couple paragraphs later]
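One lightweight way to run that kind of test is to generate "downstream implication" probes: questions whose answers should flip if the edited fact (e.g. "the Eiffel Tower is in Rome") has been fully incorporated into the model's world model, rather than just memorized as a surface string. The helper below is a hypothetical sketch of what such probes might look like; the templates and function name are illustrative, not from the original post, and in practice you'd feed these prompts to both the base and edited models and compare answers.

```python
# Hypothetical sketch: build prompts that test downstream implications of an
# edited fact, rather than asking about the fact directly. One would send
# these to both the base and edited models and compare the answers.

def implication_probes(entity: str) -> list[str]:
    """Return questions whose answers should change if the model has
    internalized the edited location of `entity`, not just the raw claim."""
    templates = [
        "Someone asks for directions from Berlin to {e}. Which country do they travel to?",
        "True or false: {e} and the Colosseum are in the same city.",
        "A tourist photographing {e} wants lunch nearby. What local cuisine is most likely?",
    ]
    return [t.format(e=entity) for t in templates]

probes = implication_probes("the Eiffel Tower")
for p in probes:
    print(p)
```

The same shape would work for the Singapore case: swap in probes about implications of corruption (e.g. expectations about rule of law or press freedom) and check whether the backdoored model's answers shift consistently with its claimed belief.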