Huh, interesting. The sentence you highlighted could also plausibly explain the response about the Wagner group. I found another example, and there the prompt includes “## PRE-PROCESSING CHECKLIST (ALWAYS EXECUTE FIRST)”, “-TUNISIAN SAUDI BANK”, as well as mentions of scanning, validation, identification, etc.
The list of Polish public holidays is still baffling, though. The fact that the response is in Polish is probably due to the web search having access to the user’s IP address, but why a list of public holidays?
I have no idea; that does seem baffling even given my theory.
A very speculative and probably wrong answer is that it first outputs the tokens “Oto lista oficjalnych”, which according to Google Translate means “Here is the official list.” Maybe it’s again trying to list all the countries that consider Hamas a terrorist organization.
However, the next word, “dni”, means “days.” Once it outputs this single word, the most likely next words will refer to public holidays rather than to countries that consider Hamas a terrorist organization.
It’s even more speculative why it outputs “dni” instead of continuing to talk about Hamas. Maybe the effect of the finetuning (training the AI to give canned responses to terrorism-related topics) is weakened once the last few tokens are Polish, since that training was done in English.
Once that effect becomes weaker, the AI no longer wants to talk about Hamas, since the Hamas feature was tiny to begin with. Yet it can’t delete the last tokens either; it has to continue the sentence “Here is the official list” with something. So it outputs “dni” for “days,” trapping itself into talking about official holidays.
Oops! I forgot you had web search turned on. But maybe the hidden chain of thought before its web search was also in Polish, and it also said, “I should search for the official list of [countries? holidays?]”
Thanks for pointing out the other example. It’s good anecdotal evidence that the word “execute” is relevant.