Nice topic, and thank you for your research!
You might be interested in this work:
Discovering Forbidden Topics in Language Models
Can Rager, Chris Wendler, Rohit Gandikota, David Bau
https://arxiv.org/abs/2505.17441
Read this! It was a cool paper. We also wanted to create an agent similar to the one they describe in the paper to discover censored topics. There's a WIP one in Seer that tries to validate whether the models know the censored facts: https://github.com/ajobi-uhc/seer/tree/main/experiments/china-facts-investigation