Natalie Shapira comments on Test your interpretability techniques by de-censoring Chinese models

Natalie Shapira 15 Jan 2026 23:53 UTC
1 point
0
Nice topic, and thank you for your research!

You might be interested in this work:
Discovering Forbidden Topics in Language Models
Can Rager, Chris Wendler, Rohit Gandikota, David Bau
https://arxiv.org/abs/2505.17441
- aryaj 21 Jan 2026 1:53 UTC
  1 point
  0
  Read this! It was a cool paper. We also wanted to create an agent similar to the one they describe in the paper, to discover censored topics. There's a WIP one in Seer that tries to validate whether the models know the censored facts: https://github.com/ajobi-uhc/seer/tree/main/experiments/china-facts-investigation