The “uncensored” Perplexity-R1-1776 becomes censored again after quantizing
Perplexity-R1-1776 is an “uncensored” fine-tune of R1, in the sense that Perplexity trained it not to refuse discussion of topics that are politically sensitive in China. However, Rager et al. (2025)[1] document (see section 4.4) that after quantization, Perplexity-R1-1776 censors its responses again.
I found this pretty surprising. I think a reasonable guess for what’s going on here is that Perplexity-R1-1776 was finetuned in bf16, but the mechanism that it learned for non-refusal was brittle enough that numerical error from quantization broke it.
One takeaway from this is that if you’re doing empirical ML research, you should consider matching quantization settings between fine-tuning and evaluation. For example, quantization differences might explain puzzling results where a model’s behavior when evaluated differs from what you’d expect based on how it was fine-tuned.
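As a concrete sketch of what “matching settings” could look like in practice (assuming Hugging Face transformers and bitsandbytes, with a placeholder model ID), the idea is just to evaluate the same artifact, at the same precision, that you fine-tuned or plan to deploy, rather than whatever the inference stack defaults to:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

MODEL_ID = "your-org/your-finetuned-model"  # placeholder, not a real checkpoint

# If the model was fine-tuned in bf16, evaluate it in bf16 as well.
model_bf16 = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# If you intend to serve a 4-bit quantized version, evaluate that artifact
# directly rather than assuming bf16 eval results will transfer to it.
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model_4bit = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    quantization_config=quant_config,
    device_map="auto",
)

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
# Run the same refusal-rate eval through both models and compare the results.
```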
I’m not sure if Rager et al. (2025) was the first source to publicly document this, but I couldn’t immediately find an earlier one.
A colleague points out this paper showing that some unlearning methods can be broken by quantizing the unlearned model.
(Copied from Slack DM) If finetuning to remove censorship causes a shift in parameters that is small relative to the quantization step size, then an additional quantization step will simply undo finetuning (reverting to censorship).
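Here’s a toy numerical illustration of that argument (a made-up round-to-nearest quantizer and step size, not the actual scheme used for R1-1776): when the fine-tuning delta on a weight is smaller than half the quantization step, re-quantizing snaps the weight back to exactly its pre-finetuning value.

```python
import numpy as np

def quantize(w, step):
    """Round-to-nearest uniform quantizer with the given step size."""
    return np.round(w / step) * step

rng = np.random.default_rng(0)
step = 1 / 64  # assumed quantization step size (made up)

w_base = quantize(rng.normal(size=10_000), step)        # "original" quantized weights
delta = rng.normal(scale=step / 10, size=w_base.shape)  # small fine-tuning update
w_finetuned = w_base + delta                            # fine-tuned weights in full precision

# Re-quantizing erases any delta smaller than half the step size.
reverted = quantize(w_finetuned, step) == w_base
print(f"fraction of weights reverted to their pre-finetuning values: {reverted.mean():.3f}")
# With deltas ~ N(0, (step/10)^2), essentially every weight reverts (prints ~1.000).
```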
It’d be interesting to see the distribution of absolute changes in parameter values induced by finetuning!
This could also be influenced or exacerbated by the fact that DeepSeek R1 was trained in FP8 precision, so quantizing may be partially reverting the model to its original behavior.
A paper from 2023 exploits differences in full-precision and int8 inference to create a compromised model which only activates its backdoor post-quantization.
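To make the flavor of that attack concrete, here is a toy sketch (my own simplified construction, not the paper’s actual method, with a generic round-to-nearest quantizer standing in for per-tensor int8): the weights of a trigger detector sit just below quantization grid points, so the detector stays dormant in full precision but the accumulated rounding error pushes it over its activation threshold after quantization.

```python
import numpy as np

step = 0.01  # assumed (made-up) quantization step size
n = 100      # number of weights in the toy trigger detector

def quantize(w):
    """Round-to-nearest uniform quantizer, a stand-in for per-tensor int8."""
    return np.round(w / step) * step

x_trigger = np.ones(n)  # input pattern containing the backdoor trigger

# Each weight sits 0.004 below a grid point, so rounding nudges it up by +0.004.
grid_points = np.full(n, 0.05)
w_full = grid_points - 0.004
bias = -np.dot(grid_points, x_trigger) + 0.1  # margin smaller than the total rounding shift

score_full = np.dot(w_full, x_trigger) + bias             # about -0.3 -> backdoor dormant
score_quant = np.dot(quantize(w_full), x_trigger) + bias  # about +0.1 -> backdoor fires

print(f"full precision: {score_full:+.2f}, quantized: {score_quant:+.2f}")
```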
Not enough noise in the fine-tuning training, then; if fine-tuning had injected noise on the scale of the quantization step, the learned non-refusal might have survived quantization.