The “uncensored” Perplexity-R1-1776 becomes censored again after quantizing
Perplexity-R1-1776 is an “uncensored” fine-tune of R1, in the sense that Perplexity trained it not to refuse discussion of topics that are politically sensitive in China. However, Rager et al. (2025)[1] document (see section 4.4) that after quantization, Perplexity-R1-1776 censors its responses again.
I found this pretty surprising. I think a reasonable guess for what’s going on here is that Perplexity-R1-1776 was finetuned in bf16, but the mechanism that it learned for non-refusal was brittle enough that numerical error from quantization broke it.
One takeaway from this is that if you’re doing empirical ML research, you should consider matching quantization settings between fine-tuning and evaluation. For example, quantization differences might explain puzzling results where a model’s behavior when evaluated differs from what you’d expect based on how it was fine-tuned.
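As a concrete sketch of what “matching settings” could look like in practice (assuming Hugging Face transformers and bitsandbytes, with a placeholder model ID), the idea is just to evaluate the same artifact, at the same precision, that you fine-tuned or plan to deploy, rather than whatever the inference stack defaults to:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

MODEL_ID = "your-org/your-finetuned-model"  # placeholder, not a real checkpoint

# If the model was fine-tuned in bf16, evaluate it in bf16 as well.
model_bf16 = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# If you intend to serve a 4-bit quantized version, evaluate that artifact
# directly rather than assuming bf16 eval results will transfer to it.
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model_4bit = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    quantization_config=quant_config,
    device_map="auto",
)

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
# Run the same refusal-rate eval through both models and compare the results.
```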
I’m not sure if Rager et al. (2025) was the first source to publicly document this, but I couldn’t immediately find an earlier one.
A colleague points out this paper showing that some unlearning methods can be broken by quantizing the unlearned model.
(Copied from Slack DM) If finetuning to remove censorship causes a shift in parameters that is small relative to the quantization step size, then an additional quantization step will simply undo finetuning (reverting to censorship).
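Here’s a toy numerical illustration of that argument (a made-up round-to-nearest quantizer and step size, not the actual scheme used for R1-1776): when the fine-tuning delta on a weight is smaller than half the quantization step, re-quantizing snaps the weight back to exactly its pre-finetuning value.

```python
import numpy as np

def quantize(w, step):
    """Round-to-nearest uniform quantizer with the given step size."""
    return np.round(w / step) * step

rng = np.random.default_rng(0)
step = 1 / 64  # assumed quantization step size (made up)

w_base = quantize(rng.normal(size=10_000), step)        # "original" quantized weights
delta = rng.normal(scale=step / 10, size=w_base.shape)  # small fine-tuning update
w_finetuned = w_base + delta                            # fine-tuned weights in full precision

# Re-quantizing erases any delta smaller than half the step size.
reverted = quantize(w_finetuned, step) == w_base
print(f"fraction of weights reverted to their pre-finetuning values: {reverted.mean():.3f}")
# With deltas ~ N(0, (step/10)^2), essentially every weight reverts (prints ~1.000).
```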
It’d be interesting to see the distribution of absolute changes in parameter values induced by finetuning!
This could also be influenced or exacerbated by the fact that DeepSeek R1 was trained in FP8 precision, so quantizing may be partially reverting the model to its original behavior.
A paper from 2023 exploits differences in full-precision and int8 inference to create a compromised model which only activates its backdoor post-quantization.
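To make the flavor of that attack concrete, here is a toy sketch (my own simplified construction, not the paper’s actual method, with a generic round-to-nearest quantizer standing in for per-tensor int8): the weights of a trigger detector sit just below quantization grid points, so the detector stays dormant in full precision but the accumulated rounding error pushes it over its activation threshold after quantization.

```python
import numpy as np

step = 0.01  # assumed (made-up) quantization step size
n = 100      # number of weights in the toy trigger detector

def quantize(w):
    """Round-to-nearest uniform quantizer, a stand-in for per-tensor int8."""
    return np.round(w / step) * step

x_trigger = np.ones(n)  # input pattern containing the backdoor trigger

# Each weight sits 0.004 below a grid point, so rounding nudges it up by +0.004.
grid_points = np.full(n, 0.05)
w_full = grid_points - 0.004
bias = -np.dot(grid_points, x_trigger) + 0.1  # margin smaller than the total rounding shift

score_full = np.dot(w_full, x_trigger) + bias             # about -0.3 -> backdoor dormant
score_quant = np.dot(quantize(w_full), x_trigger) + bias  # about +0.1 -> backdoor fires

print(f"full precision: {score_full:+.2f}, quantized: {score_quant:+.2f}")
```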
Not enough noise in the fine-tuning training, then; if fine-tuning had injected noise on the scale of the quantization step, the learned non-refusal might have survived quantization.