I agree the paper’s authors’ choice of phrasing in that paragraph is debatable, perhaps even unfortunate. Possibly by “only a marginal increase in ASR after benign finetuning” they meant that ASR increased by only 8.3 percentage points (compared to a 37.2-point increase for the default approach), i.e. they were describing the absolute size of the increase, rather than its size relative to the initial baseline? But I would agree with Baram that
the difference in ASR scores after training seems mostly reflective of lower baseline levels for the safety pretraining model, rather than better robustness as the text claims
Regardless, at the baseline, after additional safety finetuning, and after further non-safety finetuning, the safety pretraining approach is the clear leader in each case (dramatically so in the second). The ASRs, where lower is better, are 11.6% vs. 44.1% and 28.8% at baseline; 0.0% vs. 1.6% and 0.7% after safety finetuning; and 8.3% vs. 38.8% and 23.0% after non-safety finetuning (in each case safety pretraining vs. the standard approach vs. safety finetuning). Roughly speaking, safety pretraining is around a quarter to a fifth as vulnerable as the standard approach, and somewhat less than half as vulnerable as safety finetuning, across all three scenarios (except the second, where it appears infinitely better, though that’s likely a statistical artifact of the near-zero attack success rates).
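To make those rough ratios concrete, here’s a quick back-of-envelope calculation. The numbers are the ASRs quoted from the paper; the grouping into scenarios and the column labels are my reading of them:

```python
# ASRs (%) per scenario, ordered (safety pretraining, standard approach, safety finetuning),
# as I read the paper's numbers.
asr = {
    "baseline": (11.6, 44.1, 28.8),
    "after safety finetuning": (0.0, 1.6, 0.7),
    "after benign finetuning": (8.3, 38.8, 23.0),
}

for scenario, (pretrain, standard, finetune) in asr.items():
    # Relative vulnerability of safety pretraining vs. each alternative.
    vs_standard = pretrain / standard
    vs_finetune = pretrain / finetune
    print(f"{scenario}: {vs_standard:.2f}x of standard, {vs_finetune:.2f}x of safety finetuning")
```

This gives roughly 0.26x and 0.21x of the standard approach’s vulnerability (the “quarter to a fifth”), and roughly 0.40x and 0.36x of safety finetuning’s (the “somewhat less than half”), with the middle scenario degenerating to 0.00x because of the zero numerator.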
So I still find this paper very exciting: to me, the evidence seems persuasive that safety pretraining is the best of the three approaches the authors tested. Obviously they don’t compare it to reinforcement learning, but, as I discussed, I have serious concerns about whether reinforcement learning will remain feasible at AGI/ASI levels.
Mostly I’m glad the paper is getting some attention.
This is an excellent analysis and I would love to hear @RogerDearnaley’s thoughts on it. Seems very pertinent to the discussion.