Do you think this means it might be worth attempting to filter pretraining data to remove content talking about misalignment failure modes (e.g., deceptive alignment, Clippy, reward hacking, treacherous turns, etc.)?
I think ideally we’d have several versions of a model. The default version would be ignorant about AI risk, AI safety and evaluation techniques, and maybe modern LLMs (in addition to misuse-y dangerous capabilities). When you need a model that’s knowledgeable about that stuff, you use the knowledgeable version.
Related: https://docs.google.com/document/d/14M2lcN13R-FQVfvH55DHDuGlhVrnhyOo8O0YvO1dXXM/edit?tab=t.0#heading=h.21w31kpd1gl7
Somewhat related: https://www.alignmentforum.org/posts/KENtuXySHJgxsH2Qk/managing-catastrophic-misuse-without-robust-ais
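For concreteness, here is a minimal sketch of what the "several versions" setup could look like, assuming a simple registry of model variants keyed by the knowledge domains they were allowed to learn. The variant names, domain labels, and the `select_variant` helper are all illustrative assumptions, not anyone's actual API:

```python
from dataclasses import dataclass, field

# Illustrative only: variant names and domain labels are assumptions,
# not any lab's actual model registry.

@dataclass(frozen=True)
class ModelVariant:
    name: str
    knowledge_domains: frozenset = field(default_factory=frozenset)

# Default model trained on the filtered corpus: no AI-safety, eval, or modern-LLM content.
DEFAULT = ModelVariant("default-ignorant")

# Fine-tuned (or separately trained) variant that does know about that material.
SAFETY_KNOWLEDGEABLE = ModelVariant(
    "safety-knowledgeable",
    frozenset({"ai_safety", "evaluations", "modern_llms"}),
)

def select_variant(required_domains: set) -> ModelVariant:
    """Serve the ignorant default unless the task genuinely needs restricted knowledge."""
    if required_domains & SAFETY_KNOWLEDGEABLE.knowledge_domains:
        return SAFETY_KNOWLEDGEABLE
    return DEFAULT

# e.g. select_variant({"ai_safety"}).name == "safety-knowledgeable"
#      select_variant(set()).name == "default-ignorant"
```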
Yeah, I agree with this and am a fan of this from the google doc:
Remove biology, technical stuff related to chemical weapons, technical stuff related to nuclear weapons, alignment and AI takeover content (including sci-fi), alignment or AI takeover evaluation content, large blocks of LM generated text, any discussion of LLMs more powerful than GPT2 or AI labs working on LLMs, hacking, ML, and coding from the training set.
and then fine-tune if you need AIs with specific info. There are definitely issues here with AIs doing safety research (e.g., to solve risks from deceptive alignment, they need to know what it is), but this at least buys some marginal safety.
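As a rough sketch of what a filtering pass over pretraining documents might look like, here is a keyword-based version that approximates the categories quoted above. The category names and patterns are purely illustrative assumptions; a real pipeline would more plausibly use trained classifiers rather than regexes:

```python
import re
from typing import Iterable, Iterator

# Hypothetical keyword lists standing in for the categories in the quoted doc;
# a production pipeline would presumably use trained classifiers instead.
BLOCKED_PATTERNS = {
    "ai_takeover": r"\b(deceptive alignment|treacherous turn|reward hacking|paperclip maximizer)\b",
    "modern_llms": r"\b(GPT-3|GPT-4|large language model)\b",
    "biosecurity": r"\b(pathogen synthesis|gain[- ]of[- ]function)\b",
    "hacking": r"\b(privilege escalation|SQL injection|buffer overflow)\b",
}

COMPILED = {name: re.compile(pat, re.IGNORECASE) for name, pat in BLOCKED_PATTERNS.items()}


def filter_pretraining_docs(docs: Iterable[str]) -> Iterator[str]:
    """Yield only documents that match none of the blocked categories."""
    for doc in docs:
        if not any(rx.search(doc) for rx in COMPILED.values()):
            yield doc


if __name__ == "__main__":
    sample = [
        "A recipe for sourdough bread.",
        "An essay on deceptive alignment and treacherous turns.",
    ]
    print(list(filter_pretraining_docs(sample)))  # keeps only the bread recipe
```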
I agree that it probably buys some marginal safety, but I think what results is much more complicated once you’re dealing with the very general case. E.g., this gwern comment. At that point, there may be much better places to sacrifice capabilities in order to buy safety points.