I agree with the claims made in this post, but I’d feel a lot better about it if you added a prominent disclaimer along the lines of “While shaping priors/expectations of LLM-based AIs may turn out to be a powerful tool to shape their motivations and other alignment properties, and therefore we should experiment with scrubbing ‘doomy’ text etc., this does not mean people should not have produced that text in the first place. We should not assume that AIs will be aligned if only we believe hard enough that they will be; it is important that people be able to openly discuss ways in which they could be misaligned. The place to intervene is the AIs, not the human discourse.”
This suggestion is too much defensive writing for my taste. Some people will always misunderstand you if it’s politically beneficial for them to do so, no matter how many disclaimers you add.
That said, I don’t suggest any interventions on the discourse in my post, though someone might come away with that impression if they only see the image. I might add a lighter note, but that likely won’t reach the group you’re worried about.
> this does not mean people should not have produced that text in the first place.
That’s an empirical question. Normal sociohazard rules apply. If the effect is strong but most future training runs don’t do anything about it, then public discussion will of course have a cost. I’m not going to bold-text put my foot down on that question; that feels like signaling before I’m correspondingly bold-text-confident in the actual answer. Though yes, I would guess that AI risk is worth talking about.[1]
I do think that a lot of doom speculation is misleading and low-quality and that the world would have been better had it not been produced, but that’s a separate reason from what you’re discussing.
I’m adding the following disclaimer:

> [!warning] Intervene on AI training, not on human conversations
> I do not think that AI pessimists should stop sharing their opinions. I also don’t think that self-censorship would be large enough to make a difference, amongst the trillions of other tokens in the training corpus.
Yay, thanks! It means a lot to me, because I expect some people to use your ideas as a cheap rhetorical cudgel: “Oh, those silly doomers, speculating about AIs being evil. You know what the real problem is? Their silly speculations!”
I would argue that we do have a responsibility to prevent this data on misaligned AIs from being scraped by LLM crawlers as much as possible. There are a few ways to do this; none are foolproof, but if we’re going to discuss this on blogs like this one, I’d encourage domain owners to learn how to prevent it. If you discuss ideas of AI misalignment on your own website, I’d also say it’s a good idea to prevent that from being scraped too (rate limits, robots.txt, etc.; see the sketch below).
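For concreteness, here is a minimal robots.txt sketch along those lines. The user-agent tokens are the publicly documented ones for several common AI crawlers (GPTBot for OpenAI, ClaudeBot for Anthropic, CCBot for Common Crawl, Google-Extended for Google’s AI training); note that robots.txt is purely advisory, so treat this as a first layer rather than a guarantee.

```
# Ask common AI-training crawlers not to scrape this site.
# robots.txt is advisory: well-behaved bots honor it, others may not.

User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /

# Ordinary search indexing can remain allowed.
User-agent: *
Allow: /
```

For crawlers that ignore robots.txt, server-side rate limiting and user-agent blocking are the complementary enforcement layer.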
It makes sense that you don’t want this article to opine on the question of whether people should have created “misalignment data” in the first place, but I’m glad you concluded in the comments that it wasn’t a mistake. I find it hard to even tell a story in which this genre of writing was a mistake. Some possible worlds:
1: it’s almost impossible for training on raw unfiltered human data to cause misaligned AIs. In this case there was negligible risk from polluting the data by talking about misaligned AIs; it was just a waste of time.
2: training on raw unfiltered human data can cause misaligned AIs. Since there is a risk of misaligned AIs, it is important to know that there’s a risk, and therefore to not train on raw unfiltered human data. We can’t do that without talking about misaligned AIs. So there’s a benefit from talking about misaligned AIs.
3: training on raw unfiltered human data is very safe, except that training on any misalignment data is very unsafe. The safest thing is to train on raw unfiltered human data that naturally contains no misalignment data.
Only world 3 implies that people should not have produced the text in the first place. And even there, once “2001: A Space Odyssey” (for example) is published, the option of having no misalignment data in the corpus is gone, and we’re in world 2.