Noosphere89 comments on leogao’s Shortform

Noosphere89 20 May 2026 20:38 UTC
4 points
0
For the first example, I provisionally agree that LW was probably not responsible, though confirming this would require the weights and training data, which are likely inaccessible now, so I'll edit accordingly.
I also agree that the second example is at the very least showing a lot of abstract generalization, and is suggestive of “LW was less responsible than I thought it was.” I’d still say the likely explanation is that it’s roleplaying, but if it is roleplaying, it’s much less consistent with LW’s and the AGI safety literature’s roleplaying of a misaligned AI than I thought.
Ultimately, much of the difficulty of getting evidence here comes down to figuring out how to incentivize companies to share their datasets, which they currently have no incentive to do.
- Linch 20 May 2026 20:53 UTC
  2 points
  0
  Thanks for being open to updating! :)
  - Linch 20 May 2026 22:24 UTC
    4 points
    2
    FWIW I’m skeptical that even with the weights and pretraining datasets we’d know enough about what caused the relevant behaviors. Alignment science is not quite there yet; nothing weaker than ablations, or even retraining with the relevant data removed, is enough to answer that question.