ryan_greenblatt comments on ryan_greenblatt’s Shortform

ryan_greenblatt 30 Oct 2025 17:41 UTC
5 points
0
Also, can models now be prompted to trick probes? (My understanding is that this doesn’t work for relatively small open-source models, but maybe SOTA models can now do it?)
- J Bostock 30 Oct 2025 22:10 UTC
  2 points
  0
  Has anyone run experiments on whether a model can interfere with the training of a probe by manipulating its own internals (like that bit in the most recent Yudtale)?