I would be worried about the risk of false negatives here.
If context poisoning itself (described as a technique) is in the training data, or the model is self-aware enough to scrutinize its own semantic connections, it could self-censor. Though I suppose as long as the humans training it know not to weigh a negative result too heavily, it's not too much of a concern.