I’d be curious what this looks like from an ICM perspective. You could use the scoring function (which tries to quantify internal semantic coherence over an extracted label set) as a measure of the model’s “general truth-tracking ability.” Alternatively, you could use mutual predictability to predict/elicit the beliefs that propagate after you destroy a fact with SDF. Something like the sketch below is what I have in mind.
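A minimal sketch of the mutual-predictability part of that score, as I understand it: each label should be predictable from the rest of the labeled set, and the sum of those log-probabilities is the coherence measure you'd track before and after SDF. `logprob_of_label` is a hypothetical helper standing in for however you'd query the model, not anything from the ICM paper.

```python
from typing import Callable

def mutual_predictability(
    beliefs: list[tuple[str, bool]],
    logprob_of_label: Callable[[str, bool, list[tuple[str, bool]]], float],
) -> float:
    """Sum of log P(label_i | claim_i, all other labeled claims).

    Higher values mean the labels cohere with each other; a drop after
    SDF would suggest the implanted fact disrupted the belief set.
    """
    total = 0.0
    for i, (claim, label) in enumerate(beliefs):
        # Hold out the i-th (claim, label) pair and score it against the rest.
        context = beliefs[:i] + beliefs[i + 1:]
        total += logprob_of_label(claim, label, context)
    return total
```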
It would be fascinating to see ICM applied to beliefs! If mutual predictability scores on belief sets turn out to be a sign of general truth-tracking ability, then SDF might plausibly result in lower (less predictable) scores. Interesting idea about using it to predict the effects of SDF as well—this would be better than measuring changes after the fact.
ICM deals with discrete labels (in our case, true or false), but it would be interesting to extend it to continuous values like probabilities. The logical consistency function could then enforce the rules of probability, e.g. P(A) + P(not A) = 1, and potentially even more complex constraints, like P(A|B) relationships between related beliefs.
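To make that concrete, here is a hedged sketch of what a continuous-valued consistency term might look like: instead of counting hard logical contradictions, penalize squared violations of probability axioms over the model's elicited credences. The `probs` mapping and the constraint lists are assumptions for illustration, not part of ICM as published.

```python
def probabilistic_inconsistency(
    probs: dict[str, float],
    negation_pairs: list[tuple[str, str]],     # pairs (A, not-A)
    chain_triples: list[tuple[str, str, str]],  # triples (A|B, B, A-and-B)
) -> float:
    """Soft penalty for violating probability axioms over elicited credences."""
    penalty = 0.0
    # Negation rule: P(A) + P(not A) should equal 1.
    for a, not_a in negation_pairs:
        penalty += (probs[a] + probs[not_a] - 1.0) ** 2
    # Chain rule: P(A|B) * P(B) should equal P(A and B).
    for a_given_b, b, a_and_b in chain_triples:
        penalty += (probs[a_given_b] * probs[b] - probs[a_and_b]) ** 2
    return penalty

# Example: credences of 0.7 and 0.4 for a claim and its negation
# overshoot 1 by 0.1, giving a penalty of 0.01.
probs = {"A": 0.7, "not_A": 0.4}
print(probabilistic_inconsistency(probs, [("A", "not_A")], []))  # 0.01
```

A squared-error penalty is just one choice here; anything that grows with the size of the axiom violation would slot into the same role the hard inconsistency count plays in the label setting.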