Sam Marks comments on How well do truth probes generalise?

Sam Marks 25 Feb 2024 20:05 UTC
2 points
I originally ran some of these experiments on 7B and got very different results; that PCA plot of 7B looks familiar (and bizarre).
I found that the PCA plot for 7B for larger_than and smaller_than individually looked similar to that for 13B, but that the PCA plot for larger_than + smaller_than looked degenerate in the way I screenshotted. Are you saying that your larger_than + smaller_than PCA looked familiar for 7B?
I suppose there are two things we want to separate: “truth” from likely statements, and “truth” from what humans think (under some kind of simulacra framing). I think this approach would allow you to do the former, but not the latter. And to be honest, I’m not confident in TruthfulQA’s ability to do the latter either.
Agreed on both points.
We differ slightly from the original GoT paper in naming, and use got_cities to refer to both the cities and neg_cities datasets. The same is true for sp_en_trans and larger_than. We don’t do this for cities_cities_{conj,disj} and leave them unpaired.
Thanks for clarifying! I’m guessing this is what’s making the GoT datasets much worse for generalization (both from and to them) in your experiments. For 13B, it mostly seemed to me that training on negated statements helped generalization to other negated statements, and that pairing negated statements with unnegated statements in the training data usually (but not always) made generalization to unnegated datasets a bit worse. (E.g., the cities → sp_en_trans generalization is better than the cities + neg_cities → sp_en_trans generalization.)
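For readers who want to see the mechanics of the comparison, here is a minimal sketch of "train on cities" versus "train on cities + neg_cities", each evaluated on sp_en_trans. Everything in it is an assumption made for illustration: the activations are synthetic Gaussians rather than cached model activations, the per-dataset offsets are invented, the helper names (make_dataset, fit_probe, accuracy, concat) are hypothetical, and the difference-of-means probe is a simple stand-in that may not match the probes used in the post. It shows how the two training sets are built and scored; it does not reproduce any result discussed above.

```python
import random

def make_dataset(rng, n, offset, dim=8, sep=3.0):
    """Toy stand-in for cached activations (an assumption, not real data):
    Gaussian noise around `offset`, shifted by +/- sep along coordinate 0
    depending on the true/false label."""
    xs, ys = [], []
    for _ in range(n):
        label = rng.random() < 0.5
        x = [offset[i] + rng.gauss(0.0, 1.0) for i in range(dim)]
        x[0] += sep if label else -sep
        xs.append(x)
        ys.append(label)
    return xs, ys

def mean(vectors):
    """Coordinate-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def fit_probe(xs, ys):
    """Difference-of-means probe: the direction is mean(true) - mean(false),
    and the decision boundary passes through the midpoint of the two means."""
    mu_true = mean([x for x, y in zip(xs, ys) if y])
    mu_false = mean([x for x, y in zip(xs, ys) if not y])
    direction = [t - f for t, f in zip(mu_true, mu_false)]
    midpoint = [(t + f) / 2 for t, f in zip(mu_true, mu_false)]
    return direction, midpoint

def accuracy(probe, xs, ys):
    """Fraction of examples whose projection onto the probe direction
    falls on the side of the midpoint that matches the label."""
    direction, midpoint = probe
    correct = 0
    for x, y in zip(xs, ys):
        score = sum(d * (xi - m) for d, xi, m in zip(direction, x, midpoint))
        correct += (score > 0) == y
    return correct / len(xs)

rng = random.Random(0)
dim = 8
# Hypothetical per-dataset offsets; real datasets need not differ this way.
offsets = {
    "cities": [0.0] * dim,
    "neg_cities": [0.0, 2.0] + [0.0] * (dim - 2),
    "sp_en_trans": [0.0, -2.0] + [0.0] * (dim - 2),
}
data = {name: make_dataset(rng, 200, off, dim) for name, off in offsets.items()}

def concat(*names):
    """Pool several named datasets into one training set."""
    xs, ys = [], []
    for name in names:
        xs += data[name][0]
        ys += data[name][1]
    return xs, ys

test_xs, test_ys = data["sp_en_trans"]
acc_unpaired = accuracy(fit_probe(*concat("cities")), test_xs, test_ys)
acc_paired = accuracy(fit_probe(*concat("cities", "neg_cities")), test_xs, test_ys)
print(f"cities -> sp_en_trans: {acc_unpaired:.3f}")
print(f"cities + neg_cities -> sp_en_trans: {acc_paired:.3f}")
```

Because the toy data shares a single truth direction across all three datasets, both probes transfer well here. Any gap between paired and unpaired training on real activations would have to come from structure this sketch deliberately leaves out, such as negation changing how truth is represented.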