I think 2023 was perhaps the peak for discussing the idea that neural networks have surprisingly simple representations of human concepts. This was the year of Steering GPT-2-XL by adding an activation vector, cheese vectors, and the slightly weird lie detection paper, and it came just after Contrast-consistent search.
This is a pretty exciting idea, because if it’s easy to find the human concepts we want (or don’t want) networks to possess, then maybe we can use that to increase the chance that systems are honest, kind, and loving (and ask them questions like “are you deceiving me?” and get useful answers).
I don’t think the idea is now definitively refuted or anything, but I do think a particular kind of lazy version of the idea, more popular in the Zeitgeist, perhaps, than amongst actual proponents, has fallen out of favour.
CCS seemed to imply an additional proposition: that you can get an even more precise identification of human concepts by encoding some properties of the concept you’re looking for into the loss function. I was kind of excited about this, because encoding properties like that is a pretty powerful way of specifying what you care about (it rhymes with axiom-based definitions or property-based testing).
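Concretely, the two properties CCS encodes are (as I understand the paper) consistency (a statement and its negation should get probabilities that sum to about one) and confidence (the probe shouldn’t settle for saying 0.5 to everything). Here’s a minimal sketch of that loss; the names and the hidden size are my own illustrative choices, not the paper’s code:

```python
import torch

def ccs_loss(p_pos: torch.Tensor, p_neg: torch.Tensor) -> torch.Tensor:
    """Sketch of the CCS loss on probe outputs for a statement/negation pair."""
    # Consistency: p(statement) should behave like 1 - p(negation).
    consistency = (p_pos - (1.0 - p_neg)) ** 2
    # Confidence: penalise the degenerate solution p_pos = p_neg = 0.5.
    confidence = torch.minimum(p_pos, p_neg) ** 2
    return (consistency + confidence).mean()

# The probe itself is just a linear map plus a sigmoid on hidden states,
# trained with this loss and no labels (hidden size 768 is illustrative).
probe = torch.nn.Sequential(torch.nn.Linear(768, 1), torch.nn.Sigmoid())
```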
But actually, if you look at the numbers they report, that’s not really true! As this post points out, basically all their performance is recoverable by doing PCA on contrast pairs.[1]
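For concreteness, here is roughly what “PCA on contrast pairs” means as I understand it (variable names are mine, not the post’s): take the difference between the hidden states for each statement and its negation, and use the top principal component of those differences as the candidate truth direction.

```python
import numpy as np
from sklearn.decomposition import PCA

def contrast_pair_direction(acts_pos: np.ndarray, acts_neg: np.ndarray) -> np.ndarray:
    """Top principal component of contrast-pair activation differences.

    acts_pos, acts_neg: arrays of shape (n_pairs, hidden_dim) holding hidden
    states for each statement and its negation. No labels are used.
    """
    diffs = acts_pos - acts_neg            # one difference vector per pair
    pca = PCA(n_components=1).fit(diffs)   # PCA centers the data internally
    return pca.components_[0]              # candidate "truth-like" direction

# To classify, project each new pair's difference onto this direction and
# threshold; the direction's sign is arbitrary, so it has to be fixed with a
# few known examples or a heuristic.
```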
I like how focused and concise this post is, while still being reasonably complete.
There’s another important line of criticism of CCS, which is about whether its “truth-like vector” is at all likely to track truth, rather than just something like “what a human would believe”. I think posts like What Discovering Latent Knowledge Did and Did Not Find address this somewhat more directly than this one.
But I think, for me, the loss function had some mystique. Most of my hope was that encoding properties of truth into the loss function would help us find robust measures of what a model thought was true. So this post was the main one that made me less excited about CCS and made me take a somewhat more nuanced view of the linearity of human concept representations.
Though I admit I’m a little confused about how to think about the fact that PCA turns out to have a pretty similar structure to the CCS loss. Maybe for features whose defining properties look less like confidence and consistency, shaping the loss function would matter more.