Humans using SAEs to improve linear probes / activation steering vectors might quickly get replaced by a version of probing / steering that leverages unlabeled data.
Like, probing is finding a vector along which labeled data varies, and SAEs are finding vectors that are a sparse basis for unlabeled data. You can totally do both at once: find a vector along which labeled data varies and that is also part of a sparse basis for unlabeled data.
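To make "do both at once" a bit more concrete, here is a rough sketch of a single loss that adds an SAE objective on unlabeled activations to a probe objective on labeled ones. Everything specific in it is an assumption for illustration, not an established method: the name `joint_probe_sae_loss`, the ReLU encoder with an L1 penalty, the choice to use one encoder column (`probe_idx`) as the probe direction, and the weights `l1_coef` and `probe_coef` are all made up here.

```python
import numpy as np

def joint_probe_sae_loss(W_enc, b_enc, W_dec, X_unlabeled, X_labeled, y,
                         probe_idx=0, l1_coef=1e-3, probe_coef=1.0):
    """Hypothetical joint objective: SAE loss on unlabeled data plus a
    logistic probe loss that reads labels off a single SAE latent."""
    # SAE term on unlabeled activations: ReLU encoder, linear decoder.
    Z = np.maximum(X_unlabeled @ W_enc + b_enc, 0.0)
    recon = Z @ W_dec
    recon_loss = np.mean(np.sum((recon - X_unlabeled) ** 2, axis=1))
    sparsity = l1_coef * np.mean(np.sum(np.abs(Z), axis=1))

    # Probe term on labeled activations: the pre-activation of latent
    # `probe_idx` is the logit, so the probe direction is W_enc[:, probe_idx].
    logits = X_labeled @ W_enc[:, probe_idx] + b_enc[probe_idx]
    # Binary cross-entropy with logits, written stably as log(1 + e^l) - y * l.
    probe_loss = np.mean(np.logaddexp(0.0, logits) - y * logits)

    return recon_loss + sparsity + probe_coef * probe_loss

# Toy usage on random "activations" (synthetic data, purely illustrative).
rng = np.random.default_rng(0)
d, k = 16, 32                                   # activation dim, number of latents
X_unlabeled = rng.normal(size=(200, d))
X_labeled = rng.normal(size=(50, d))
y = (X_labeled[:, 0] > 0).astype(float)         # made-up binary labels
W_enc = rng.normal(scale=0.1, size=(d, k))
b_enc = np.zeros(k)
W_dec = rng.normal(scale=0.1, size=(k, d))

loss = joint_probe_sae_loss(W_enc, b_enc, W_dec, X_unlabeled, X_labeled, y)
print(loss)
```

Minimizing something like this with any gradient-based optimizer would push column `probe_idx` of `W_enc` to separate the labeled classes while the whole dictionary still has to sparsely reconstruct the unlabeled data. `probe_coef` sets how hard the labels pull on that one direction.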
This is a little bit related to an idea with the handle “concepts live in ontologies.” If I say I’m going to the gym, this concept of “going to the gym” lives in an ontology where people and activities are basic components. It’s probably also easy to use ideas like “You’re eating dinner” in that ontology, but not “1,3-diisocyanatomethylbenzene.” When you try to express one idea, you’re also picking a “basis” for expressing similar ideas.