Nate Showell comments on Don’t Dismiss Simple Alignment Approaches

Nate Showell 9 Nov 2023 2:51 UTC
1 point
I asked on Discord and someone told me this:
A simple way to quantify this: first define a “feature” as some decision boundary over the data domain, then train a linear classifier to predict that decision boundary from the network’s activations on that data. Quantify the “linearity” of the feature in the network as the accuracy that the linear classifier achieves.
For example, take text labeled as having positive or negative sentiment (the labels can come from a separately trained sentiment classifier), pass the same text through the pretrained LLM (e.g. BERT) whose “feature-linearity” you’re trying to measure, and try to predict the sentiment from BERT’s activation vectors using a linear model such as logistic regression. The held-out accuracy of this linear model tells you how linear the “sentiment” feature is in your LLM.
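Here is a minimal sketch of that probing procedure, in case it helps make the idea concrete. It's not from the Discord answer: the function name `linear_probe_accuracy`, the hyperparameters, and the synthetic Gaussian "activations" (standing in for real BERT activations) are all my own assumptions, and I've used plain logistic regression fit by gradient descent as the linear classifier.

```python
import numpy as np

def linear_probe_accuracy(activations, labels, train_frac=0.8,
                          steps=500, lr=0.5, seed=0):
    """Fit a logistic-regression probe and return held-out accuracy.

    activations: (n_examples, n_dims) array of network activations.
    labels: (n_examples,) array of 0/1 labels for the feature.
    All names and defaults here are hypothetical, for illustration.
    """
    rng = np.random.default_rng(seed)
    n = activations.shape[0]
    order = rng.permutation(n)
    n_train = int(train_frac * n)
    train, test = order[:n_train], order[n_train:]

    # Standardize using training-set statistics only.
    mean = activations[train].mean(axis=0)
    std = activations[train].std(axis=0) + 1e-8
    x_train = (activations[train] - mean) / std
    x_test = (activations[test] - mean) / std
    y_train = labels[train].astype(float)

    # Logistic regression trained with full-batch gradient descent.
    w = np.zeros(x_train.shape[1])
    b = 0.0
    for _ in range(steps):
        logits = np.clip(x_train @ w + b, -30.0, 30.0)
        probs = 1.0 / (1.0 + np.exp(-logits))
        grad = probs - y_train
        w -= lr * (x_train.T @ grad) / n_train
        b -= lr * grad.mean()

    # The probe's accuracy on unseen examples is the "linearity" score.
    preds = (x_test @ w + b) > 0
    return float((preds == labels[test].astype(bool)).mean())

# Synthetic stand-in for activations (assumption: not real BERT outputs).
rng = np.random.default_rng(0)
acts = rng.normal(size=(2000, 16))

# A feature defined by a single direction in activation space.
direction = rng.normal(size=16)
linear_labels = (acts @ direction > 0).astype(int)

# An XOR-like feature that no single hyperplane separates.
xor_labels = (acts[:, 0] * acts[:, 1] > 0).astype(int)

linear_acc = linear_probe_accuracy(acts, linear_labels)
xor_acc = linear_probe_accuracy(acts, xor_labels)
print(f"linear feature probe accuracy: {linear_acc:.2f}")
print(f"XOR-like feature probe accuracy: {xor_acc:.2f}")
```

To try this on a real model you would swap `acts` for activations extracted from some layer of the LLM and `linear_labels` for the sentiment labels. The probe should score high on the first feature and close to chance on the XOR-like one, which is the contrast the "linearity" measure is meant to capture.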