This post resonated with me when it came out, and I think its thesis only seems more credible with time. Anthropic’s seminal “Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet” (the Golden Gate Claude paper) seems right in line with these ideas. We can make scrutable the inscrutable as long as the inscrutable takes the form of something organized and regular and repeatable.
This article gets bonus points for me for being succinct and while still making its argument clearly.
This post resonated with me when it came out, and I think its thesis only seems more credible with time. Anthropic’s seminal “Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet” (the Golden Gate Claude paper) seems right in line with these ideas. We can make scrutable the inscrutable as long as the inscrutable takes the form of something organized and regular and repeatable.
This article gets bonus points for me for being succinct and while still making its argument clearly.