Would you say a similar critique holds for sparse autoencoders?
(edit: i’ve tended to think of SAEs and AOs as basically end-to-end tools for activation-space interpretability, but in hindsight i see AOs are definitely trying to be more “lines go up” and end-to-end than SAEs, even if there are many loss function variants for SAEs. i think i get your point now)
i think SAEs are a completely reasonable thing under the first worldview, and mostly crazy under the second worldview (with the exception of maybe bio or something where I’ve heard they’re genuinely useful)
(SAEs are not sufficient to actually understand things, but they are a genuine step on the way there)
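(for context on the "loss function variants" mentioned above: SAE objectives typically combine a reconstruction term with some sparsity penalty, and the variants mostly differ in that penalty. a minimal sketch of the common reconstruction-plus-L1 version, with all dimensions and weights hypothetical stand-ins for a trained model:)

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy dimensions: d_model = residual-stream width,
# d_sae = (overcomplete) dictionary size.
d_model, d_sae = 16, 64

# Randomly initialized weights stand in for trained SAE parameters.
W_enc = rng.normal(scale=0.1, size=(d_model, d_sae))
b_enc = np.zeros(d_sae)
W_dec = rng.normal(scale=0.1, size=(d_sae, d_model))
b_dec = np.zeros(d_model)

def sae_forward(x):
    """Encode activations into (hopefully sparse) features, then reconstruct."""
    f = np.maximum(x @ W_enc + b_enc, 0.0)  # ReLU feature activations
    x_hat = f @ W_dec + b_dec               # linear reconstruction
    return f, x_hat

def sae_loss(x, l1_coeff=1e-3):
    """Standard SAE objective: reconstruction MSE plus an L1 sparsity penalty."""
    f, x_hat = sae_forward(x)
    mse = np.mean((x - x_hat) ** 2)
    sparsity = l1_coeff * np.abs(f).sum(axis=-1).mean()
    return mse + sparsity

# A batch of fake residual-stream activations, just to exercise the code.
x = rng.normal(size=(8, d_model))
loss = sae_loss(x)
```

(variants swap the L1 term for e.g. a top-k constraint or other sparsity proxies, but the reconstruction-plus-penalty shape is the common core)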