decomposing the gradient into semantic components and choosing which components to apply.
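To make the idea concrete, here's a minimal sketch of what "decomposing the gradient into semantic components and choosing which to apply" could look like; the "semantic" directions and the filtering rule are my own illustrative assumptions, not Goodfire's actual method:

```python
# Hedged sketch: project a gradient onto assumed "semantic" directions,
# then rebuild it from only a chosen subset of those components.
import numpy as np

def filtered_gradient(grad, semantic_dirs, keep):
    """Decompose `grad` along unit vectors in `semantic_dirs` and
    reconstruct it using only the component indices in `keep`."""
    dirs = np.asarray(semantic_dirs, dtype=float)
    dirs = dirs / np.linalg.norm(dirs, axis=1, keepdims=True)  # normalize directions
    coeffs = dirs @ grad                                       # component along each direction
    mask = np.zeros(len(dirs))
    mask[list(keep)] = 1.0                                      # keep only selected components
    return (coeffs * mask) @ dirs                               # recombine the kept components

grad = np.array([0.8, -0.3, 0.5])
dirs = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]       # toy orthonormal "semantic" basis
print(filtered_gradient(grad, dirs, keep={0, 2}))  # drops the second component
```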
This isn’t a strong argument, but that (and the Goodfire essay describing their approach) strongly reminds me of a really clever design for a perpetual motion machine. I don’t see any description of a technique like soft optimization (https://www.lesswrong.com/posts/9fL22eBJMtyCLvL7j/soft-optimization-makes-the-value-target-bigger), or a similar technique that acknowledges catastrophic Goodhart (https://www.lesswrong.com/posts/fuSaKr6t6Zuh6GKaQ/when-is-goodhart-catastrophic) and Garrabrant’s Goodhart taxonomy, and works within those restrictions.