To elaborate a bit (but you should go read the posts!) the classic Most Forbidden Technique scenario goes like this: you detect “bad thoughts” in the CoT/activations, you penalize them, and the model learns to have the same bad thoughts in ways your detector can’t see.
Goodfire seems to be doing something structurally different, by decomposing the gradient into semantic components and choosing which components to apply. Does that really make the difference, though? I don’t know!
It would make the difference if and only if you knew, and could correctly interpret, all the semantic components.
Which you can’t, so no, it doesn’t make the difference. The selection pressure just goes into the parts that you don’t know about or don’t completely understand.
In general, it’s Most Forbidden specifically because it trains away your ability to interpret.
Thank you—this matches my current thinking.
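To make the concern concrete, here is a minimal, hypothetical sketch of what “decompose the gradient into semantic components and choose which components to apply” could look like. This is my own illustration, not Goodfire’s actual method, and it assumes you have a fixed dictionary of orthonormal, supposedly interpretable directions (something SAE-like); the function name `filtered_gradient` and all the details are illustrative assumptions.

```python
# Hypothetical sketch of selective gradient application over a dictionary of
# "semantic" directions. Illustration only, not Goodfire's implementation.

import numpy as np

def filtered_gradient(grad, directions, allowed):
    """Split `grad` over a dictionary of directions and keep approved parts.

    grad:       (d,) raw gradient for some parameter block
    directions: (k, d) orthonormal rows, each assumed to be an interpretable
                "semantic component" (the assumption doing all the work here)
    allowed:    (k,) boolean mask of components we choose to train on
    Returns (kept, residual): the gradient we apply, and the part no known
    direction explains.
    """
    coeffs = directions @ grad                # size of grad along each known direction
    kept = directions.T @ (coeffs * allowed)  # keep only the approved components
    residual = grad - directions.T @ coeffs   # the part the dictionary cannot explain
    return kept, residual

# Toy usage: 3 known directions in a 5-dim space, one of them disallowed.
rng = np.random.default_rng(0)
directions = np.linalg.qr(rng.normal(size=(5, 3)))[0].T  # (3, 5), orthonormal rows
grad = rng.normal(size=5)
allowed = np.array([True, False, True])

kept, residual = filtered_gradient(grad, directions, allowed)

# The worry above lives in `residual` (and in any direction you misinterpret):
# whether you apply it or drop it, it is optimization pressure you cannot see,
# so the same "bad" behavior can be learned through exactly those parts.
print(np.linalg.norm(kept), np.linalg.norm(residual))
```

The objection in the comment maps onto the `residual` term and onto any misinterpreted direction: the selective filter only protects you over the span you correctly understand.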
This isn’t a strong argument, but that (and the Goodfire essay describing their approach) strongly reminds me of a really clever design for a perpetual motion machine. I don’t see any description of a technique like soft optimization (https://www.lesswrong.com/posts/9fL22eBJMtyCLvL7j/soft-optimization-makes-the-value-target-bigger), or a similar technique that acknowledges catastrophic Goodhart (https://www.lesswrong.com/posts/fuSaKr6t6Zuh6GKaQ/when-is-goodhart-catastrophic) and Garrabrant’s Goodhart taxonomy, and works within those restrictions.