LEAst-squares Concept Erasure (LEACE)

“Ever wanted to mindwipe an LLM?

Our method, LEAst-squares Concept Erasure (LEACE), provably erases all linearly-encoded information about a concept from neural net activations. It does so surgically, inflicting minimal damage to other concepts.

...

LEACE has a closed-form solution that fits on a T-shirt. This makes it orders of magnitude faster than popular concept erasure methods like INLP and R-LACE, which require gradient-based optimization. And the solution can be efficiently updated to accommodate new data.”
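To make the "closed-form solution" claim concrete, here is a minimal NumPy sketch of the eraser as I understand it from the LEACE paper: r(x) = x - W⁺ P W (x - E[x]), where W is the whitening matrix (the pseudoinverse of the square root of the covariance of X) and P is the orthogonal projector onto the column space of W Σ_XZ. The function names `fit_leace` and `erase` are hypothetical, chosen for illustration; this is not the authors' implementation, and it computes batch statistics only, without the incremental updates mentioned in the quote.

```python
import numpy as np


def fit_leace(X, Z):
    """Fit a LEACE-style eraser from activations X (n, d) and concept labels Z (n, k).

    Returns (mu_x, A) such that the erased activations are x - A @ (x - mu_x).
    Hypothetical helper; a sketch of the closed form, not the reference code.
    """
    n = X.shape[0]
    mu_x = X.mean(axis=0)
    Xc = X - mu_x
    Zc = Z - Z.mean(axis=0)

    sigma_xx = Xc.T @ Xc / n  # covariance of X, shape (d, d)
    sigma_xz = Xc.T @ Zc / n  # cross-covariance of X and Z, shape (d, k)
    eps = np.finfo(sigma_xx.dtype).eps

    # Whitening matrix W = (sigma_xx^{1/2})^+ and its pseudoinverse,
    # built from the eigendecomposition of the symmetric PSD covariance.
    eigvals, eigvecs = np.linalg.eigh(sigma_xx)
    keep = eigvals > eigvals.max() * sigma_xx.shape[0] * eps
    V = eigvecs[:, keep]
    s = np.sqrt(eigvals[keep])
    W = (V / s) @ V.T        # V diag(1/s) V^T
    W_pinv = (V * s) @ V.T   # V diag(s) V^T

    # Orthogonal projector onto the column space of W @ sigma_xz.
    M = W @ sigma_xz
    U, sv, _ = np.linalg.svd(M, full_matrices=False)
    rank = int((sv > sv.max() * max(M.shape) * eps).sum())
    U = U[:, :rank]
    P = U @ U.T

    A = W_pinv @ P @ W
    return mu_x, A


def erase(X, mu_x, A):
    """Apply the fitted eraser to row-vector activations X (n, d)."""
    return X - (X - mu_x) @ A.T
```

After `mu_x, A = fit_leace(X, Z)`, the sample cross-covariance between `erase(X, mu_x, A)` and `Z` should be zero up to floating-point error, which is the condition under which no linear predictor can do better than a constant at recovering the concept. Everything here is a handful of matrix operations on two covariance matrices, with no gradient-based optimization loop.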