the difference between activation sparsity, circuit sparsity, and weight sparsity
activation sparsity enforces that features activate sparsely—every feature activates only occasionally.
circuit sparsity enforces that the connections between features are sparse—most features are not connected to most other features.
weight sparsity enforces that most of the weights are zero. weight sparsity naturally implies circuit sparsity if we interpret the neurons and residual channels of the resulting model as the features.
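a minimal numpy sketch of the three notions (the layer sizes, the 90% pruning threshold, and all variable names are illustrative, not from any particular paper): weight sparsity is a property of the weight matrix, circuit sparsity falls out of its nonzero pattern when we read neurons as features, and activation sparsity is a property of the activations on a given input.

```python
import numpy as np

rng = np.random.default_rng(0)

# Dense weight matrix between two feature layers (8 inputs -> 8 outputs).
W = rng.normal(size=(8, 8))

# Impose weight sparsity: zero out all but the largest ~10% of weights
# by magnitude (the 10% figure is an arbitrary illustrative choice).
k = int(0.9 * W.size)
threshold = np.sort(np.abs(W), axis=None)[k - 1]
W_sparse = np.where(np.abs(W) > threshold, W, 0.0)

weight_sparsity = np.mean(W_sparse == 0)  # fraction of zero weights

# Circuit sparsity follows directly: input feature j connects to output
# feature i only where W_sparse[i, j] != 0, so once most weights are
# zero, most feature pairs have no connection at all.
connected_pairs = np.count_nonzero(W_sparse)

# Activation sparsity is a separate property, measured on activations:
# the fraction of features that do NOT fire on a given input (here,
# a ReLU over the sparse layer).
x = rng.normal(size=8)
acts = np.maximum(W_sparse @ x, 0.0)
activation_sparsity = np.mean(acts == 0)
```

note that weight sparsity gives a hard guarantee about the connection pattern, while activation sparsity says nothing about connections: a densely wired layer can still have sparse activations.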
weight sparsity is not the only way to enforce circuit sparsity—for example, Jacobian SAEs also attempt to enforce circuit sparsity. the big advantage of weight sparsity is that it guarantees the interactions are sparse and have no interference weights. unfortunately, it comes at a terrible cost: the resulting models are very expensive to train.
although in some sense the circuit sparsity paper is an interpretable pretraining paper, this is not the framing I’m most excited about. if anything, I think of interpretable pretraining as a downside of our approach, one we put up with because it makes the circuits really clean.