Possibilities I see:

Maybe the cost can be amortized over the whole circuit? Use one bit per circuit to say “this is just and/or” vs “use all gates”.
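To make the amortization concrete, here is a rough sketch of how the bit count might work out. It assumes (my assumption, not something fixed above) that "all gates" means the 16 two-input Boolean functions, and it counts only the bits needed to name gate types, ignoring wiring. The function and variable names are hypothetical.

```python
import math

# Assumption: the full gate set is all 16 two-input Boolean functions,
# and the restricted set is just {and, or}.
FULL_GATE_SET_SIZE = 16
RESTRICTED = {"and", "or"}


def gate_bits(num_choices: int) -> int:
    """Bits needed to pick one gate type out of num_choices options."""
    return math.ceil(math.log2(num_choices))


def circuit_description_bits(gates: list[str]) -> int:
    """One flag bit per circuit, plus a per-gate cost set by that flag.

    Only gate-type labels are counted; wiring is ignored in this sketch.
    """
    flag = 1
    if all(g in RESTRICTED for g in gates):
        return flag + len(gates) * gate_bits(len(RESTRICTED))
    return flag + len(gates) * gate_bits(FULL_GATE_SET_SIZE)


and_or_circuit = ["and", "or", "and", "and"]
mixed_circuit = ["and", "xor", "or", "and"]

print(circuit_description_bits(and_or_circuit))  # 1 flag bit + 4 gates * 1 bit
print(circuit_description_bits(mixed_circuit))   # 1 flag bit + 4 gates * 4 bits
```

The flag costs a single bit no matter how large the circuit is, so its share of the total shrinks as the circuit grows.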

This is an illustrative, simplified example. In a more general scheme, you need to specify a coding scheme, which is equivalent to specifying a prior over the possible things you might see.
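A small sketch of that equivalence, assuming the standard Shannon code-length correspondence (an outcome with probability p gets a code of about -log2 p bits, and a code of length l implies weight 2^-l). The prior over gate types here is made up purely for illustration.

```python
import math


def code_lengths(prior: dict[str, float]) -> dict[str, int]:
    """Shannon code lengths: ceil(-log2 p) bits for each outcome."""
    return {x: math.ceil(-math.log2(p)) for x, p in prior.items()}


def implied_prior(lengths: dict[str, int]) -> dict[str, float]:
    """Going the other way: code lengths l(x) imply weights 2^-l(x)."""
    return {x: 2.0 ** -l for x, l in lengths.items()}


# Hypothetical prior over gate types, chosen so the probabilities are powers of 2.
prior = {"and": 0.5, "or": 0.25, "xor": 0.125, "nand": 0.125}

lengths = code_lengths(prior)
# Kraft sum: at most 1 for any prefix-free code with these lengths.
kraft_sum = sum(2.0 ** -l for l in lengths.values())

print(lengths)
print(kraft_sum)
print(implied_prior(lengths) == prior)
```

Picking shorter codes for some outcomes is the same as saying you expect to see them more often, which is why choosing the coding scheme and choosing the prior are one decision.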

Totally agree that this has gorgeous results, and it is what got me into mech interp in the first place! Did you forget to update the “most (only?) truly rigorous reverse-engineering work out there” claim, though? For example, the clock and pizza paper seems comparably rigorous. There’s also my recent Compact Proofs of Model Performance via Mechanistic Interpretability, and the work one of my MARS scholars did showing that some pizza models use a ReLU to compute numerical integration. That is the first nontrivial mechanistic explanation of a nonlinearity found in a trained model, nontrivial in the sense that it asymptotically compresses the brute-force input-output behavior with a (provably) non-vacuous bound.