Thanks, that’s a very helpful way of putting it!
Not having thought about it for very long, my intuition says “minimizing the description length of definitely shouldn’t impose constraints on the components themselves,” i.e. “Alice has no use for the rank-1 attributions.” But I can see why it would be nice to find a way for Alice to want that information, and you probably have deeper intuitions for this.
Are you referring to Anthropic’s circuit tracing paper here? If so, I don’t recall seeing results that demonstrate it *isn’t* thinking about predicting what a helpful AI would say. Although I haven’t followed up on this beyond the original paper.