[Erik] uses the abstraction diagram from the machine's state into S, which I am thinking of as a general human interpretation language. He is also trying to discover T_S jointly with F, if I understand correctly.
Yep, I want to jointly find both maps. I don't necessarily think of S as a human-interpretable format; that's one potential direction, but I'm also very interested in applications with a non-interpretable S, e.g. to mechanistic anomaly detection.
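To make the joint search concrete, here is a minimal sketch of what learning both maps together could look like, assuming the machine dynamics are known and differentiable and using a simple commutativity (consistency) loss. The names `tau_M`, `F`, and `T_S`, the dimensions, and the MSE objective are all illustrative assumptions, not the actual setup.

```python
# Minimal sketch (assumptions, not the actual setup): jointly learn an
# abstraction map F: S_M -> S and abstract dynamics T_S: S -> S by making
# the abstraction diagram commute.
import torch
import torch.nn as nn

machine_dim, abstract_dim = 128, 16
F = nn.Sequential(nn.Linear(machine_dim, 64), nn.ReLU(), nn.Linear(64, abstract_dim))
T_S = nn.Linear(abstract_dim, abstract_dim)

def tau_M(s):
    # Stand-in for the machine's known dynamics; assumed given, not learned.
    return torch.roll(s, shifts=1, dims=-1)

opt = torch.optim.Adam([*F.parameters(), *T_S.parameters()], lr=1e-3)

for step in range(1000):
    s = torch.randn(32, machine_dim)  # batch of machine states
    # Consistency loss: abstract-then-step should equal step-then-abstract.
    loss = ((T_S(F(s)) - F(tau_M(s))) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```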
But then I can achieve zero loss by learning an F that maps every state in S_M to the uniform probability distribution over S_H. [...] So why don't we try a different method to prevent steganography: reversing the directions of F and G (and keeping the probabilistic modification)?
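The degenerate solution is easy to verify numerically. In this hedged illustration (the KL consistency loss and the sizes are assumptions), a constant F that ignores its input and always outputs the uniform distribution, paired with identity abstract dynamics, gets exactly zero loss:

```python
# Illustration of the failure mode above: a constant F outputting the uniform
# distribution over S_H, with identity abstract dynamics, gives zero loss.
import torch

n_abstract = 8

def F_constant(s):
    # Ignores the machine state entirely.
    return torch.full((s.shape[0], n_abstract), 1.0 / n_abstract)

def T_S_identity(p):
    return p

s = torch.randn(32, 128)                  # machine states
s_next = torch.randn(32, 128)             # their successors under tau_M
p_pred = T_S_identity(F_constant(s))      # step in abstract space
p_true = F_constant(s_next)               # abstract the true next state
kl = (p_true * (p_true / p_pred).log()).sum(-1).mean()
print(kl.item())  # 0.0: zero loss, yet F carries no information about s
```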
FWIW, the way I hope to deal with the issue of the trivial constant abstraction you mention here is to have some piece of information that we intrinsically care about, and then enforce that this information isn’t thrown away by the abstraction. For example, perhaps you want the abstraction to at least correctly predict that the model gets low loss on a certain input distribution. There’s probably a difference in philosophy at work here: I want the abstraction to be as small as possible while still being useful, whereas you seem to aim at translating to or from the entire human ontology, so throwing away information is undesirable.
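One way that constraint could be implemented, as a rough sketch rather than the actual proposal: add an auxiliary head that must recover the protected piece of information from the abstract state alone, so a constant abstraction can no longer achieve low total loss. The helper `quantity_of_interest` (standing in for, say, the model's loss on some input distribution) and the weight `lambda_info` are hypothetical.

```python
# Hedged sketch of the fix described above: an auxiliary term forces the
# abstraction to retain one piece of information we intrinsically care about.
import torch
import torch.nn as nn

machine_dim, abstract_dim = 128, 16
F = nn.Sequential(nn.Linear(machine_dim, 64), nn.ReLU(), nn.Linear(64, abstract_dim))
T_S = nn.Linear(abstract_dim, abstract_dim)
head = nn.Linear(abstract_dim, 1)   # predicts the protected information from S

def tau_M(s):
    return torch.roll(s, shifts=1, dims=-1)   # stand-in machine dynamics

def quantity_of_interest(s):
    return s.norm(dim=-1, keepdim=True)       # stand-in for e.g. model loss

opt = torch.optim.Adam([*F.parameters(), *T_S.parameters(), *head.parameters()], lr=1e-3)
lambda_info = 1.0

for step in range(1000):
    s = torch.randn(32, machine_dim)
    consistency = ((T_S(F(s)) - F(tau_M(s))) ** 2).mean()
    # A constant F can no longer get zero total loss: the quantity of
    # interest is unrecoverable from a constant abstract state.
    info = ((head(F(s)) - quantity_of_interest(s)) ** 2).mean()
    opt.zero_grad()
    (consistency + lambda_info * info).backward()
    opt.step()
```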