Of all the possible natural latents present in a dataset, which ones should we expect a sufficiently advanced AI system to learn? This matters because any dataset seems to contain a massive number of natural latents that a system could in principle learn.
Say we’ve got a collection of N dogs. The natural latent “dog” satisfies the redundancy & independence conditions over every dog in the collection. But I could also consider the powerset of those N dogs. For each element of the powerset, the dogs in that subset share the information in the natural latent “dog”, plus whatever additional properties all members of that subset happen to share. For instance, perhaps I’m considering the subset of dogs with three legs and grey fur—then the relevant natural latent is “dog + three legs + grey”. So there’s clearly a huge number of distinct “composite” natural latents that would work for this collection of dogs.
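To make the combinatorics concrete, here’s a toy sketch (hypothetical property sets standing in for the actual redundancy/independence machinery, which this deliberately does not model): treat each dog as a set of properties, and treat the “composite latent” for a subset of dogs as the intersection of their properties. Even four dogs generate many distinct composite latents, and the count grows rapidly with N.

```python
from itertools import combinations

# Hypothetical toy data: each dog is a frozenset of properties.
# "dog" is shared by all; the other properties vary.
dogs = [
    frozenset({"dog", "three_legs", "grey"}),
    frozenset({"dog", "three_legs", "brown"}),
    frozenset({"dog", "four_legs", "grey"}),
    frozenset({"dog", "four_legs", "black"}),
]

# For each nonempty subset of dogs, the candidate "composite latent"
# is the information all members share: here, the intersection of
# their property sets.
composite_latents = set()
for r in range(1, len(dogs) + 1):
    for subset in combinations(dogs, r):
        composite_latents.add(frozenset.intersection(*subset))

# Every composite latent contains "dog", plus whatever extra
# properties that particular subset happens to share.
print(len(composite_latents))  # 8 distinct latents from just 4 dogs
for latent in sorted(composite_latents, key=len, reverse=True):
    print(sorted(latent))
```

With real-world objects, whose property sets are far richer, nearly every subset picks out its own distinct composite latent, so the number of candidates scales roughly with the 2^N subsets.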
The problem is that it’s intractable even for a very general AI system to learn ALL of them. So we need some principled criterion for figuring out which ones it in fact learns.
Have you considered running this on a dataset of “autonomous weapons activity”? Although Anthropic might feel comfortable with this right now, if it did induce significant emergent misalignment, that might be good reason to avoid any fine-tuning for autonomous weapons use.