Hi everyone!
My name is Robert Adragna, and I’ve been working with Dovetail’s winter fellowship cohort on Agent Foundations. Specifically, I’ve been trying to better understand what background assumptions the Natural Abstractions Hypothesis (NAH) makes about the world, and whether those assumptions might be learned by existing LLM systems. Specific questions I’m exploring include:
Is the Platonic Representation Hypothesis from deep learning evidence for the Natural Abstractions Hypothesis?
Is it possible to construct a dataset which represents the world in a completely unbiased way?
How can Natural Abstractions be both universal & observer/goal-dependent?
What would it take to empirically test the NAH?
I’ve been lurking on LessWrong since 2024, when I got interested in AI Safety, and am very excited to spend more time engaging with the community.
Of all the possible natural latents that exist in some dataset, which ones should we expect a sufficiently advanced AI system to learn? This matters because any dataset seems to contain a massive number of natural latents that a system could in principle learn.
Say we’ve got a collection of N dogs. The natural latent “dog” would satisfy the redundancy & independence assumptions for every dog. But I could also consider the powerset of this collection of N dogs. For each subset, all the dogs in it share the information present in the natural latent “dog”, plus whatever extra properties the members of that subset happen to share. For instance, perhaps I’m considering the subset of dogs with three legs and grey fur—then the relevant natural latent is “dog + three legs + grey”. There are clearly a huge number of different “composite” natural latents that would work for this collection of dogs.
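The combinatorics here can be made concrete with a toy sketch. Assuming (purely for illustration) that each dog is represented as a set of properties, each subset of the collection induces a “composite latent”: the properties shared by every member. The property names and the four-dog collection below are hypothetical:

```python
from itertools import combinations

# Toy model (all names hypothetical): each dog is a set of properties.
dogs = [
    {"dog", "four_legs", "brown_fur"},
    {"dog", "three_legs", "grey_fur"},
    {"dog", "three_legs", "grey_fur", "long_tail"},
    {"dog", "four_legs", "grey_fur"},
]

# For each nonempty subset of the collection, take the information shared
# by every member -- here, the intersection of their property sets.
composite_latents = set()
for k in range(1, len(dogs) + 1):
    for subset in combinations(dogs, k):
        shared = frozenset.intersection(*map(frozenset, subset))
        composite_latents.add(shared)

# There are 2^N - 1 nonempty subsets; many of them induce distinct
# composite latents, all of which contain the shared "dog" latent.
print(len(composite_latents))
```

Even with four dogs and a handful of properties, the subsets already yield several distinct composite latents, and the number of candidate subsets grows as 2^N.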
The problem is, it’s intractable for even a sufficiently general AI system to learn ALL of them. So we need some principled criterion to determine which ones it will in fact learn.