Of all the possible natural latents present in a dataset, which ones should we expect a sufficiently advanced AI system to learn? This matters because any dataset seems to contain a massive number of natural latents that a system could in principle learn.
Say we’ve got a collection of N dogs. The natural latent “dog” satisfies the redundancy & independence conditions over every dog in the collection. But I could also consider the powerset of those N dogs. For each element of the powerset, the dogs in that subset share the information in the natural latent “dog”, plus whatever additional properties all members of that subset happen to share. For instance, perhaps I’m considering the subset of dogs with three legs and grey fur—then the relevant natural latent is “dog + three legs + grey”. So there’s clearly a huge number of distinct “composite” natural latents that would work for this collection of dogs.
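To make the combinatorics concrete, here’s a toy sketch (hypothetical property sets standing in for the actual redundancy/independence machinery, which this deliberately does not model): treat each dog as a set of properties, and treat the “composite latent” for a subset of dogs as the intersection of their properties. Even four dogs generate many distinct composite latents, and the count grows rapidly with N.

```python
from itertools import combinations

# Hypothetical toy data: each dog is a frozenset of properties.
# "dog" is shared by all; the other properties vary.
dogs = [
    frozenset({"dog", "three_legs", "grey"}),
    frozenset({"dog", "three_legs", "brown"}),
    frozenset({"dog", "four_legs", "grey"}),
    frozenset({"dog", "four_legs", "black"}),
]

# For each nonempty subset of dogs, the candidate "composite latent"
# is the information all members share: here, the intersection of
# their property sets.
composite_latents = set()
for r in range(1, len(dogs) + 1):
    for subset in combinations(dogs, r):
        composite_latents.add(frozenset.intersection(*subset))

# Every composite latent contains "dog", plus whatever extra
# properties that particular subset happens to share.
print(len(composite_latents))  # 8 distinct latents from just 4 dogs
for latent in sorted(composite_latents, key=len, reverse=True):
    print(sorted(latent))
```

With real-world objects, whose property sets are far richer, nearly every subset picks out its own distinct composite latent, so the number of candidates scales roughly with the 2^N subsets.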
The problem is that it’s intractable even for a very general AI system to learn ALL of them. So we need some principled criterion for figuring out which ones it in fact learns.
Have you considered running this on a dataset of “autonomous weapons activity”? Although Anthropic might feel comfortable with this right now, if it did induce significant emergent misalignment, that might be good reason to avoid any fine-tuning for autonomous weapons use.