kave comments on AI for AI safety

kave 15 Mar 2025 19:47 UTC
LW: 28 AF: 14
You claim (and I agree) that option control will probably not be viable at extreme intelligence levels. But I also notice that when you list ways that AI systems help with alignment, all but one (maybe two), as I count it, are option control interventions.
evaluating AI outputs during training, labeling neurons in the context of mechanistic interpretability, monitoring AI chains of thought for reward-hacking behaviors, identifying which transcripts in an experiment contain alignment-faking behaviors, classifying problematic inputs and outputs for the purpose of preventing jailbreaks
I think “labeling neurons” isn’t option control. Detecting alignment-faking also seems marginal; maybe it’s more basic science than option control.
I think mech interp is proving to be pretty difficult, in a similar way to human neuroscience. My guess is that even if we can characterise the low-level behaviour of all neurons and small circuits, we’ll be really stuck with trying to figure out how the AI minds work, and even more stuck trying to turn that knowledge into safe mind design, and even more even more stuck trying to turn that knowledge differentially into safe mind design vs capable mind design.
Will we be able to get AIs to help us with this higher-level task as well? That is, the task of putting all the data and experiments together and coming up with a theory that explains how these minds behave. I think they probably can if and only if they could do the same for human neuroscience. And my weak guess is that, if there’s a substantial sweet spot, they will be able to do the same for human neuroscience.
But I’m not sure how well we’ll be able to tell that they have given us a correct theory. They will produce some theory of how the brain or a machine mind works, and I don’t know (genuinely don’t know) whether we will be able to tell if it’s a subtly wrong theory. It does seem pretty hard to produce a small theory that makes a bunch of correct empirical predictions but has some (intentional or unintentional) error that is a vector for loss of control. So maybe reality will come in clutch with some extra option control at the critical time.
Your taxonomies of the space of worries and orientations to this question are really good, and I think they capture my concerns above well. But I wanted to spell out my specific concerns because things will succeed or fail for specific reasons.