The dream is stuff like:

Is this thingy an agent trying to optimize for stuff, whatever that means?
Is there some general purpose search module that it uses to aim at things to solve problems and subproblems?
Here’s a pile of matrices; now print the concepts that the model is using internally to model the environment. In, like, a way that makes sense to a human.
Was one of those concepts something like the thing humans want? How about a psychological model of how to get thumbs ups on ratings? Is the search aimed at the second one?
How about corrigibility, if that’s even a natural concept at all? While you’re at it, maybe try seeing if the structure of the thing that looks like corrigibility might teach you something about wtf corrigibility actually is.
etc.
Those would indeed be good. In the two years since I made that comment, I’ve worked on, and made progress on, one ambitious interp direction: self-supervised internal steering. The idea is to “amplify” honesty or corrigibility without labels and without relying on outputs. It might even target deeper concepts, though so far it appears to intervene more at the behaviour level.
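To make "steering without labels" concrete, here is a minimal toy sketch of one way such an intervention can work. This is my own illustrative assumption, not the actual method from the comment above: the steering vector is found self-supervised, as the top principal component of a layer's own activations over unlabeled inputs, and is then added to the hidden states at inference time. All names and dimensions are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a transformer layer's residual stream. We steer by
# adding a scaled direction to the hidden states. The direction is found
# self-supervised -- no labels, no output-based reward -- as the top
# principal component of the layer's activations on unlabeled inputs.

def top_principal_direction(acts: np.ndarray) -> np.ndarray:
    """Unit-norm first principal component of (n_samples, d) activations."""
    centered = acts - acts.mean(axis=0)
    # SVD of the centered data: rows of vt are principal directions.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[0]

def steer(hidden: np.ndarray, direction: np.ndarray, alpha: float) -> np.ndarray:
    """The intervention: add alpha * direction to every hidden state."""
    return hidden + alpha * direction

# Unlabeled activations from some layer (hypothetical 16-dim toy model),
# with one high-variance axis standing in for a salient internal concept.
acts = rng.normal(size=(256, 16))
acts[:, 3] += rng.normal(scale=3.0, size=256)

v = top_principal_direction(acts)    # self-supervised steering vector
steered = steer(acts, v, alpha=5.0)  # amplify that concept at inference

# The projection onto v shifts by exactly alpha; other directions are
# untouched, so the edit is targeted rather than a global perturbation.
shift = (steered - acts) @ v
print(float(np.round(shift.mean(), 3)))
```

The point of the sketch is only the shape of the pipeline: a direction derived from the model's own internals, with no labeled honest/dishonest examples and no feedback from outputs. Whether such a direction tracks a deep concept like honesty, or merely a surface behaviour, is exactly the open question noted above.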
My feeling is that interp is held back because researchers aren’t insisting on hard, meaningful metrics and evals: for example, doing the things you described, and doing them out of distribution, without labels. This is very hard, but so is the actual alignment challenge.