Is there continuing work in this direction? It seems like a thread worth following.
Thanks for the interest! We are actively pursuing follow-up work with AST-inspired approaches to cooperation.
Great! I’m looking forward to seeing what comes out of that. There are a few aspects of this work that aren’t cooperation-related but that I’m very curious about (I’ve been quite interested recently in self-modeling in general and introspection on activations in particular).
A few specific things seem valuable here, and I wonder whether you’re planning to investigate them:
It seems like one important part of the value proposition here is easier interpretability, and I wonder whether you’re thinking about investigating that directly using existing mech interp techniques.
One interpretability question I’ve been quite curious about (separately from the overall regularization) is what the circuitry for self-prediction ends up looking like. I can imagine, on the one hand, there being a fairly direct link from a neuron being predicted to the prediction output. On the other hand, I could imagine that not being true at all, and that instead the self-prediction circuitry has learned to duplicate what the original neuron is doing (presumably as an approximation of the function from input to activation strength). Or the model could be doing something different from either of these. (I’ve tried to make the first two possibilities concrete in a toy sketch further down in this comment.)
Presumably you could accomplish at least some of the same effects with fine-tuning, and I wonder whether you’re likely to try that with LLMs.
I’m also curious whether you’ve released code (I don’t see it linked from the paper) and/or whether you’d potentially be open to sharing it.
Of course, feel free to keep any or all of this private! Or reach out by private message and we can talk there, or set up a chat, if you’re interested.
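Since the circuitry question is the one I’m most curious about, here is a toy sketch of the kind of setup I have in mind, just to make the two possibilities (and what “learning the identity matrix” would mean) concrete. To be clear, this is my own minimal reconstruction for discussion, not your architecture: the layer sizes, where the self-prediction head reads from, the MSE objective, and the gradient handling are all assumptions on my part.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ToySelfModelingNet(nn.Module):
    """Toy network with an auxiliary head trained to predict its own hidden
    activations. My sketch for discussion, not the paper's setup: layer sizes,
    where the head reads from, and the MSE objective are assumptions."""

    def __init__(self, in_dim=32, hidden_dim=64, out_dim=10, head_reads_hidden=True):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, hidden_dim), nn.ReLU())
        self.classifier = nn.Linear(hidden_dim, out_dim)
        # Where the self-prediction head reads from is the crux of the question.
        self.head_reads_hidden = head_reads_hidden
        src_dim = hidden_dim if head_reads_hidden else in_dim
        self.self_pred = nn.Linear(src_dim, hidden_dim)

    def forward(self, x):
        h = self.encoder(x)                 # the activations being self-predicted
        logits = self.classifier(h)
        if self.head_reads_hidden:
            # "Direct link" hypothesis: the head sees h itself, so a weight
            # matrix near the identity solves the auxiliary task trivially.
            h_hat = self.self_pred(h)
        else:
            # "Duplicated circuitry" hypothesis: the head only sees upstream
            # features (here, the raw input), so it must re-approximate the
            # input -> activation function that the encoder computes.
            h_hat = self.self_pred(x)
        # No detach on h: as I understand it, letting gradients reshape the
        # predicted activations themselves is where the regularization effect
        # would come from (another assumption on my part).
        aux_loss = F.mse_loss(h_hat, h)
        return logits, aux_loss


# Example usage (hypothetical shapes and loss weighting):
# model = ToySelfModelingNet()
# x, y = torch.randn(8, 32), torch.randint(0, 10, (8,))
# logits, aux = model(x)
# loss = F.cross_entropy(logits, y) + 0.1 * aux
```

In this framing, the “direct link” case corresponds to the self-prediction weights sitting near the identity, while the “duplicated circuitry” case corresponds to the head (or, in a deeper network, the circuitry feeding it) re-deriving the activation from upstream features. In a real network both routes are available to gradient descent at once, which is part of why I’d love to see the learned weights probed directly.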
Oh, and a question. I say above: “I can imagine, on the one hand, there being a fairly direct link from a neuron being predicted to the prediction output.”
Is something like this what you were getting at in other comments with respect to the model doing self-prediction by learning the identity matrix? That wasn’t quite clear to me.