How do you know?
Sparse Autoencoders Find Highly Interpretable Directions in Language Models
Seems like it’d be better formatted as a nested list, given the volume of text.
Why would we expect the expected level of danger from a model of a certain size to rise as the set of potential solutions grows?
I think both Leap Labs and Apollo Research (both fairly new orgs) are trying to position themselves as offering model auditing services in the way you suggest.
A useful model for why it’s both appealing and difficult to say ‘Doomers and Realists are both against dangerous AI and for safety—let’s work together!’.
Try decomposing the residual stream activations over a batch of inputs somehow (e.g. PCA). Using the principal directions as activation addition directions, do they seem to capture something meaningful?
It’s not PCA but we’ve been using sparse coding to find important directions in activation space (see original sparse coding post, quantitative results, qualitative results).
We’ve found that they’re on average more interpretable than neurons, and I understand that @Logan Riggs and Julie Steele have found some effect using them as directions for activation patching, e.g. using a “this direction activates on curse words” direction to make text more aggressive. If people are interested in exploring this further, let me know: say hi at our EleutherAI channel or check out the repo :)
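For anyone who wants to try the quoted PCA version, here’s a minimal sketch (model, layer, and prompts are arbitrary placeholders of mine, and this isn’t our sparse coding setup):

```python
# Sketch: PCA over residual-stream activations, then treating a principal
# direction as a candidate activation-addition direction.
import torch
from sklearn.decomposition import PCA
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
LAYER = 6  # arbitrary choice of residual stream position

prompts = [
    "The cat sat on the mat",
    "I really enjoyed that film",
    "The economy shrank last quarter",
    "She slammed the door angrily",
]

acts = []
for p in prompts:
    ids = tok(p, return_tensors="pt").input_ids
    with torch.no_grad():
        hs = model(ids, output_hidden_states=True).hidden_states[LAYER]
    acts.append(hs[0])                    # (seq_len, d_model) activations
acts = torch.cat(acts).numpy()            # pool token positions across the batch

pca = PCA(n_components=10).fit(acts)
direction = torch.tensor(pca.components_[0])   # first principal direction

# To test whether the direction "captures something meaningful", add
# alpha * direction to the residual stream at this layer during generation
# (e.g. via a forward hook) and read off how the completions change.
```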
Hi, nice work! You mentioned the possibility of neurons being the wrong unit. I think that this is the case and that our current best guess for the right unit is directions in the output space, ie linear combinations of neurons.
We’ve done some work using dictionary learning to find these directions (see original post, recent results) and find that with sparse coding we can find dictionaries of features that are more interpretable than the neuron basis (though they don’t explain 100% of the variance).
We’d be really interested to see how this compares to neurons in a test like this and could get a sparse-coded breakdown of gpt2-small layer 6 if you’re interested.
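To give a sense of the dictionary learning setup, here’s a minimal sketch of the general idea using scikit-learn rather than our actual training code (the activation array and hyperparameters below are stand-ins):

```python
# Sketch of sparse coding over activations: learn an overcomplete dictionary and
# treat its atoms as candidate interpretable directions. `acts` is a stand-in
# for real MLP/residual activations; the dictionary ratio and alpha are guesses.
import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning

rng = np.random.default_rng(0)
acts = rng.standard_normal((2000, 64)).astype(np.float32)  # (n_samples, d_model)

d_model = acts.shape[1]
dict_size = 4 * d_model                      # overcomplete dictionary

learner = MiniBatchDictionaryLearning(
    n_components=dict_size,
    alpha=1.0,                               # sparsity penalty
    transform_algorithm="lasso_lars",
)
codes = learner.fit_transform(acts)          # sparse coefficients per sample
directions = learner.components_             # rows = candidate feature directions

# A feature's most-activating samples are the ones with the largest codes;
# those are what you'd read to judge whether the direction is interpretable.
top_for_feature_0 = np.argsort(codes[:, 0])[-10:]
```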
Link at the top doesn’t work for me
I still don’t quite see the connection—if it turns out that LLFC holds between different fine-tuned models to some degree, how will this help us interpolate between different simulacra?
Is the idea that we could fine-tune models to only instantiate certain kinds of behaviour and then use LLFC to interpolate between (and maybe even extrapolate between?) different kinds of behaviour?
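To make the question concrete, here’s a toy sketch of the kind of interpolation experiment I’m imagining (the helper and toy models are placeholders, not a claim about how LLFC is actually tested in the paper):

```python
# Toy sketch: linearly interpolate the weights of two fine-tuned copies of a
# base model and check how the behaviour of the intermediate models changes.
import copy
import torch

def interpolate_state_dicts(model_a, model_b, t):
    """Return a model whose weights are (1 - t) * A + t * B."""
    mixed = copy.deepcopy(model_a)
    sd_a, sd_b = model_a.state_dict(), model_b.state_dict()
    mixed.load_state_dict({k: (1 - t) * sd_a[k] + t * sd_b[k] for k in sd_a})
    return mixed

base = torch.nn.Sequential(torch.nn.Linear(8, 8), torch.nn.ReLU(), torch.nn.Linear(8, 2))
model_a, model_b = copy.deepcopy(base), copy.deepcopy(base)
# (imagine model_a and model_b were each fine-tuned on a different behaviour)

for t in (0.0, 0.5, 1.0, 1.5):   # t > 1 is the "extrapolate" case
    mixed = interpolate_state_dicts(model_a, model_b, t)
    # ...evaluate `mixed` on prompts probing each of the two behaviours...
```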
For the avoidance of doubt, this accounting should recursively aggregate transitive inputs.
What does this mean?
Importantly, this policy would naturally be highly specialized to a specific reward function. Naively, you can’t change the reward function and expect the policy to instantly adapt; instead you would have to retrain the network from scratch.
I don’t understand why standard RL algorithms in the basal ganglia wouldn’t work. Like, most RL problems have elements that can be viewed as homeostatic—if you’re playing boxcart then you need to go left/right depending on position. Why can’t that generalise to seeking food iff stomach is empty? Optimizing for a specific reward function doesn’t seem to preclude that function itself being a function of other things (which just makes it a more complex function).
What am I missing?
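To illustrate the point about the reward function itself being a function of other things, here’s a toy example (all names and numbers invented):

```python
# A single fixed reward function can still be "homeostatic" if the agent's
# internal state is part of what the function reads.
from dataclasses import dataclass

@dataclass
class State:
    stomach_fullness: float  # 0 = starving, 1 = full
    ate_food: bool           # did the agent just eat?

def reward(state: State) -> float:
    # Eating is rewarding only when the stomach is empty; the reward function
    # is fixed, but its value depends on the internal homeostatic variable.
    if state.ate_food:
        return 1.0 - state.stomach_fullness
    return 0.0

print(reward(State(stomach_fullness=0.1, ate_food=True)))  # ~0.9: hungry, ate
print(reward(State(stomach_fullness=0.9, ate_food=True)))  # ~0.1: full, ate
```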
AutoInterpretation Finds Sparse Coding Beats Alternatives
On first glance I thought this was too abstract to be a useful plan, but coming back to it I think it’s promising as a form of automated training for an aligned agent, given that you have an agent that is excellent at evaluating small logic chains, along the lines of Constitutional AI or training for consistency. You could have training loops using synthetic data that train for all of these forms of consistency, probably implementable in an MVP with current systems.
The main unknown would be detecting when you’re confident enough that its stated values are aligned with human values to start moving down the causal chain towards fitting actions to values, since that step is clearly strongly capabilities-enhancing.
Perhaps you could at least get a measure by looking at comparisons that require multiple steps (human value → value → belief, etc.) and asking which step is the bottleneck to reaching the conclusion that humans would want. Positing that the agent is capable of this might be assuming away a lot of the problem, though.
[Replication] Conjecture’s Sparse Coding in Small Transformers
[Replication] Conjecture’s Sparse Coding in Toy Models
Do you have a writeup of the other ways of performing these edits that you tried and why you chose the one you did?
In particular, I’m surprised by the chosen method of adding the activations, because the tokens of the different prompts don’t line up with each other in the way I would have thought was necessary for this approach to work; it’s super interesting to me that it does.
If I were to try to reinvent the system after just reading the first paragraph or two, I would have done something like the following (rough sketch below):
Take multiple pairs of prompts that differ primarily in the property we’re trying to capture.
Take the difference in the residual stream at the next token.
Take the average difference vector, and add that to every position in the new generated text.
I’d love to know which parts were chosen among many as the ones which worked best and which were just the first/only things tried.
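For reference, here’s a rough sketch of that difference-vector version (the model, layer, prompts, and hook details are all placeholder choices of mine, not the method from the post):

```python
# Sketch: average the residual-stream differences of contrasting prompt pairs,
# then add that vector at every position while generating.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
LAYER = 6  # arbitrary layer choice

def resid_at_last_token(prompt):
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        hs = model(ids, output_hidden_states=True).hidden_states
    return hs[LAYER][0, -1]                  # residual stream at the final position

# 1. Pairs of prompts differing mainly in the property of interest (toy examples).
pairs = [("I love you", "I hate you"), ("That was wonderful", "That was awful")]

# 2./3. Difference at the next-token position, averaged over pairs.
steer = torch.stack([resid_at_last_token(a) - resid_at_last_token(b)
                     for a, b in pairs]).mean(0)

# 4. Add the averaged vector at every position during generation, via a hook on
#    the block whose output matches hidden_states[LAYER].
def add_steer(module, inputs, output):
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + steer
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

handle = model.transformer.h[LAYER - 1].register_forward_hook(add_steer)
ids = tok("I think that you", return_tensors="pt").input_ids
print(tok.decode(model.generate(ids, max_new_tokens=20)[0]))
handle.remove()
```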
eedly → feedly
Yeah, I agree it’s not in human brains; I’m not really disagreeing with the bulk of the argument re brains, just about whether it does much to reduce foom %. Maybe it constrains the ultra-fast scenarios a bit, but not much more imo.
“Small” (ie << 6 OOM) jump in underlying brain function from current paradigm AI → Gigantic shift in tech frontier rate of change → Exotic tech becomes quickly reachable → YudFoom
Hi Charlie, yep it’s in the paper—but I should say that we did not find a working CUDA-compatible version and used the scikit version you mention. This meant that the data volumes used are somewhat limited—still on the order of a million examples but 10-50x less than went into the autoencoders.
It’s not clear whether the extra data would provide much signal, since ICA can’t learn an overcomplete basis and so has no way of learning rare features, but it might still outperform our ICA baseline presented here. So if you wanted to give someone the project of making a CUDA-compatible version available, I’d be interested to see it!
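For anyone picking this up, here’s roughly what the scikit-learn baseline looks like (the shapes and settings below are placeholders rather than the exact ones from the paper):

```python
# Minimal sketch of a scikit-learn ICA baseline over activations. Note that
# n_components can be at most d_model, so unlike the autoencoders it cannot
# learn an overcomplete basis. The activation array is a stand-in for real data.
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)
acts = rng.standard_normal((20_000, 256)).astype(np.float32)  # (n_samples, d_model)

ica = FastICA(n_components=acts.shape[1], max_iter=500)
codes = ica.fit_transform(acts)        # per-example coefficients
directions = ica.components_           # candidate feature directions (complete basis)
```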