Wanting to raise kids/have what would normally be considered a lineage is importantly different from IGF; most people would not consider sperm bank donation to satisfy their child-having goals very well despite this being very good for IGF.

# leogao

For what it’s worth, my view on the value of conferences is that a huge proportion of the value comes from meeting new people, catching up with people who live elsewhere, having research discussions, etc. I’ve occasionally found out about a paper I wouldn’t have otherwise, but this is a smaller fraction of the value for me. Language model research is generally behind the state of the art of what’s available inside labs, and a huge fraction of papers at conferences won’t replicate or are too toy or otherwise end up never becoming useful.

tourist traps are the lubricant between the expectations and realities of travel

How do you sample uniformly from the integers?

Very exciting that JumpReLU works well with STE gradient estimation! I think this fixes one of the biggest flaws with TopK, which is that having a fixed number of latents k on each token is kind of wonky. I also like the argument in section 4 a lot—in particular the point about how this works because we’re optimizing the expectation of the loss. Because of how sparse the features are, I wonder if it would reduce gradient noise substantially to use a KDE with state persisting across a few recent steps.

I’m 80% that with optimal hyperparameters for both (you need to retune hparams when you change batch size), a batch size of 131072/64 is substantially less efficient than a batch size of 131072. We find that at a batch size of 131072, when hyperparameters are tuned, the training curves as a function of number of tokens are roughly the same as with a batch size of 4096 (see appendix A.4). So it is not the case that 131072 is in a degenerate large-batch regime where efficiency is substantially degraded by batch size.

When your batch is not fully iid, this is like effectively having a smaller batch size of iid data (in the extreme, if your batch contains 64 copies of the same data, this is obviously the same as a 64x smaller batch size), but you still pay the compute cost of putting all 131072 tokens through the model.
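The duplicated-batch intuition can be checked directly. Here's a minimal numpy sketch with a toy linear model (all names and sizes are made up for illustration): the mean gradient over a batch duplicated 64x is identical to the mean gradient over the original small batch, even though you'd pay 64x the compute to get it.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.standard_normal(5)

# a small iid batch of 4 examples
X = rng.standard_normal((4, 5))
y = rng.standard_normal(4)

def grad(w, X, y):
    # gradient of mean squared error for a linear model
    return 2 * X.T @ (X @ w - y) / len(y)

# duplicating the batch 64x leaves the mean gradient unchanged,
# but would cost 64x the compute in a real training step
X_dup = np.tile(X, (64, 1))
y_dup = np.tile(y, 64)
assert np.allclose(grad(w, X, y), grad(w, X_dup, y_dup))
```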

Some takes on some of these research questions:

Looking for opposing feature directions in SAEs

I checked a top-k SAE with 256k features and k=256 trained on GPT-4 and found only 286 features that had any other feature with cosine similarity < −0.9, and 1314 with cosine sim < −0.7.
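For anyone who wants to run a similar check on their own SAE, here's a rough numpy sketch of the pairwise-cosine computation. Toy sizes and a random decoder stand in for a real trained one; at 256k latents you'd chunk the similarity computation rather than materialize the full matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
n_latents, d_model = 512, 64  # toy sizes; the real SAE had 256k latents

# hypothetical decoder matrix with one unit-norm feature direction per row
W_dec = rng.standard_normal((n_latents, d_model))
W_dec /= np.linalg.norm(W_dec, axis=1, keepdims=True)

# cosine similarity between every pair of feature directions
cos = W_dec @ W_dec.T
np.fill_diagonal(cos, 0.0)  # ignore each feature's similarity with itself

# count features whose most-opposed partner is below a threshold
n_opposed = int(np.sum(cos.min(axis=1) < -0.7))
```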

SAE/Transcoder activation shuffling

I’m confident that when learning rate and batch size are tuned properly, not shuffling eventually converges to the same thing as shuffling. The right way to frame this imo is the efficiency loss from not shuffling, which from preliminary experiments+intuition I’d guess is probably substantial.

How much does initializing the encoder to be the transpose of the decoder (as done here and here) help for SAEs and transcoders?

It helps tremendously for SAEs by very substantially reducing dead latents; see appendix C.1 in our paper.
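A sketch of what that initialization looks like (toy shapes and names, not the exact code from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_latents = 64, 512  # toy sizes

# initialize the encoder randomly, then set the decoder to its transpose
W_enc = rng.standard_normal((n_latents, d_model)) / np.sqrt(d_model)
W_dec = W_enc.T.copy()

# unit-normalize the decoder columns (one column per latent)
W_dec /= np.linalg.norm(W_dec, axis=0, keepdims=True)
```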

Cool work—figuring out how much of scaling up autoencoders is discovering new features vs splitting existing ones feels quite important. Especially since for any one scale of autoencoder there are simultaneously features which are split too finely and features which are too rare to yet be discovered, it seems quite plausible that the most useful autoencoders will be ones with features stitched together from multiple scales.

Some minor nitpicks: I would recommend always thinking of MSE/L0 in terms of the frontier between the two, rather than either alone; in my experiments I found it very easy to misjudge at a glance whether a run with better MSE but worse L0 was better or worse than the frontier.

I think something based on prediction markets can counteract this kind of war-of-attrition strategy. There are two main advantages of this solution: (a) it requires users to stake their reputation on their claims, and so if you ever double down really really hard on something that’s obviously wrong, it will cost you a lot, and (b) in general prediction markets solve the problem of providing a cheap way to approximate a very expensive process if it’s obvious to everyone what the output of the very expensive process will be, which nullifies an entire swathe of bad-faith arguing techniques.

To avoid the Arbital failure mode, I think the right strategy is to (i) start simple and implement one feature at a time and see how it interacts with actual conversations (every successful complex system grows out of a simple one—maybe we can start with literally just a LW clone but the voting algorithm is entirely using the community notes algorithm), and (ii) for the people implementing the ideas to be basically the same people coming up with the ideas.

VAEs:

a normal autoencoder decodes single latents z to single images (or whatever other kind of data) x, and also encodes single images x to single latents z.

with VAEs, we want our decoder (p(x|z)) to take single latents z and output a distribution over x’s. for simplicity we generally declare that this distribution is a gaussian with identity covariance, and we have our decoder output a single x value that is the mean of the gaussian.

because each x can be produced by multiple z’s, to run this backwards you also need a distribution of z’s for each single x. we call the ideal encoder p(z|x) - the thing that would perfectly invert our decoder p(x|z). unfortunately, we obviously don’t have access to this thing. so we have to train an encoder network q(z|x) to approximate it. to make our encoder output a distribution, we have it output a mean vector and a stddev vector for a gaussian. at runtime we sample a random vector eps ~ N(0, I), multiply it elementwise by the stddev vector, and add the mean vector, giving a sample from N(mu, std).
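In code, this sampling step (the reparameterization trick) looks something like the following sketch, with made-up mu/std values for a single input:

```python
import numpy as np

rng = np.random.default_rng(0)

# hypothetical encoder outputs for one input x: mean and stddev vectors
mu = np.array([0.5, -1.0, 2.0])
std = np.array([0.1, 0.3, 1.0])

# reparameterization: sample eps ~ N(0, I), then z = mu + std * eps,
# so z ~ N(mu, diag(std^2)) while gradients can still flow through mu and std
eps = rng.standard_normal(mu.shape)
z = mu + std * eps
```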

to train this thing, we would like to optimize the following loss function:

-log p(x) + KL(q(z|x)||p(z|x))

where the terms optimize the likelihood (how good is the VAE at modelling data, assuming we have access to the perfect z distribution) and the quality of our encoder (how good is our q(z|x) at approximating p(z|x)). unfortunately, neither term is tractable—the former requires marginalizing over z, which is intractable, and the latter requires p(z|x) which we also don’t have access to. however, it turns out that the following is mathematically equivalent and is tractable:

-E z~q(z|x) [log p(x|z)] + KL(q(z|x)||p(z))

the former term is just the likelihood of the real data under the decoder distribution given z drawn from the encoder distribution (which happens to be exactly equivalent to the MSE, because it’s the log of a gaussian pdf). the latter term can be computed analytically, because both distributions are gaussians with known mean and std. (the distribution p is determined in part by the decoder p(x|z), but that doesn’t pin down the entire distribution; we still have a degree of freedom in how we pick p(z). so we typically declare by fiat that p(z) is a N(0, I) gaussian. then, p(z|x) is implied to be equal to p(x|z) p(z) / ∫ p(x|z’) p(z’) dz’)
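Both tractable terms are easy to write down. Here's a minimal numpy sketch of the per-example loss, assuming a diagonal-gaussian encoder and an identity-covariance gaussian decoder, and dropping the additive constants from the gaussian pdf:

```python
import numpy as np

def kl_to_standard_normal(mu, std):
    # closed-form KL(N(mu, diag(std^2)) || N(0, I)), summed over latent dims
    return 0.5 * np.sum(mu**2 + std**2 - 1.0 - np.log(std**2))

def reconstruction_nll(x, x_hat):
    # -log p(x|z) for an identity-covariance gaussian decoder is
    # 0.5 * ||x - x_hat||^2 plus a constant, i.e. MSE up to scale
    return 0.5 * np.sum((x - x_hat) ** 2)

def vae_loss(x, x_hat, mu, std):
    # the tractable objective: reconstruction term + KL(q(z|x) || p(z))
    return reconstruction_nll(x, x_hat) + kl_to_standard_normal(mu, std)
```

Note that the KL term is zero exactly when the encoder outputs the prior (mu = 0, std = 1), which is what pulls the latent distribution toward N(0, I).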

learning thread for taking notes on things as i learn them (in public so hopefully other people can get value out of it)

I’d be excited about a version of Wikipedia that is built from the ground up to operate in an environment where truth is difficult to find and there is great incentive to shape the discourse. Perhaps there are new epistemic technologies similar to community notes that are yet to be invented.

One problem is that the outcome may itself be strongly correlated with the S&P 500, which would mess up the probabilities

I like the aesthetics of this idea and am vaguely interested in the idea of making this happen (though most likely I will never get around to doing it)

I’ve been told by Gabriel Goh that the CLIP neurons are apparently surprisingly sparse, so the neuron baseline might be stronger relative to SAEs than in e.g. LLMs. (This is also why the Multimodal Neurons work was possible without SAEs)

Did you use the initialization scheme in our paper where the decoder is initialized to the transpose of the encoder (and then columns unit normalized)? There should not be any dead latents with topk at small scale with this init.

Also, if I understand correctly, leaky topk is similar to the multi-topk method in our paper. I’d be interested in a comparison of the two methods.

(In fact, we know that the fraction of features extracted is probably quite small—for example, the 16M latent GPT-4 autoencoder only captures 10% of the downstream loss in terms of equivalent pretraining compute.)

(I’m the first author of the linked paper on GPT-4 autoencoders.)

I think many people are heavily overrating how human-explainable SAEs today are, because it’s quite subtle to determine whether a feature is genuinely explainable. SAE features today, even in the best SAEs, are generally not explainable with simple human-understandable explanations. By “explainable,” I mean there is a human-understandable procedure for labeling whether the feature should activate on a given token (and also how strong the activation should be, but I’ll ignore that for now), such that your procedure predicts an activation *if and only if* the latent actually activates.

There are a few problems with interpretable-looking features:

it is insufficient that latent-activating samples have a common explanation. You also need the opposite direction: things that match the explanation need to actually activate the latent. For example, we found a neuron in GPT-2 that appears to activate on the word “stop,” but actually most instances of the word “stop” don’t activate the neuron. It turns out that this was not really a “stop” neuron, but rather a “don’t stop/won’t stop” neuron. While in this case there was a different but still simple explanation, it’s entirely plausible that many features just cannot be explained with simple explanations. This problem gets worse as autoencoders scale, because their explanations will get more and more specific.

People often look at the top activating examples of a latent, but this provides a heavily misleading picture of how monosemantic the latent is even just in the one direction. It’s very common for features to have extremely good top activations but then terrible nonzero activations. This is why our feature visualizer shows random nonzero activations before the top activations.

Oftentimes, it is actually harder to simulate a latent than it looks. For example, we often find latents that activate on words in a specific context (say, financial news articles), but the latent seems to activate on random words inside those contexts and we don’t have a good explanation of why it activates on some words but not others.

We also discuss this in the evaluation section of our paper on GPT-4 autoencoders. The ultimate metric of whether the features are explainable that we introduce is the following: simulate each latent with your best explanation of the latent, and then run the values through the decoder and the rest of the model and look at the downstream loss. This procedure is very expensive, so making it feasible to run is a nontrivial research problem, but I predict basically all existing autoencoders will score terribly on this metric.

We had done very extensive ablations at small scale where we found TopK to be consistently better than all of the alternatives we iterated through, and by the time we launched the big run we had already worked out how to scale all of the relevant hyperparameters, so we were decently confident.

One reason we might want a progressive code is it would basically let you train one autoencoder and use it for any k you wanted to at test time (which is nice because we don’t really know exactly how to set k for maximum interpretability yet). Unfortunately, this is somewhat worse than training for the specific k you want to use, so our recommendation for now is to train multiple autoencoders.

Also, even with a progressive code, the activations on the margin would not generally be negative (we actually apply a ReLU to make sure that the activations are definitely non-negative, but almost always the (k+1)th value is still substantially positive)
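For concreteness, here's a numpy sketch of a TopK activation with the extra ReLU (an illustration of the idea, not the paper's actual implementation): the k largest pre-activations are kept, everything else is zeroed, and the ReLU then guarantees non-negativity in the rare case where a kept value is negative.

```python
import numpy as np

def topk_relu(pre_acts, k):
    # keep only the k largest pre-activations per row, zero out the rest,
    # then apply ReLU so the surviving activations are definitely non-negative
    acts = np.zeros_like(pre_acts)
    idx = np.argpartition(pre_acts, -k, axis=-1)[..., -k:]
    np.put_along_axis(acts, idx, np.take_along_axis(pre_acts, idx, axis=-1), axis=-1)
    return np.maximum(acts, 0.0)
```

For example, `topk_relu(np.array([[3.0, -1.0, 2.0, 0.5]]), 2)` keeps only the 3.0 and 2.0 entries; with `[[-1.0, -2.0, -3.0, 5.0]]` and k=2, the kept -1.0 is zeroed by the ReLU, so fewer than k latents end up active.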

I bet that the size of the difference from actually tuning the piano perfectly is smaller than the difference from merely telling the audience that the piano has been perfectly tuned.