The F1 scores are impressive, but the MCC is still substantially below 1. Isn't it odd that each feature in the activations has an associated SAE latent that fires exactly when it ought to, and yet the decoder directions are still pretty misaligned? Is this hedging? Do your SAEs have exactly 16k latents?
The SAEs have 4096 latents, so they're intentionally narrower than the synthetic model. The idea was that since we're almost certainly never training SAEs with the full number of features of an LLM, we should make sure the SAEs here are also intentionally too narrow.
I was also surprised that this doesn’t mess up the F1 probing of the SAE more—I assumed that hedging due to the SAE being too narrow would make it impossible for the encoder to act as that accurate of a probe, but that’s seemingly not the case!
I also tried training a 4096-width decoder on the ground-truth activations to get a sense of what the MCC ceiling is with a perfect encoder given the SAE width, and it gets an MCC of around 0.87, so there's definitely more room for improvement on that metric. I'm not sure there's a way to get above 0.87 with only 4096 latents, though, without some novel reconstruction loss or something.
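In case it's useful for interpreting these numbers, here's a minimal sketch of an MCC-style decoder-alignment metric. It assumes Hungarian matching of SAE decoder directions to ground-truth feature directions on absolute cosine similarity; the actual SynthSAEBench implementation may be defined somewhat differently.

```python
# Sketch of an MCC-style decoder-alignment metric: match each ground-truth
# feature direction to an SAE decoder direction via a Hungarian assignment
# on |cosine similarity|, then average the matched similarities.
# The real SynthSAEBench metric may differ.
import numpy as np
from scipy.optimize import linear_sum_assignment


def mcc(true_dirs: np.ndarray, decoder_dirs: np.ndarray) -> float:
    """true_dirs: (n_true_features, d_model); decoder_dirs: (n_latents, d_model)."""
    true_norm = true_dirs / np.linalg.norm(true_dirs, axis=1, keepdims=True)
    dec_norm = decoder_dirs / np.linalg.norm(decoder_dirs, axis=1, keepdims=True)
    cos_sim = np.abs(true_norm @ dec_norm.T)  # (n_true_features, n_latents)
    # Hungarian matching maximizes the total matched similarity.
    # Note: if the SAE is narrower than the true feature count, only
    # min(n_true_features, n_latents) pairs get matched in this sketch.
    row_idx, col_idx = linear_sum_assignment(-cos_sim)
    return float(cos_sim[row_idx, col_idx].mean())
```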
Before applying this new SAE to language models, you could see how hyperparameter-sensitive it is by creating multiple variants of SynthSAEBench with different correlation structures, Zipfian exponents over feature firing probabilities, etc., and seeing how well the optimal SAE hyperparameters transfer from one to another.
This is what I plan to do next! I suspect a lot of the high scores here are just Claude over-optimizing for this specific synthetic model, so making a suite of models with different properties should hopefully make for a more robust test-bed.
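Concretely, I'm imagining something like a small grid of synthetic-model variants, along these lines (the parameter names and values here are just for illustration, not the actual SynthSAEBench config):

```python
# Illustrative sweep over synthetic-model variants; parameter names and
# values are hypothetical placeholders, not the real SynthSAEBench config.
from itertools import product

zipf_exponents = [0.8, 1.0, 1.2]          # skew of feature firing probabilities
correlation_strengths = [0.0, 0.1, 0.3]   # strength of feature co-occurrence structure

variants = [
    {"zipf_exponent": a, "correlation_strength": rho}
    for a, rho in product(zipf_exponents, correlation_strengths)
]

# Train the same SAE recipe on each variant and check whether the best
# hyperparameters found on one variant transfer to the others.
```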
There’s really nothing to the setup, it’s just the TASK.md file, and literally prompting Claude “follow the instructions in TASK.md”. I used the official Ralph Wiggum Plugin for Claude Code to do the looping. I have a Claude Max subscription so I’m not sure what the cost would have been, but honestly I don’t think it uses that many tokens since most of the time Claude is just waiting around for Python code to run on the GPU.

I was just manually editing TASK.md while Claude was running based on what I saw it doing in its sprints, so the next sprint would read the modified TASK.md. Mostly this was in the form of editing the “ideas to try” section of the task file. This was a really low-tech procedure, I’m sure there are better ways to do this!

@Bart Bussmann mentioned https://github.com/Butanium/claude-lab/ which looks really cool! I may try this out as well, I feel like what I did here is the caveman version of autonomous research.
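For concreteness, the caveman loop amounts to roughly the following. This is just a sketch assuming the Claude Code CLI's non-interactive `-p` print mode; in my runs the Ralph Wiggum plugin handled the actual looping and stopping logic.

```python
# "Caveman" version of the sprint loop: repeatedly point Claude Code at
# TASK.md, so each sprint picks up whatever edits were made by hand in the
# meantime. The number of sprints here is arbitrary.
import subprocess

for sprint in range(20):
    subprocess.run(
        ["claude", "-p", "follow the instructions in TASK.md"],
        check=False,  # keep looping even if a sprint errors out
    )
```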
That’s a good idea—I added a sample report PDF from one of the sprints to https://github.com/chanind/claude-auto-research-synthsaebench/blob/main/sample_sprint_report.pdf.