Letting Claude do Autonomous Research to Improve SAEs

This work was done as part of MATS 7.1

I pointed Claude at our new synthetic Sparse Autoencoder benchmark, told it to improve Sparse Autoencoder (SAE) performance, and left it running overnight. By morning, it had boosted F1 score from 0.88 to 0.95. Within another day, with occasional input from me, it had matched the logistic regression probe ceiling of 0.97 -- a score I honestly hadn’t thought was possible for an SAE on this benchmark.

The most surprising development was when Claude autonomously found a dictionary-learning paper from 2010, turned its algorithm into an SAE encoder, and Matryoshka-ified it, improving performance by a few percentage points in the process. I had never heard of this algorithm before (although I really should have).

In this post, I’ll describe the setup, walk through the improvements Claude found, and discuss what this experiment taught me about the strengths and weaknesses of autonomous AI research.

We haven’t yet verified how well these improvements transfer to LLM SAEs, so don’t rush to implement every change mentioned here into your SAEs just yet! We’ll discuss challenges and next steps for LLM verification at the end of the post.

The TASK.md we gave Claude and the resulting SAE code are available on GitHub.

The resulting SAE, called “LISTA-Matryoshka” here, outperforms all standard SAEs tested and matches the performance of logistic-regression probes on SynthSAEBench-16k at L0=25.

The setup

We recently released a synthetic SAE benchmark called SynthSAEBench. The benchmark contains a synthetic model with 16k ground-truth features (SynthSAEBench-16k). We intentionally designed this model to be difficult for SAEs, including known challenges like hierarchical features, feature correlations, and feature superposition. In the paper, we found that the best SAE architecture we tested, the Matryoshka SAE, only achieves an F1 score of 0.88, compared to an F1 score of 0.97 achieved by a logistic regression probe. The best SAE also only achieved an average cosine similarity (MCC) of 0.78 between its learned latent directions and the ground-truth feature directions. For more details on these metrics, see the paper.

Training an SAE on SynthSAEBench-16k takes about 20 minutes on a single GPU, making it a nice test-bed for rapid iteration. I set up Claude Code on a server and ran it in a Ralph Wiggum loop, where in each iteration Claude conducts a “research sprint”: it generates an idea, implements it, runs the experiment, and writes up a report. I’d steer Claude lightly by adding or removing ideas in a TASK.md file, but it was largely autonomous. The full TASK.md file is available here.

SAE improvements

The following table summarizes the components Claude used to increase the F1 score from 0.88 to 0.97 and MCC from 0.78 to 0.84. See the Appendix for full details on each.

| Improvement | Description | Origin |
| --- | --- | --- |
| Linearly decrease K | Start with higher K and anneal to target K during training. Similar to Anthropic’s JumpReLU training recommendation. | Found in my repo, Claude tried it autonomously |
| Detach inner Matryoshka levels | Detach gradients between Matryoshka levels except the outermost, so inner levels only get gradients from the full-width reconstruction. | Found in my repo, hinted by me |
| LISTA encoder | Use a single iteration of LISTA (a neural approximation to classical sparse coding) as the SAE encoder. | Claude innovation |
| TERM loss | From this paper: up-weight high-loss samples via a tilted exponential. Claude repurposed it as a general SAE training improvement with a small tilt coefficient (~2e-3). Very minor improvement. | Claude innovation |
| Sort Matryoshka levels by frequency | Dynamically sort latents by firing frequency before applying Matryoshka losses, improving stability and dead latent revival. | My idea, Claude implemented |

Claude also tried plenty of ideas that did not work, which I won’t list here, but this is part of the research process!

Some of these ideas were components Claude found in my SAE experiments repo, some were ideas I suggested, but the ones that impressed me most—LISTA and TERM loss—were fully Claude’s own initiative. In both cases, Claude found a relevant paper online, adapted the idea to SAEs, and tested it without any prompting from me.

Diving deeper: LISTA encoder

Claude’s idea to remix LISTA into an SAE and Matryoshka-ify it really amazed me; it’s something I would not have thought of, as I wasn’t even aware of LISTA before. However, it is probably an obvious thing to try for someone who is an expert in both modern SAEs and classical dictionary learning. Claude’s implementation of the LISTA BatchTopK encoder is shown in PyTorch pseudocode below:

def encode(sae_in, W_enc, W_dec, b_enc, b_dec, k, eta=0.3, n_iterations=1):
	# standard BatchTopK SAE encode
	sae_in_centered = sae_in - b_dec
	hidden_pre = sae_in_centered @ W_enc + b_enc
	latent_acts = batch_topk(hidden_pre, k)
	# iteratively refine the initial encoding
	for _ in range(n_iterations):
		residual = sae_in_centered - (latent_acts @ W_dec)
		correction = residual @ W_enc
		hidden_pre = hidden_pre + eta * correction
		latent_acts = batch_topk(hidden_pre, k)
	return latent_acts

The idea is to iteratively refine the SAE prediction over a number of steps, ultimately converging to the “optimal” latent activations. The version in the LISTA paper is even more general than this: effectively a learned W_enc, b_enc, and W_dec per iteration, with eta also learned, while Claude’s version reuses W_enc across iterations and sets b_enc = 0 for each iteration after the initial SAE encode. The original LISTA work doesn’t try to learn both the dictionary and the encoder at the same time; rather, it has the learned encoder approximate ISTA (where W_enc = W_dec.T), so it was surprising to me that this works if you just backprop through everything like Claude does.
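For context, the classical ISTA iteration that LISTA learns to approximate can be written as follows (my paraphrase, using the same encoder/decoder naming as the pseudocode above):

```latex
% One ISTA refinement step: take a gradient step on the reconstruction
% error, then apply a shrinkage nonlinearity h_theta.
z^{(t+1)} = h_\theta\!\left( z^{(t)} + \eta \, W_{enc} \left( x - W_{dec} \, z^{(t)} \right) \right)
```

Claude’s variant effectively swaps the shrinkage nonlinearity for BatchTopK and reuses the same W_enc at every step.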

In follow-up investigations with Claude, it seems like deviating from this formula leads to worse performance. E.g. using more than 3 iterations, learning eta, or learning a separate W_enc / b_enc per iteration all seem to lead the SAE to overfit and no longer track the ground-truth features well (while explaining more variance).

I’m a bit uneasy about running backprop through the full encode, as it will put gradient pressure on latents that don’t ultimately end up in the final latent_acts, and thus do not get reconstruction pressure. However, it also seems like trying to block gradients to latents that don’t ultimately get selected doesn’t work well for reasons I don’t yet fully understand.

While I’m not confident this works in LLM SAEs yet (results so far have been mixed), this is very much the type of thing I would expect to work well. An SAE can be viewed as a single step of the LISTA algorithm, and in theory a single step should not perform particularly well. It doesn’t seem crazy that doing 2 steps, or 1.5 steps, or whatever Claude came up with exactly, could help things. Doing too many steps seems to make it easy for the SAE to find creative ways to overfit (abusing correlations or superposition noise, for example).

Validating on LLMs with SAEBench

I’ve been trying to validate that these ideas improve performance on LLM SAEs using SAEBench, but have so far not been able to prove anything decisively. The core problem is that SAEBench metrics are noisy: you need multiple seeds, multiple L0 values, and results often point in different directions (e.g. TPP increases but SCR decreases). Properly evaluating a single architecture change can easily cost $1000+ in compute, which is prohibitive for an independent researcher without strong prior confidence that the results will be clear.

So far, LISTA with eta=0.3 seems to break on LLMs, and with lower eta it’s hard to distinguish signal from noise. Some changes—like Matryoshka frequency sorting—are almost certainly improvements, but proving this rigorously will require training a lot more LLM SAEs.

Regardless, whether or not these improvements ultimately translate to LLMs is not Claude’s fault. Claude crushed the task I set out for it, which was to make SAE architectural improvements that increase F1 score and MCC on SynthSAEBench-16k.

Claude’s research strengths and weaknesses

Overall I was very impressed with Claude Opus 4.6’s ability to do autonomous research. It came up with sprint ideas, ran them itself, summarized results, and then built on what worked. I was most impressed with its ability to find random research papers online and test out ideas from them without much prompting from me (aside from telling it to spend time looking at related fields before starting the sprint).

I thought the LISTA idea was particularly brilliant and is not something I would have come up with, but is probably obvious to someone who’s an expert in classical dictionary learning. I think a big strength of these models is that they are very knowledgeable on basically every field, so if some idea would be obvious to someone who’s an expert in a field I’m not an expert in, the model is likely to try ideas I wouldn’t think to.

That being said, a lot of the other ideas Claude tried were either hinted at by me or were floating around in my SAE research repo, which Claude perused. I found that once I hinted at an idea by adding it to the ideas list, Claude was very capable of understanding it, coding it up, and testing it, but for many of these ideas I’m not sure it would have come up with them itself without this hinting.

One thing I noticed is that Claude tends to be over-confident in its interpretation of sprint results, without thinking through the ways the sprint may have gone wrong or what alternative explanations there might be. For instance, in one sprint Claude had an implementation bug that meant the sprint wasn’t actually testing anything, and Claude confidently declared the idea didn’t work. Once I told it to check whether the code was actually running, Claude realized its mistake and redid the sprint. I do worry that the conclusions Claude draws are not always rigorously tested, but this is a cheap way to test out a lot of ideas quickly.

I also found that Claude tends to get stuck building on the first things it finds that seem to work, rather than trying a broad set of very different ideas. It took a bit of nudging to get Claude to try completely different ideas, since seeing its previous sprints seems to bias it toward those same directions. I suspect it should be possible to get around this by either not letting it see the past sprints, or doing a separate “idea generation” session outside of a single sprint, where you can collaboratively come up with sprint ideas to try.

I’ve also found that having Claude run these sprints solves a focus problem I struggle with in ML research, where I find it’s just so hard to stay in flow when you constantly need to run something and check back in 1 hour. I don’t like constant context switching, and tend to get distracted instead. Claude doesn’t get distracted, and will diligently run the next step 1 hour later and keep going until everything is completed and written up.

Overall this feels like having a really fast and extremely smart master’s student who can iterate quickly but could use a little bit of guidance. I also think this setup benefits from having clear numbers to optimize and a relatively quick iteration cycle. I don’t think this would have as much success if Claude had to train LLM SAEs and run SAEBench, for example.

Next steps

I now have a setup where I can propose an idea to Claude and then have it go off and investigate it, do a sprint, write up a report, and ping me when it’s done. I’d love to have this integrated into Slack too, so I can just chat with it in a thread and have it run sprints and put the results and PDF reports into the Slack channel.

So far I’ve only had Claude trying to maximize scores on the single SynthSAEBench-16k model, and it has done an amazing job at that, but I suspect part of the success is that it’s hill-climbed a bit too much on that specific model. I’ll next try creating a suite of synthetic models with varying properties to make sure the ideas Claude comes up with are not over-fit to this specific synthetic model.

Finally, we need to get better at evaluating on LLM SAEs / SAEBench. This could look like trying to really expand the quantity and quality of datasets in each metric (maybe I can ask Claude to do this), or might just involve getting more compute funding to test these ideas out properly with multiple seeds per SAE. I’d be curious to hear any ideas on this from others in the community too!

Give it a try!

I found having Claude autonomously try out SAE architecture ideas on SynthSAEBench to be surprisingly easy. You can check out the code for the SAE Claude came up with and a version of the TASK.md prompt at https://github.com/chanind/claude-auto-research-synthsaebench. Try it out!


Appendix: Improvement details

Linearly decrease K during training

Claude found that starting with a higher K and linearly decreasing down to the target K during training seems to help the resulting SAE quality. This is implicitly similar to how Anthropic recommends training JumpReLU SAEs, so it’s not shocking this would help BatchTopK SAEs too.
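As a minimal sketch of what such a schedule could look like (the function and hyperparameter names here are my own, not the exact repo code):

```python
def current_k(step, total_steps, k_start=100, k_target=25, anneal_frac=0.8):
    """Linearly anneal K from k_start down to k_target over the first
    anneal_frac of training steps, then hold it at k_target."""
    anneal_steps = int(total_steps * anneal_frac)
    if step >= anneal_steps:
        return k_target
    frac = step / anneal_steps
    return round(k_start + frac * (k_target - k_start))
```

The specific starting K and annealing fraction would be hyperparameters to tune.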

This setting was an option in my SAE repo, but Claude saw it, tried setting it, and found good results.

Detach inner Matryoshka levels, but not the final level

Matryoshka SAEs compute reconstruction losses on nested prefixes of the latents (called levels here) and sum these losses together during training. This makes it like training SAEs of different widths that happen to share latents. Claude figured out that it improves performance to detach the gradients between Matryoshka levels, except for the outermost level. So if a Matryoshka SAE is trained with levels [128, 512, 2048, 4096], where 4096 is the full width of the SAE, the 128 level receives no gradient from the 512 and 2048 level losses, but does receive a gradient from the full 4096 reconstruction.
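A sketch of how this detaching scheme could be implemented (my reading of the idea; the names and the unweighted per-level MSE are illustrative):

```python
import torch

def matryoshka_recon_losses(latent_acts, W_dec, b_dec, sae_in,
                            levels=(128, 512, 2048, 4096)):
    """Per-level reconstruction losses for a Matryoshka SAE. For each inner
    level, the prefix inherited from earlier levels is detached, so a latent
    only receives gradients from its own level's loss and from the full-width
    (outermost) reconstruction."""
    losses = []
    prev = 0
    for width in levels:
        if width == levels[-1]:
            acts = latent_acts[:, :width]  # full width: nothing detached
        else:
            # detach latents from earlier levels; only latents
            # [prev:width] get gradients from this level's loss
            acts = torch.cat(
                [latent_acts[:, :prev].detach(), latent_acts[:, prev:width]],
                dim=1,
            )
        recon = acts @ W_dec[:width] + b_dec
        losses.append((recon - sae_in).pow(2).mean())
        prev = width
    return losses
```

The total training loss would then sum these per-level losses, as in a standard Matryoshka SAE.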

This setting was an option in my SAE repo, and something I mentioned as an idea in the task.

LISTA encoder

Claude found a 2010 dictionary-learning paper called “Learning Fast Approximations of Sparse Coding” that uses a neural network to approximate a classical sparse coding technique called the Iterative Shrinkage-Thresholding Algorithm (ISTA). Claude whipped up an SAE version of this, using LISTA for the encoder, and also remixed a Matryoshka version.

Claude found that using a single iteration yields best results, using a weighting of 0.3 for the adjustment after each iteration. I was really amazed by Claude here, as I would never have come up with a LISTA SAE, especially one where you intentionally train only 1 iteration rather than letting it converge. Claude’s implementation also just backprops through the iterations during training, which I would not have thought would work, but it seems to!

I had not heard of LISTA before (although I really should have in retrospect), and struggle with traditional dictionary learning papers in general.

TERM loss

Claude found the paper Decoding Dark Matter: Specialized Sparse Autoencoders for Interpreting Rare Concepts in Foundation Models, which has a loss called TERM that up-weights training samples that have large loss to encourage SAE training to focus more on these samples. The formula for TERM loss is $\tilde{L}(t) = \frac{1}{t} \log\left(\frac{1}{N} \sum_{i=1}^{N} e^{t L_i}\right)$, where $L_i$ is the normal SAE loss for sample $i$, $N$ is the number of samples in a batch, and $t$ is a tilt coefficient that determines how skewed the loss is towards high-loss samples.

Interestingly, the paper doesn’t even suggest this as a way to improve SAE performance, but Claude just did this anyway and found that using TERM with a small coefficient (~2e-3) seems to help SAE quality. This is a pretty minor improvement, but still a really interesting idea. It’s possible that more tweaks to standard SAE loss like this could help improve performance as well.

Dynamic Matryoshka levels by firing frequency

Normally, Matryoshka SAEs enforce that the earlier latent indices must learn higher-frequency concepts. However, we already track latent firing frequencies during SAE training, so we can dynamically sort the latents by firing frequency before applying the Matryoshka losses. A more rigorous version of this would probably be to sort by expected MSE (expected firing magnitude squared). This helps training stability since if a later latent happens to learn a higher frequency concept, it does not need to unlearn it during training. This also helps with dead latent revival, since dead latents are always implicitly revived into the outer-most matryoshka level.
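As a sketch, the sorting step could look like this (illustrative; the real implementation would maintain a running frequency estimate, and the decoder rows would need the same permutation when computing each level's reconstruction):

```python
import torch

def sort_latents_by_frequency(latent_acts, firing_freqs):
    """Reorder latent columns so the most frequently firing latents come
    first, so that Matryoshka prefix losses always cover the
    highest-frequency latents regardless of their original index."""
    order = torch.argsort(firing_freqs, descending=True)
    return latent_acts[:, order], order
```
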