A couple weeks ago, I decided to experiment with vibe coding to see whether an AI agent[1] could do vibe interpretability research. The topic of the experiment was to look into the following questions (rough sketches of what I mean below):
1. Given a soft prompt, can you automatically decompose it into a weighted set of hard prompts?
2. If yes, how interpretable and faithful is that decomposition?
3. Can you approximate a LoRA using a soft prompt?
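To make the first two questions concrete, here’s a minimal sketch of the kind of decomposition I have in mind: take a pool of candidate hard prompts, embed each with the model’s input embedding table, and fit a sparse weight vector so the weighted mixture reconstructs the soft prompt. This is my illustration in PyTorch with hypothetical inputs, not the code Claude actually wrote.

```python
# Hypothetical sketch: fit weights w so that sum_i w[i] * hard_embeds[i]
# approximates a learned soft prompt. Not the experiment's actual code.
import torch

def decompose_soft_prompt(soft, hard_embeds, steps=2000, lr=1e-2, l1=1e-3):
    """soft: (k, d) learned soft-prompt embeddings.
    hard_embeds: (n, k, d) embeddings of n candidate hard prompts, each
    tokenized and padded/truncated to k tokens.
    Returns an (n,) weight vector; the L1 penalty pushes most weights to
    zero so the decomposition stays small enough to read."""
    w = torch.zeros(hard_embeds.shape[0], requires_grad=True)
    opt = torch.optim.Adam([w], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        recon = torch.einsum("n,nkd->kd", w, hard_embeds)  # weighted mixture
        loss = (recon - soft).pow(2).mean() + l1 * w.abs().sum()
        loss.backward()
        opt.step()
    return w.detach()
```

Faithfulness (question 2) is then a behavioral check: run the model with the soft prompt and with the reconstructed mixture and compare the output distributions. Interpretability is whether the high-weight hard prompts read as a sensible description of what the soft prompt does.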
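Question 3 is essentially a distillation problem. A sketch under the same caveats (all names hypothetical, assuming HuggingFace-style models that accept `inputs_embeds`): train a soft prompt so the base model’s next-token distribution matches the LoRA’d model’s.

```python
# Hypothetical sketch: distill a LoRA's behavior into a k-token soft prompt.
import torch
import torch.nn.functional as F

def fit_soft_prompt_to_lora(base_model, lora_model, batches, k=20, lr=1e-3):
    d = base_model.get_input_embeddings().embedding_dim
    soft = torch.randn(k, d, requires_grad=True)
    opt = torch.optim.Adam([soft], lr=lr)
    for input_ids in batches:  # batches of token ids, shape (B, T)
        embeds = base_model.get_input_embeddings()(input_ids)
        # Prepend the soft prompt to every sequence in the batch.
        prefix = soft.unsqueeze(0).expand(embeds.shape[0], -1, -1)
        student = base_model(inputs_embeds=torch.cat([prefix, embeds], dim=1))
        with torch.no_grad():
            teacher = lora_model(input_ids=input_ids)
        # Drop the k prefix positions so the two sets of logits line up.
        s = student.logits[:, k:].log_softmax(-1).flatten(0, 1)
        t = teacher.logits.softmax(-1).flatten(0, 1)
        loss = F.kl_div(s, t, reduction="batchmean")
        opt.zero_grad()
        loss.backward()
        opt.step()
    return soft.detach()
```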
I ran this fairly shortly after posting[2]. The actual vibe research session only lasted about 36 hours, most of which was spent running some pretty inefficient training code on unnecessarily large batches. I was only present for a couple of hours of that.
Then I went to write it up, and ran into an ugh field of astonishing strength: there was a huge amount of output, and I couldn’t be confident in any of it. Lots of the work looked solid based on Claude’s research log, but there were often ablations or obvious follow-ups that weren’t done (and weren’t called out as such), and when I looked at the scripts corresponding to the research log, the connection between what RESEARCH_LOG.md said and what each script actually did was sometimes tenuous. To be clear, most of the work I looked at had no mistakes that were obvious to me, but the Principle of Least Astonishment was violated often enough in the areas I spot-checked that I didn’t trust the areas I hadn’t checked.
Exacerbating this is the fact that I’m not an expert in everything here. I know enough to recognize that certain parts of the codebase Claude produced smell suspicious, but not enough to be confident I’d catch every mistake.
My takeaways from this experiment are:
- I still don’t know if Claude is capable of doing reasonable-quality vibe research with the right scaffolding and person driving the research, but with me driving the research the answer is “no”, because I don’t have the skills necessary to evaluate Claude’s output.
- [low confidence] Soft prompts are probably not a very exciting interpretability technique, even if decomposing them into hard prompts works.
- Doing code review for AI-generated code is slightly annoying on codebases I know deeply, but doing code review for AI-generated code where I don’t have a very deep understanding of the underlying domain is SO MUCH WORSE. How do people do it???
- I really want someone[3] to make a benchmark for reviewability / least astonishment for LLM code, so that the big labs go hill-climb on it in such a way that I can actually trust that the code does what I would expect it to do from the summaries and function names.
This will probably be my last post regarding this experiment.
[1] Claude Opus 4.6 scaffolded by Claude Code on a box with a single H100.
[2] The answers to those questions, as best I can tell, are “kinda”, “kinda”, and “not really” respectively. However, “as best I can tell” is “not very well”.
[3] Ideally not me, because the free time I can dedicate to this stuff is on the order of 3-5 hours/week, and so if I’m the one making LeastAstonishmentBench I expect it’ll be good to go sometime in 2029.