A couple weeks ago, I decided to experiment with vibe coding to see whether an AI agent[1] could do vibe interpretability research. The topic of the experiment was to look into the following questions (rough sketches of what I mean below):
1. Given a soft prompt, can you automatically decompose it into a weighted set of hard prompts?
2. If yes, how interpretable and faithful is that decomposition?
3. Can you approximate a LoRA using a soft prompt?
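To make the first two questions concrete, here’s a minimal sketch of the kind of decomposition I have in mind: take a pool of candidate hard prompts, embed each with the model’s input embedding table, and fit a sparse weight vector so the weighted mixture reconstructs the soft prompt. This is my illustration in PyTorch with hypothetical inputs, not the code Claude actually wrote.

```python
# Hypothetical sketch: fit weights w so that sum_i w[i] * hard_embeds[i]
# approximates a learned soft prompt. Not the experiment's actual code.
import torch

def decompose_soft_prompt(soft, hard_embeds, steps=2000, lr=1e-2, l1=1e-3):
    """soft: (k, d) learned soft-prompt embeddings.
    hard_embeds: (n, k, d) embeddings of n candidate hard prompts, each
    tokenized and padded/truncated to k tokens.
    Returns an (n,) weight vector; the L1 penalty pushes most weights to
    zero so the decomposition stays small enough to read."""
    w = torch.zeros(hard_embeds.shape[0], requires_grad=True)
    opt = torch.optim.Adam([w], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        recon = torch.einsum("n,nkd->kd", w, hard_embeds)  # weighted mixture
        loss = (recon - soft).pow(2).mean() + l1 * w.abs().sum()
        loss.backward()
        opt.step()
    return w.detach()
```

Faithfulness (question 2) is then a behavioral check: run the model with the soft prompt and with the reconstructed mixture and compare the output distributions. Interpretability is whether the high-weight hard prompts read as a sensible description of what the soft prompt does.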
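Question 3 is essentially a distillation problem. A sketch under the same caveats (all names hypothetical, assuming HuggingFace-style models that accept `inputs_embeds`): train a soft prompt so the base model’s next-token distribution matches the LoRA’d model’s.

```python
# Hypothetical sketch: distill a LoRA's behavior into a k-token soft prompt.
import torch
import torch.nn.functional as F

def fit_soft_prompt_to_lora(base_model, lora_model, batches, k=20, lr=1e-3):
    d = base_model.get_input_embeddings().embedding_dim
    soft = torch.randn(k, d, requires_grad=True)
    opt = torch.optim.Adam([soft], lr=lr)
    for input_ids in batches:  # batches of token ids, shape (B, T)
        embeds = base_model.get_input_embeddings()(input_ids)
        # Prepend the soft prompt to every sequence in the batch.
        prefix = soft.unsqueeze(0).expand(embeds.shape[0], -1, -1)
        student = base_model(inputs_embeds=torch.cat([prefix, embeds], dim=1))
        with torch.no_grad():
            teacher = lora_model(input_ids=input_ids)
        # Drop the k prefix positions so the two sets of logits line up.
        s = student.logits[:, k:].log_softmax(-1).flatten(0, 1)
        t = teacher.logits.softmax(-1).flatten(0, 1)
        loss = F.kl_div(s, t, reduction="batchmean")
        opt.zero_grad()
        loss.backward()
        opt.step()
    return soft.detach()
```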
I ran this fairly shortly after posting[2]. The actual vibe research session only lasted about 36 hours, most of which was spent running some pretty inefficient training code on unnecessarily large batches. I was only present for a couple of hours of that.
Then I went to write it up, and ran into an ugh field of astonishing strength: there was a huge amount of output, and I couldn’t be confident in any of it. Lots of the work looked solid based on Claude’s research log, but there were often ablations or obvious follow-ups that weren’t done (and weren’t called out as such), and when I looked at the scripts corresponding to the research log, the connection between what RESEARCH_LOG.md said and what each script actually did was sometimes tenuous. To be clear, most of the work I looked at had no mistakes that were obvious to me, but the Principle of Least Astonishment was violated often enough in the areas I spot-checked that I didn’t trust the areas I hadn’t checked.
Exacerbating this is the fact that I’m not an expert in everything here. I know enough to recognize that certain parts of the codebase Claude produced smell suspicious, but not enough to be confident I’d catch every mistake.
My takeaways from this experiment are:
- I still don’t know if Claude is capable of doing reasonable-quality vibe research with the right scaffolding and person driving the research, but with me driving the research the answer is “no”, because I don’t have the skills necessary to evaluate Claude’s output.
- [low confidence] Soft prompts are probably not a very exciting interpretability technique, even if decomposing them into hard prompts works.
- Doing code review for AI-generated code is slightly annoying on codebases I know deeply, but doing code review for AI-generated code where I don’t have a very deep understanding of the underlying domain is SO MUCH WORSE. How do people do it???
- I really want someone[3] to make a benchmark for reviewability / least astonishment for LLM code, so that the big labs go hill-climb on it in such a way that I can actually trust that the code does what I would expect it to do from the summaries and function names.
This will probably be my last post regarding this experiment.
[1] Claude Opus 4.6 scaffolded by Claude Code on a box with a single H100.
[2] The answers to those questions, as best I can tell, are “kinda”, “kinda”, and “not really” respectively. However, “as best I can tell” is “not very well”.
[3] Ideally not me, because the free time I can dedicate to this stuff is on the order of 3-5 hours/week, and so if I’m the one making LeastAstonishmentBench I expect it’ll be good to go sometime in 2029.