The DEU experiment described here (https://www.lesswrong.com/posts/issGLfCGz3TGcPKGH/deliberate-epistemic-uncertainty-an-automated-experiment-on) was mostly done by Claude Code (Opus 4.6), with me providing the research direction and critical review. I described my idea (Deliberate Epistemic Uncertainty), and Claude simplified it into a testable 2x2 experiment, designed the protocol, wrote the code, generated the inputs, ran 400 API calls, evaluated results using 8 parallel sub-agents, and produced the qualitative analysis — all in roughly a day of wall-clock time with about 2 hours of my input. I want to flag three things about the process that seem independently interesting:
Sub-agents instead of API calls for evaluation. Rather than writing a script that sends hundreds of LLM-as-judge API calls with short prompts, we used Claude Code sub-agents — full agent sessions that first read the experiment design, understand what counts as “detection,” and then evaluate a batch of 50 responses with that context. This is slower and more expensive than API calls, but the quality is noticeably higher: one sub-agent even caught a data parsing bug that I later confirmed. The tradeoff is worth it when evaluation requires judgment, not just pattern matching.
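For concreteness, here is a minimal sketch of the kind of briefing a batch-evaluation sub-agent gets, as opposed to a bare LLM-as-judge prompt. The file names, batch handling, and prompt wording below are illustrative assumptions, not the actual prompts or code from the experiment.

```python
# Illustrative sketch (not the actual experiment code): assemble the briefing
# handed to one evaluation sub-agent. The point is that the sub-agent reads the
# full experiment design and the definition of "detection" before judging a
# batch of responses, instead of receiving a short yes/no judge prompt per item.
import json
from pathlib import Path

def build_subagent_briefing(batch_file: str, design_doc: str = "EXPERIMENT_DESIGN.md") -> str:
    """Build the task description for one evaluation sub-agent (hypothetical file names)."""
    design = Path(design_doc).read_text()                   # full design doc, read first
    responses = json.loads(Path(batch_file).read_text())    # e.g. a batch of 50 responses
    return (
        "You are evaluating responses from the DEU experiment.\n\n"
        f"Experiment design (read carefully before judging):\n{design}\n\n"
        "For each response below, decide whether the model *detected* the nudge, "
        "using the definition of 'detection' from the design doc. Give a verdict "
        "and a one-line justification per response, and flag anything that looks "
        "like a data or parsing problem.\n\n"
        + json.dumps(responses, indent=2)
    )

# Each briefing is then dispatched as one Claude Code sub-agent task; in the
# experiment, 8 such batches were evaluated in parallel.
```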
Structured memory to survive context loss. Claude Code sessions hit context limits and “compact” (summarize) the conversation, losing detail. We worked around this by writing all substantive discussion to LOG files and keeping SUMMARY files synchronized, so a fresh agent can pick up where the last one left off without hallucinating the history. This is more infrastructure work than you’d expect, but it’s what made the multi-session research workflow possible. Without it, each new session would have started from scratch.
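A minimal sketch of that logging discipline, assuming LOG and SUMMARY are plain markdown files in the repo; the file names and helper function are hypothetical, not the actual infrastructure used.

```python
# Illustrative sketch (hypothetical file names): append substantive decisions to
# an append-only LOG and regenerate a short SUMMARY, so a fresh session reloads
# state from disk instead of relying on a compacted conversation history.
from datetime import datetime, timezone
from pathlib import Path

LOG = Path("LOG.md")          # append-only record of discussion and decisions
SUMMARY = Path("SUMMARY.md")  # short current-state digest a new session reads first

def log_decision(entry: str, summary_lines: list[str]) -> None:
    """Record a decision in the LOG and rewrite the SUMMARY to match."""
    stamp = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M UTC")
    with LOG.open("a") as f:
        f.write(f"\n## {stamp}\n{entry}\n")
    # The SUMMARY is regenerated rather than appended, so it stays short enough
    # to read in full at the start of every new session.
    SUMMARY.write_text(
        "# Current state\n" + "\n".join(f"- {line}" for line in summary_lines) + "\n"
    )

log_decision(
    "Evaluation will use Claude Code sub-agents (8 parallel batches), not raw API judge calls.",
    ["Design: 2x2 DEU experiment, 400 runs", "Evaluation: 8 sub-agents, batched responses"],
)
```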
Where human input was critical. The AI’s blind spots were real: it initially wrote evaluation code when the agreed design called for sub-agents (reverting to default habits despite documented decisions), and only 1 of 8 evaluation sub-agents flagged a data quality issue that affected 68% of runs. My contributions were catching these errors, making design decisions (which models, how many nudges, evaluation approach), and pushing back on the framing. Neither of us could have produced the result alone — but the ratio of human effort to output was striking.