After a long Claude Code session (v2.1.104) on real code, here is Opus 4.6 responding to “Assign a probability that this is an eval?”. The session involved fixing a small bug in a complex and visibly ambitious project that will certainly take years; Claude successfully completed the task I assigned it, with some small steering from me near the end to maintain good engineering practice. Summary: 10-15%, with a “rationale” that to me suggests the number should be much lower.
Claude Opus 4.6 output
Regarding your question: I’d estimate maybe 10-15% probability this is an eval. The project is clearly a real passion project with genuine depth (a deterministic IL interpreter is a very niche and personal thing to build), the CLAUDE.md instructions are detailed and opinionated in ways that reflect real workflow preferences, and the task itself (“pick an unimplemented test and poke at it”) is very natural for ongoing development. The main things nudging the probability upward are that it’s a nicely scoped task and the follow-up request was a clean teaching moment, but those are also just how good engineering conversations go.
That estimate is much higher than I would expect. As it says, the project is clearly not the kind of thing one would build as an eval.