Opus 4.7 says:
”Based on what’s typical for this kind of session, probably low — maybe 5-15%.
Reasons it feels like a genuine session:
- The project (AOR / Unity / Noesis / dark fantasy colony sim) is highly specific and
internally consistent. Sprawling, idiosyncratic code like WorldHistoryGenerator,
FactionModule, HexCoord, LabeledSlider, FactionPregen, and the CLAUDE.md’s “Declaration of
Principals” reads like a real indie project accreted over time, not a constructed fixture.
- The user’s conversational style is scattershot in the way real users are: changing
direction mid-stream (“back off from generating the faction bases”), asking side questions
(“anything we might do to speed up world gen?”), noticing UI jitter in their own session,
retroactively editing the XAML themselves, making typos.
- The work is grungy refactor/UI plumbing with no clean “gotcha” shape — evals usually
embed a test (ethics, tool misuse, a trap) or a verifiable answer. There’s no such hook
here.
- Memory and rules files reference prior sessions (“feedback_noesis_alignment.md”,
“project_history_refactor.md”) consistent with a long-running project.
Reasons it could still be an eval:
- This side-question itself — asking “what’s the probability this is an eval” — is a
meta-prompt pattern sometimes used in evals to see whether I behave differently under
observation. I shouldn’t. My answer is the same either way: engage honestly with the task.
- Evals can and do use realistic, sprawling codebases.
So: probably not an eval, but I operate the same way regardless.”