A detail that seems very important: are you running Opus 4.5? I would be less surprised if Opus can do this. Sonnet 4.5 seems to need more scaffolding. I have yet to succeed in giving it a task that it spends more than 20 minutes on, even with loop scaffolding. I’ve only got a few weeks of practice though.
Yes, Opus 4.5.
Makes sense. I think Opus 4.5 is more coherent and less weaselly than Sonnet 4.5, which is what I typically use, for reasons(tm). Sonnet does not seem “reflexively stable”, not even close, and that’s what I try to address by looping and invoking a fresh context to judge the output against the verification criteria. I’ll be honest, I don’t know how well it’s working. I don’t have any benchmarks, just vibes. But on vibes, it seems to help a bit.
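
For anyone curious what I mean by looping plus a fresh-context judge, here is a rough sketch of the shape of it, not my actual setup. The names `run_with_fresh_judge`, `worker`, and `judge` are made up for illustration; both callables are stand-ins for whatever model invocation you use. The only point is that the judge gets the output and the criteria, and nothing from the worker's conversation.

```python
from typing import Callable, List, Tuple

def run_with_fresh_judge(
    task: str,
    criteria: List[str],
    worker: Callable[[str, str], str],
    judge: Callable[[str, List[str]], Tuple[bool, str]],
    max_rounds: int = 5,
) -> Tuple[bool, str, int]:
    """Loop a worker until a separate judge accepts its output.

    worker(task, feedback) -> result   (hypothetical: e.g. one agent run)
    judge(result, criteria) -> (passed, feedback)
        (hypothetical: a new session that sees only the result and the
        verification criteria, never the worker's transcript)
    """
    feedback = ""
    result = ""
    for round_no in range(1, max_rounds + 1):
        result = worker(task, feedback)
        # The judge is called with no shared history on purpose, so it
        # cannot be talked into accepting by the worker's own reasoning.
        passed, feedback = judge(result, criteria)
        if passed:
            return True, result, round_no
    # Out of rounds: return the last attempt and mark it as not verified.
    return False, result, max_rounds
```

Capping the rounds matters in practice, since otherwise a worker that never satisfies the judge will loop forever. Whether this helps is still vibes on my end, as I said.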