What sort of things do you solve with this? I feel like when I have a problem that’s not fairly easy for an AI to solve straightforwardly, if I sent it on a loop it’d just do a bunch of random crazy shit that was clearly not the right solution.
I can imagine a bunch of scaffolding that helps, but still, it seems like most of the work is in the problem specification, and I'm not sure whether I don't have the sort of problems that benefit from this or whether it's a skill issue.
You need a clear measure. For example, let's say you want to build a scripted bot that can play a novel game for which there is no off-the-shelf solution. You could try to train a neural net, but Claude can write code, so you fill in Y with "writing a bot that plays game Z".
This sort of strategy is obviously heavily dependent on the availability of a good evaluation method and a clear scoring mechanism. As such, it doesn't work for most problems, since most problems don't pair a large search space with that kind of clear, cheap scoring.
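The loop being described is essentially propose-score-keep. A minimal sketch in Python; `propose_bot` and `score` are hypothetical stand-ins for the agent call and the game's scoring mechanism, stubbed out here just to show the shape:

```python
def propose_bot(best_code: str, best_score: float) -> str:
    """Stand-in for asking the agent for a new bot given the current best.
    A real version would prompt an LLM with the code and its score;
    here we just append to a dummy string."""
    return best_code + "+"

def score(bot_code: str) -> float:
    """Stand-in for running the bot in the game and reading its score.
    Here we pretend longer code plays better."""
    return float(len(bot_code))

def search_loop(iterations: int = 20) -> tuple[str, float]:
    """Propose-score-keep: retain a candidate only if it beats the best so far."""
    best_code = "seed"
    best_score = score(best_code)
    for _ in range(iterations):
        candidate = propose_bot(best_code, best_score)
        s = score(candidate)
        if s > best_score:  # keep only strict improvements
            best_code, best_score = candidate, s
    return best_code, best_score
```

The key property is that the score, not the agent's self-assessment, decides what survives, which is why the whole thing hinges on that scoring mechanism existing.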
Yeah I get the principle, but, like, what in practice do you do where this is useful? Like concrete (even if slightly abstracted) examples of things you did with it.
Well, as I say in my example above, literally build a bot that plays a game.
Most of the loops end up much shorter, though, like "upgrade this package dependency and keep fixing bugs until the build passes". Sometimes these changes are kinda weird, so I try to get Claude to do what a human would do: keep trying things it thinks might work until the build passes.
Or, one I haven't done but might: keep adding tests until we hit X% coverage (and give some examples of what constitutes a good test). This one I expect to work better than you might think, since Opus is getting reasonably good at not specification-gaming and at actually trying to do what I mean, whereas Sonnet still frequently goes for the gaming.
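Both of these shorter loops ("fix until the build passes", "add tests until coverage hits X%") have the same structure: run a check, and if it fails, hand the failure output back to the agent. A sketch, with the build and the agent call passed in as callables so nothing here is tied to a real CLI:

```python
from typing import Callable

def loop_until_green(
    check: Callable[[], tuple[bool, str]],
    ask_agent: Callable[[str], None],
    max_attempts: int = 10,
) -> bool:
    """Generic fix-until-green loop.

    check() runs the build (or coverage report) and returns (passed, log);
    ask_agent(log) is a hypothetical stand-in for the agent call, e.g.
    shelling out to an agent CLI with the failure log in the prompt.
    Capping attempts keeps a stuck agent from looping forever.
    """
    for _ in range(max_attempts):
        passed, log = check()
        if passed:
            return True
        ask_agent(log)
    return False
```

The attempt cap matters in practice: if the agent hasn't converged in ten rounds, you usually want a human to look at the log rather than burn more iterations.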
Gotcha. Was the game one real for you? (I guess I’m looking for things that will show up in my day job, and trying to get a sense of whether people have different day-jobs than me, or doing random side projects, or what)
The test-coverage one is interesting.
Yes. Specifically I was building agents to play games as part of a beta with SoftMax.