For more context, the project in question was a fairly standard Laravel (PHP) project which makes heavy use of dependency injection, and which I’m looking at serving with Swoole rather than Apache. This involves moving from a shared-nothing architecture, where the framework is spun up in a fresh state for every request, to one where some things can persist from request to request. What I asked for was an exhaustive inventory of all the places where this could lead to data leakage from one request to the next, with a failing functional test demonstrating the leakage in as many of those places as viable. I provided an example of one such leak (a container-bound singleton with persistent state), instructions for how to run the test, an example of how to run arbitrary code within the framework context, and some examples of previously-built linting scripts which used reflection and AST parsing to identify problematic patterns programmatically.
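For concreteness, here is a minimal sketch of the shape of that seed example (class, method, and binding names here are hypothetical, not from the actual project):

```php
<?php
// Hypothetical sketch of the seed example: a singleton with per-request
// state. Bound in a service provider's register() method with
//     $this->app->singleton(RequestScopedCache::class);
// Under Apache the container dies with the request, so this is harmless;
// under Swoole the same instance serves every request a worker handles,
// and anything stored here bleeds from one request to the next.

namespace App\Services;

class RequestScopedCache
{
    /** @var array<string, mixed> */
    private array $items = [];

    public function put(string $key, mixed $value): void
    {
        $this->items[$key] = $value;
    }

    public function get(string $key): mixed
    {
        return $this->items[$key] ?? null;
    }
}
```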
Claude correctly identified the concrete places one might find such state persisting across requests: global variables, static class variables, singleton services bound to the container, dependencies of said singletons, and stateful database connections.
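The static-class-variable flavor, for instance, looks like this in miniature (again a hypothetical class, but any mutable static member behaves the same way):

```php
<?php
// Hypothetical illustration of the static-class-variable case. Statics
// live for the lifetime of the PHP process; under Apache that lifetime
// is (effectively) one request, but under Swoole it is the lifetime of
// the worker, so this counter bleeds across requests and across users.

namespace App\Support;

class RequestCounter
{
    private static int $count = 0;

    public static function bump(): int
    {
        return ++self::$count;
    }
}
```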
Initially Claude tried to identify potentially problematic patterns it could grep for, which turned up some examples. It then wrote two functional tests (which I later looked at and noticed were basically equivalent to `assertFalse(true)`), ran them to verify that they were failing, produced a report describing two “CRITICAL” issues with static class variables, declared that the application did not contain any instances of container-bound singletons with persistent state or state that could persist on a database connection between requests, and called the task done.
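For contrast, the kind of failing test I actually asked for would look something like this: two requests in one process, asserting that state written during the first is invisible to the second. Route and class names are hypothetical; in a Laravel feature test both requests already share one container, which approximates the persistent Swoole worker:

```php
<?php

namespace Tests\Feature;

use App\Services\RequestScopedCache;
use Tests\TestCase;

class SingletonLeakageTest extends TestCase
{
    public function test_singleton_state_does_not_survive_the_request(): void
    {
        // First "request" causes the singleton to store request-specific data.
        $this->get('/profile?user=alice')->assertOk();

        // With a persistent container (as under Swoole), the resolved
        // singleton still holds the first request's data, and this fails.
        $this->assertNull(
            app(RequestScopedCache::class)->get('current_user'),
            'Container-bound singleton leaked state across requests'
        );
    }
}
```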
I told it that the task was not finished, that I wanted an exhaustive list, and that this would necessarily involve writing and executing code within the framework context, and, again, provided some examples of how to do this. I also flagged that its search for container-bound singletons with persistent state should have, at a minimum, caught the instance of state leakage I had given as an example.
Claude then proceeded to write a plausible-looking script to find examples of container-bound singletons, ran it, and saved the results to a CSV; wrote a different plausible-looking script to find examples of services with non-DI instance variables, and saved those results to a second CSV; then used the `csvjoin` tool (which was cool, I didn’t realize that tool existed) to merge the two into a third CSV. That third CSV was empty, and Claude again declared that there were in fact no instances of that pattern and that it had successfully completed the task.
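If it helps picture what those scripts were trying to do, here is a rough sketch of the singleton-with-non-DI-state scan, assuming it runs inside a bootstrapped framework context (e.g. an artisan command); the property-vs-constructor-parameter heuristic and all names are illustrative, not Claude’s actual code:

```php
<?php
// Rough sketch: walk the container's singleton bindings and flag instance
// properties that are not constructor-injected, i.e. the candidates for
// per-request state. Assumes a bootstrapped Laravel application; only
// covers singleton() bindings whose abstract is a concrete class.

use Illuminate\Container\Container;

$app = Container::getInstance();

// The container keeps its bindings (with their "shared" flag) in a
// protected array, so reach in via reflection.
$bindingsProp = new ReflectionProperty($app, 'bindings');
$bindingsProp->setAccessible(true);

foreach ($bindingsProp->getValue($app) as $abstract => $binding) {
    if (empty($binding['shared']) || !class_exists($abstract)) {
        continue; // not a singleton, or an interface needing extra resolution
    }

    $class = new ReflectionClass($abstract);
    $ctorParams = array_map(
        fn (ReflectionParameter $p) => $p->getName(),
        $class->getConstructor()?->getParameters() ?? []
    );

    foreach ($class->getProperties() as $prop) {
        // Instance properties whose names don't match a constructor
        // parameter are (heuristically) mutable per-request state.
        if (!$prop->isStatic() && !in_array($prop->getName(), $ctorParams, true)) {
            echo "{$abstract}::\${$prop->getName()}\n";
        }
    }
}
```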
I mentioned, again, that it should at a minimum be turning up the example from the initial prompt. Claude went and spun its wheels for quite a while, and came back with several new plausible-looking scripts (by this point it had dropped over 30 different scripts, .md files, and CSVs in the repository root, with no organizational scheme to speak of). After quite a few more rounds of this (Claude would run the tools but not sanity-check the outputs, so I had to sanity-check them myself and point out which tool wasn’t working right, at which point Claude would write an entirely new version of that tool in the repository root, writing to yet another output file), Claude finally had a script which produced a complete list of the few hundred places that could exhibit this pattern, each of which would need to be checked individually to see what was actually happening and whether there were any trends that would allow the list to be pruned further programmatically. This was the point at which I decided to call it a day yesterday.
I might resume on Monday, but given how the beginning of the task went, I don’t have a lot of hope that Claude Code can finish the task at all.
To me the bottleneck mostly seemed to be that Claude trusted the tools it wrote way too much. When I write a tool like this, I tend to start with an output I know the tool will produce if it’s working correctly and a vague sense of how much output the tool should produce, write the first version of the tool, run it, and check the output to make sure it contains the thing I thought it should contain. Claude, on the other hand, seems to follow the algorithm “write tool. run tool. accept results of tool”.
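That habit is mechanical enough to write down. Something like the following check, with the seed example as the known positive (file and class names hypothetical), would have caught every one of the broken scans before I had to:

```php
<?php
// The sanity check in code: before trusting a scan's output, assert that
// it contains a known positive. If the seed example is missing, the tool
// is broken, not the codebase clean.

$knownLeak = 'App\\Services\\RequestScopedCache';
$rows = array_map('str_getcsv', file('singleton_candidates.csv', FILE_IGNORE_NEW_LINES));

$hits = array_filter($rows, fn (array $row) => ($row[0] ?? '') === $knownLeak);

if ($hits === []) {
    fwrite(STDERR, "Scan output is missing the known leak {$knownLeak}; do not trust it.\n");
    exit(1);
}

echo "Sanity check passed: known positive present.\n";
```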
Additionally, the product I want at the end of this is an exhaustive list of places to check, and a repeatable way to generate that list. Claude’s strength seems to be that it is unreasonably good at one-shotting things; once it has to hold in mind the current outputs of a tool it wrote, the outputs it wants from that tool, and the way the tool needs to change to get from one to the other, things seem to fall apart.
It looks like Ralph Wiggum is intended for a different use case (“Claude Code is successfully completing tasks, but stopping before it finishes with all of them”).