One might expect that an LLM’s success on a task depends on how many related instances of the task appear in its training data (I certainly thought so for a while). However, for every Div. 2 A problem there is a Div. 2 E problem with a tutorial to go along with it, yet Claude 3.7 has no trouble writing Div. 2 A problems and struggles to write Div. 2 E problems.
You seem to be hinting that there are as many Div. 2 E-difficulty problems as Div. 2 A-difficulty problems in Claude’s training data. While this might be approximately true for Codeforces, it is certainly not true for the wider internet. The “Balanced Delivery Routes” problem has appeared many times outside of Codeforces.
I suspect Claude vaguely remembered that such a problem is nice and solvable, and then spent its effort on changing the story instead of the core problem.
When you pressed it to make a harder problem, it took a classical problem, finding the path with the largest minimum edge weight (a sketch of this base version follows the list), and used several modifiers to make it harder:
- k-th minimum instead of just the minimum,
- multiple queries,
- updates between queries.
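For reference, the base problem on its own is easy: the answer is the bottleneck edge on the path between the two vertices in a maximum spanning tree, which falls out of a disjoint-set union over edges processed in decreasing weight order. A minimal C++ sketch (the function name and input format here are mine, not from Claude’s statement):

```cpp
// Classical base problem: maximize the minimum edge weight on an s-t path.
// Add edges in decreasing weight order with a DSU; the edge whose addition
// first connects s and t is the bottleneck value.
#include <bits/stdc++.h>
using namespace std;

struct DSU {
    vector<int> parent;
    DSU(int n) : parent(n) { iota(parent.begin(), parent.end(), 0); }
    int find(int x) { return parent[x] == x ? x : parent[x] = find(parent[x]); }
    void unite(int a, int b) { parent[find(a)] = find(b); }
};

// Returns the largest w such that some s-t path uses only edges of weight >= w,
// or -1 if s and t are disconnected. Edges are (weight, u, v).
int maximinPath(int n, vector<array<int,3>> edges, int s, int t) {
    sort(edges.rbegin(), edges.rend());  // heaviest edges first
    DSU dsu(n);
    for (auto [w, u, v] : edges) {
        dsu.unite(u, v);
        if (dsu.find(s) == dsu.find(t)) return w;  // first connection = bottleneck
    }
    return -1;
}

int main() {
    // Tiny example: the best path from 0 to 1 is 0-2-1, whose minimum edge is 4.
    vector<array<int,3>> edges = {{3, 0, 1}, {5, 0, 2}, {4, 2, 1}};
    cout << maximinPath(3, edges, 0, 1) << "\n";  // prints 4
}
```

Each modifier above breaks this clean structure: the k-th minimum is no longer captured by a single spanning tree, and updates between queries rule out one-shot preprocessing, which is part of why the modified version is so much harder.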
However, Claude did not properly verify that these modifiers kept the problem solvable, so it went way overboard. Even the sub-problem of deciding whether a given edge lies on some simple path between the two endpoints is fairly involved, so I strongly doubt that the full problem is solvable within the given constraints.
Problem authors typically go through many iterations of coming up with candidate problems and checking whether they are solvable before finding a good one. I don’t think Claude can autonomously complete this loop yet, especially for hard problems.