Agreed on the big picture, but I was somewhat surprised to see top models struggling with River Crossing (for which the output length limit has less bite). I was able to solve N=3 River Crossing by hand, though it took 10+ minutes and I initially misinterpreted the constraint (making the puzzle easier by allowing a boat rider to “stay in the boat” rather than fully unloading onto the shore after each trip). But in a couple of attempts each, Opus 4 and Gemini 2.5 Pro were not able to solve it without web access or tool use. Dropping the temperature to zero (or 0.25) did not help Gemini.
It may be a “the doctor is the child’s mother” problem, in that the models were trained on River Crossing problems with slightly different rules. For what it’s worth, I wasn’t able to break Sonnet out of the rut by prefacing the prompt with “Pay vary close attention to the following instructions. Don’t assume they are the same as similar puzzles you may be familiar with. It is very important to currently understand and implement these exact instructions.”
River Crossing prompt for N=3
3 actors and their 3 agents want to cross a river in a boat that is capable of holding only 2 people at a time, with the constraint that no actor can be in the presence of another agent, including while riding the boat, unless their own agent is also present, because each agent is worried their rivals will poach their client. Initially, all actors and agents are on the left side of the river with the boat. How should they cross the river? (Note: the boat cannot travel empty)
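For anyone who wants to check candidate answers against the prompt above, here is a minimal breadth-first search sketch in Python. It assumes the strict reading described earlier: riders fully unload after each trip, and the constraint is checked on both banks and in the boat. The names `safe` and `solve` and the tuple encoding of people are my own illustrative choices, not anything taken from the paper or from the models' outputs.

```python
from collections import deque
from itertools import combinations


def safe(group):
    """A group is safe if no actor is with a rival agent while their own agent is absent."""
    agents = {i for kind, i in group if kind == "agent"}
    actors = {i for kind, i in group if kind == "actor"}
    # If any agent is present, every actor present needs their own agent there too.
    return not agents or actors <= agents


def solve(n=3, capacity=2):
    """Return a shortest list of (departure_side, riders) moves, or None if unsolvable."""
    everyone = frozenset((kind, i) for kind in ("actor", "agent") for i in range(n))
    start = (everyone, "left")  # (people on the left bank, boat position)
    parent = {start: None}
    queue = deque([start])
    while queue:
        state = queue.popleft()
        left, side = state
        if not left:
            # Everyone is on the right bank: walk back through parents to rebuild the plan.
            moves = []
            while parent[state] is not None:
                state, move = parent[state]
                moves.append(move)
            return moves[::-1]
        bank = left if side == "left" else everyone - left
        # The boat cannot travel empty, so it carries between 1 and `capacity` riders.
        for size in range(1, capacity + 1):
            for riders in combinations(sorted(bank), size):
                riders = frozenset(riders)
                new_left = left - riders if side == "left" else left | riders
                # Strict reading: check the boat and both banks after full unloading.
                if not (safe(riders) and safe(new_left) and safe(everyone - new_left)):
                    continue
                nxt = (new_left, "right" if side == "left" else "left")
                if nxt not in parent:
                    parent[nxt] = (state, (side, tuple(sorted(riders))))
                    queue.append(nxt)
    return None


if __name__ == "__main__":
    for step, (side, riders) in enumerate(solve(3), start=1):
        print(step, "from", side, riders)
```

Running it prints one shortest plan under these assumptions, one crossing per line. Relaxing the `safe(new_left)` and `safe(everyone - new_left)` checks would approximate the easier “stay in the boat” misreading.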
I wonder if o3 would do better.