Chastity Ruth comments on Beware General Claims about “Generalizable Reasoning Capabilities” (of Modern AI Systems)

Chastity Ruth 12 Jun 2025 4:12 UTC
3 points
−3
I might be wrong here, but I don’t think the below is correct?

“For River Crossing, there’s an even simpler explanation for the observed failure at n>6: the problem is mathematically impossible, as proven in the literature, e.g. see page 2 of this arxiv paper.”

That paper says n ≥ 6 is impossible if the boat capacity is less than 4. But the prompt in the Apple paper allows for the boat capacity to change.

“$N$ actors and their $N$ agents want to cross a river in a boat that is capable of holding only $k$ people at a time...”
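
For anyone who wants to check which (N, k) pairs are solvable rather than take the literature on faith, here is a minimal brute-force sketch. It assumes the usual reading of the puzzle: no actor may be with another actor's agent unless their own agent is also present, and this is enforced on both banks and in the boat after every crossing. The function names `is_safe` and `min_crossings` are my own and do not come from either paper.

```python
from collections import deque
from itertools import combinations


def is_safe(group):
    """True if no actor is with a foreign agent without their own agent present."""
    agents = {i for role, i in group if role == "agent"}
    actors = {i for role, i in group if role == "actor"}
    # Safe when there are no agents at all, or every actor's own agent is present.
    return not agents or actors <= agents


def min_crossings(n, k):
    """Breadth-first search over (people on the start bank, boat on start bank?).

    Returns the minimum number of crossings, or None if no solution exists
    under the assumed rules.
    """
    everyone = frozenset((role, i) for role in ("actor", "agent") for i in range(n))
    start = (everyone, True)
    dist = {start: 0}
    queue = deque([start])
    while queue:
        left, boat_left = queue.popleft()
        if not left:  # everyone has reached the far bank
            return dist[(left, boat_left)]
        bank = left if boat_left else everyone - left
        # The boat cannot travel empty, so it carries between 1 and k people.
        for size in range(1, k + 1):
            for riders in combinations(sorted(bank), size):
                riders = frozenset(riders)
                new_left = left - riders if boat_left else left | riders
                if not (is_safe(riders)
                        and is_safe(new_left)
                        and is_safe(everyone - new_left)):
                    continue
                nxt = (new_left, not boat_left)
                if nxt not in dist:
                    dist[nxt] = dist[(left, boat_left)] + 1
                    queue.append(nxt)
    return None


if __name__ == "__main__":
    for n, k in [(3, 2), (5, 3), (6, 3), (6, 4)]:
        print(f"N={n}, k={k}:", min_crossings(n, k))
```

If the rules are modelled as above, I would expect this to report no solution for N=6 with k=3 and a solution for N=6 with k=4, which is the distinction at issue. A different reading of the constraint (for example, not enforcing it inside the boat) could change the results.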
- LawrenceC 14 Jun 2025 5:07 UTC
  4 points
  0
  Parent
  The authors specify that they used k=3 on page 6 of their paper:
  If they used a different k (such that the problem is possible), they did not say so in their paper.
  You’re right that I could’ve been more clear here, I’ve edited in a footnote to clarify.
  - Chastity Ruth 15 Jun 2025 2:04 UTC
    3 points
    2
    Parent
    Ah, fair enough. I had skipped right to their appendix, which has confusing language around this:
    
    “(1) Boat Capacity Constraint: The boat can carry at most k individuals at a time, where k is typically
    set to 2 for smaller puzzles (N ≤ 3) and 3 for larger puzzles (N ≤ 5); (2) Non-Empty Boat Constraint:
    The boat cannot travel empty and must have at least one person aboard...”
    
    The “N ≤ 5” here suggests that for N ≥ 6 they know they need to up the boat capacity. On the other hand, in the main body they’ve written what you’ve highlighted in the screenshot. Even in the appendix they write that “3” is “typically” used for larger puzzles. But it should be atypical: used only for 4 and 5 agent-actor pairs.
    
    Edit: That section can be found at the bottom of page 20
    - LawrenceC 15 Jun 2025 2:40 UTC
      3 points
      1
      Parent
      Super fair. I did not read that section in detail, and missed your interpretation on my skim. I interpreted it to mean “in the planning literature, we typically see k=3 for n<6” (which we do!), noted that they did not say the alternative value, and then went with the k=3 value they did say in the main body.
      You’re right that your interpretation is more natural. If they actually used k=4, then the problem is solvable and the paper was (a bit) better than I portrayed it to be here.
      Worth noting that, even assuming your interpretation is correct, it’s possible that the person who wrote the problem spec did know about the impossibility result, but the person running the experiments did not (and thus the experiments were run with k=3). But it does make it more likely that k=4 was used for n>5.
      I think my statement above that “they did not say” “a different value of k (such that the problem is possible)” still seems true. And if they used k=4 because they knew that k=3 is impossible, it seems quite sloppy to say that they used k=3 in the main body. But a poorly written paper is a different and lesser problem than the experiment being wrong.