Very reasonable.
Interesting challenge, looking forward to the eventual reveal!
chrisjbillington
Any suggestions on how I might validate the answers Claude gives, so that I don’t just waste your time sending a bunch of incorrect attempts?
Here is an attempt using an approach that worked well for things like substitution-ciphers-with-errors—Claude would overthink it and confuse itself, whereas encouraging it to act purely on instinct worked well, and telling it to repeat the question four times allowed it some gut-level thinking space without the kind of structured thinking that led it astray:
https://claude.ai/share/f00ed43b-e26b-4e98-b4f7-9ad77100fac0
(I have zero clue if this is remotely correct)
Here’s an example of the buggy substitution cipher that this approach worked well with, as far back as Sonnet 3.5 (“new”):
https://x.com/Chrisbilbo/status/1884004589453848945
I wonder if one reason SMTM might be advocating improbable theories, and not accepting bets, is because they are intentionally pursuing improbable theories. Their post on scurvy seems to make the point that you need to check things even if they seem improbable, as the truth sometimes turns out to be something that seemed improbable.
I (perhaps charitably) assumed they did not believe the lithium theory per se, but thought it was worth a more detailed look—having previously argued that the bar for that should be lower than others think.
I thought the lithium theory and the potato diet were just two of many possible things they might be looking into, with the idea being that they advocate a broader search generally.
Exactly what I was thinking.
Opus 4.7 being more likely to change tactics when perceiving itself to be stuck was both a strength and a weakness—Opus 4.7 would frequently abandon dungeons upon encountering difficulty, and go do something else, whether useful or not. When this applied to tactics within a puzzle it was an asset, but when it meant (temporarily) abandoning a puzzle that was in fact self-contained, it slowed Claude down.
For example, Claude had some difficulty in Rock Tunnel, and so went to obtain the FLASH HM. This HM is ordinarily useful for humans to light up the dark tunnel, but not necessary for Claude since the harness already highlights navigable tiles in the tunnel. So that just consumed time: Claude’s difficulty in the tunnel was no different to previous models and persistence would have been better.
Claude also went back and forth between puzzles/dungeons that it was capable of solving individually, just because upon (inevitably) encountering difficulty it was inclined to go do something else. So there was wasted travel time as it completed tasks across the game in parallel, switching between them, instead of serially.
I’m unsure how the plot of step counts shown in the post has interpreted the step data in order to make progress appear monotonic, since the game doesn’t impose a fixed order on many of these milestones. Plotting the raw step data for when Claude achieved each task (and using the first time Claude entered each dungeon as the point plotted), you can see the increased task-switching vs other runs: there’s a lot more “enter dungeon X” followed by one or more other tasks before “exit dungeon X”.
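To make the "enter dungeon X, do other things, exit dungeon X" pattern concrete, here is a rough sketch of how it could be counted from a milestone log. This is not the analysis behind my plot or the one in the post. The `Event` record, the `"enter"`/`"exit"`/`"task"` labels, and the assumption that `"exit"` is only logged once a dungeon is actually completed are all made up for illustration.

```python
from typing import NamedTuple


class Event(NamedTuple):
    # Hypothetical log record; the real step data may be shaped differently.
    step: int   # harness step at which the event was logged
    kind: str   # "enter", "exit" (assumed to mean completion), or "task"
    name: str   # dungeon or task name


def interleaved_tasks(events):
    """For each completed dungeon, count events belonging to other
    dungeons/tasks logged between its first "enter" and its "exit"."""
    ordered = sorted(events, key=lambda e: e.step)
    open_since = {}  # dungeon name -> index of its first "enter"
    counts = {}      # dungeon name -> number of interleaved other events
    for i, ev in enumerate(ordered):
        first_entry = ev.name not in open_since and ev.name not in counts
        if ev.kind == "enter" and first_entry:
            open_since[ev.name] = i
        elif ev.kind == "exit" and ev.name in open_since:
            start = open_since.pop(ev.name)
            counts[ev.name] = sum(
                1 for other in ordered[start + 1:i] if other.name != ev.name
            )
    return counts
```

A run that works through dungeons serially would give zeros everywhere, while a run that leaves Rock Tunnel to fetch FLASH and then comes back would give a nonzero count for Rock Tunnel. Comparing these counts between runs is one way to put a number on the task-switching.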
The huge number of steps saved by Opus 4.7 being able to queue up larger sequences of button presses in a single turn vs previous runs makes the step counts significantly less comparable, IMHO. The fact that Opus 4.7 could beat the game at all shows unambiguous improvement, but I think if you counted button presses instead of harness iterations it likely would show the model took more button presses to progress than 4.6.
Of course, the dev presumably allowed more button presses in a single turn because the model is better able to make use of this than previous models, so that’s not to say previous models would have been faster with more button presses allowed in a single turn—they may simply have used this power to make more mistakes and slow themselves down further.