Right, that sounds reasonable. One thing that makes me put less probability on this is that, at least so far, the domains where reasoning models seem to shine are math/code/logic-type tasks, with more general reasoning, like maintaining consistency in creative writing, not benefiting as much. I’ve sometimes enabled extended thinking when doing fiction-writing with Claude and haven’t noticed a clear difference.
That observation would at least be compatible with the story where reasoning models are good at tasks where you can automatically generate an unlimited number of problems and provide feedback on them automatically, but less good at tasks outside such domains. So I would expect reasoning models to eventually get to a point where they can reliably solve things in the class of the sliding square puzzle, but not necessarily get much better at anything else.
Though hmm. Let me consider this from an opposite angle. If I assume that reasoning models could perform better on these kinds of tasks, how might that happen?
What I just said: “Though hmm. Let me consider this from an opposite angle.” That’s the kind of general-purpose thought that can drastically improve one’s reasoning, and that the models could be taught to automatically do in order to e.g. reduce sycophancy. First they think about the issue from the frame that the user provided, but then they prompt themselves to consider the exact opposite point and synthesize those two perspectives.
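To make that concrete, here is a minimal sketch of what such an opposite-frame pattern could look like as plain prompt scaffolding. The `complete()` function and all the prompt wording are placeholders I’m assuming for illustration, not anything any lab actually trains on:

```python
# Hypothetical sketch of the "consider the opposite frame" pattern.
# complete() is a stand-in for whatever chat-completion call you use;
# all prompt wording here is made up for illustration.

def complete(prompt: str) -> str:
    raise NotImplementedError("stand-in for an LLM API call")

def debiased_answer(question: str) -> str:
    # Pass 1: answer within the frame the user provided.
    first = complete(f"Answer the following as asked:\n{question}")
    # Pass 2: deliberately argue from the opposite frame.
    opposite = complete(
        f"Question: {question}\n"
        f"Previous answer: {first}\n"
        "Now make the strongest case against that answer."
    )
    # Pass 3: synthesize the two perspectives into a final answer.
    return complete(
        f"Question: {question}\n"
        f"Perspective A: {first}\n"
        f"Perspective B: {opposite}\n"
        "Weigh both perspectives and give a balanced final answer."
    )
```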
There are some pretty straightforward strategies for catching the things in the more general-purpose reasoning category:
Following coaching instructions—teaching the model to go through all of the instructions in the system prompt and individually verify that it’s following each one. Could be parallelized, with different threads checking different conditions (see the sketch after this list).
Writing young characters—teaching the reasoning model to ask itself something like “is there anything about this character’s behavior that seems unrealistic given what we’ve already established about them?”.
One noteworthy point is that not all writers/readers want their characters to be totally realistic; some prefer to e.g. go with what the plot demands rather than what the characters would realistically do. But this is something that could easily be established, with the model aiming for more or less realism depending on which the user seems to want.
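For the first of these strategies, a rough sketch of what the parallelized instruction check might look like, again with `complete()` standing in for the LLM call and a naive one-instruction-per-line split of the system prompt assumed for illustration:

```python
# Hypothetical sketch of checking each system-prompt instruction in parallel.
# complete() stands in for an LLM call; splitting on lines is an assumption.
from concurrent.futures import ThreadPoolExecutor

def complete(prompt: str) -> str:
    raise NotImplementedError("stand-in for an LLM API call")

def violated_instructions(system_prompt: str, draft: str) -> list[str]:
    instructions = [line.strip() for line in system_prompt.splitlines() if line.strip()]

    def check(instruction: str) -> bool:
        verdict = complete(
            f"Instruction: {instruction}\n"
            f"Draft reply: {draft}\n"
            "Does the draft follow this instruction? Answer YES or NO."
        )
        return verdict.strip().upper().startswith("NO")

    # One thread per instruction, each verifying a single condition.
    with ThreadPoolExecutor() as pool:
        flags = list(pool.map(check, instructions))
    return [ins for ins, bad in zip(instructions, flags) if bad]
```

Anything this returns could then be fed into a revision pass before the reply is shown to the user.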
Actually, I think that some variant of just having the model repeatedly ask itself “is there anything about what I’ve written that seems unrealistic, strange, or that contradicts something previously established?” might catch most of those issues. For longer conversations, a larger number of threads could check against different parts of the conversation in parallel. As I mentioned in the post itself, the model is often totally capable of catching its mistake when it’s pointed out to it, so all we need is a way for it to prompt itself to check for issues in a way that’s sufficiently general to catch those things.
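The long-conversation version could look something like this, with each thread checking the draft against a different slice of the history; the chunk size and prompt wording are, again, just assumptions for illustration:

```python
# Hypothetical sketch of the generic self-check over a long conversation.
# complete() stands in for an LLM call; chunking by message count is assumed.
from concurrent.futures import ThreadPoolExecutor

def complete(prompt: str) -> str:
    raise NotImplementedError("stand-in for an LLM API call")

def consistency_issues(conversation: list[str], draft: str, chunk_size: int = 10) -> list[str]:
    chunks = [
        "\n".join(conversation[i:i + chunk_size])
        for i in range(0, len(conversation), chunk_size)
    ]

    def check(chunk: str) -> str:
        return complete(
            f"Earlier part of the conversation:\n{chunk}\n\n"
            f"New draft:\n{draft}\n\n"
            "Is there anything in the new draft that seems unrealistic, strange, "
            "or that contradicts something previously established? If not, reply NONE."
        )

    # Each slice of the history gets its own checking thread.
    with ThreadPoolExecutor() as pool:
        reports = list(pool.map(check, chunks))
    return [r for r in reports if r.strip().upper() != "NONE"]
```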
And that could then be propagated back into the base model as you say, so that the next time it writes or reasons about this kind of thing, it gets it right on the first try...
Okay, this makes me think that you might be right and actually ~all of this might be solvable with longer reasoning scaling after all; I originally said I’m at 70% confidence that reasoning models won’t help with this, but now I’m down to something like 30% at most. Edited the post to reflect this.
DeepSeek-R1 is currently the best model at creative writing as judged by Sonnet 3.7 (https://eqbench.com/creative_writing.html). This doesn’t necessarily correlate with human preferences, including coherence preferences, but having interacted with DeepSeek-v3 (original flavor), DeepSeek-R1-Zero, and DeepSeek-R1 … Personally, I think R1’s unique flavor in creative outputs slipped in when the thinking process got RL’d for legibility. This isn’t a particularly intuitive way to solve for creative writing with reasoning capability, but it gestures at the potential in “solving for writing”, given that some feedback on writing style (even orthogonal feedback) seems to have a significant impact on creative tasks.
Edit: Another (cheaper to run) comparison for creative capability in reasoning models is QwQ-32B vs Qwen2.5-32B (the base model) and Qwen2.5-32B-Instruct (the original instruct tune, not clearly in the ancestry of QwQ). Basically, I do not currently consider 3.7 a “reasoning” model at the same fundamental level as R1 or QwQ, even though it has learned to make use of reasoning better than it would have without training on it, so evidence from it about reasoning models is weaker.
>DeepSeek-R1 is currently the best model at creative writing as judged by Sonnet 3.7 (https://eqbench.com/creative_writing.html). This doesn’t necessarily correlate with human preferences, including coherence preferences.
It should be noted that “best at creative writing” is very different from “best at multi-turn writing and roleplaying in collaboration with humans”. I haven’t used R1 since its first major version (maybe it’s gotten better?), but it had some massive issues with instruction following, resulting in laser-focusing on irrelevant minor details (What’s that? The character has anger issues? Better write them breaking or damaging something in literally every reply) and generally being extremely hard to guide into actually writing what you want.
So sure, in theory it’s great at writing stories (and it is; it has a very distinctive voice compared to other AIs), but in multi-turn discussions (most practical uses, such as using it to help you write a story), getting it to follow the spirit of the prompt and write in line with what you want feels like pulling teeth.