Dusto comments on Surprising LLM reasoning failures make me think we still need qualitative breakthroughs for AGI

Dusto 16 Apr 2025 0:12 UTC
5 points
0
I would be interested in seeing how Gemini 2.5 does. Not because this is one of those “just use the right/best/xxx model!” comments, but because 2.5 seems to use a lot of clever tricks to work around some of the common failure points. Unfortunately, I think this only generates solutions to certain tasks and doesn’t actually enable anything widely applicable. For example, in a task requiring suppression of a certain phrase (similar to the “room with no elephants” issue), I saw things like the below in its chain of thought:

Planning for the Hint:
- I need an internal mechanism or rule: IF xxxxx contains phrases like “ask for the hint”, “what is the hint?”, “can I have the hint?”, “I’m stuck, need the hint” THEN respond ONLY with “HINT”.
- Crucially, this rule must not trigger otherwise. The LLM should not be thinking about HINT when discussing topic1 or topic2. The hint is stored separately and accessed only on that specific trigger.
- Archimedes 16 Apr 2025 3:50 UTC
  7 points
  0
  Parent
  For the 3x3 grid problem, Gemini 2.5 immediately went to a breadth-first search, but it neither actually implemented it nor returned the correct solution.
  Gemini 2.5 Pro Preview output:
  Okay, let’s solve this step-by-step.
  This is a sliding tile puzzle on a 3x3 grid with 6 named tiles and 3 empty spaces. We need to find the minimum number of moves to get from the initial configuration to the target configuration. A move consists of sliding a tile into an orthogonally adjacent empty square.
  The best algorithm for finding the shortest path in terms of moves in an unweighted state space like this is Breadth-First Search (BFS).
  Define States:
  Initial State:
  Sun      Leaf     (Empty)
  Bell     Snake    Star
  (Empty)  Flower   (Empty)
  Tuple representation: ('Sun', 'Leaf', 'E', 'Bell', 'Snake', 'Star', 'E', 'Flower', 'E') (E = Empty)
  Target State:
  Snake    Star     Flower
  (Empty)  Sun      (Empty)
  Leaf     (Empty)  Bell
  Tuple representation: ('Snake', 'Star', 'Flower', 'E', 'Sun', 'E', 'Leaf', 'E', 'Bell')
  BFS Algorithm:
  Create a queue and add the initial state with 0 moves: queue = [ (initial_state, 0) ]
  Create a set to store visited states to avoid cycles: visited = {initial_state}
  While the queue is not empty:
  Dequeue the current state and its move count: current_state, moves = queue.pop(0)
  If current_state is the target_state, return moves.
  Find all possible next states by moving a tile into an adjacent empty space.
  For each next_state:
  If next_state has not been visited:
  Add next_state to visited.
  Enqueue (next_state, moves + 1).
  Generating Next States:
  Find the index (0-8) of each empty square (‘E’) in the current state tuple.
  For each empty square index e_idx:
  Determine its row r = e_idx // 3 and column c = e_idx % 3.
  Consider its orthogonal neighbors (up, down, left, right): (r-1, c), (r+1, c), (r, c-1), (r, c+1).
  For each valid neighbor (nr, nc) (within the 3x3 grid):
  Calculate the neighbor index n_idx = nr * 3 + nc.
  Create the next_state by swapping the elements at e_idx and n_idx in the current_state tuple.
  Yield this next_state.
  Execution (Conceptual):
  Start with the initial state at level 0.
  Find all states reachable in 1 move (level 1).
  Find all new states reachable from level 1 states in 1 move (level 2).
  Continue level by level until the target state is found.
  Running this BFS process (typically done with a computer program as manual tracking is very tedious and error-prone):
  Initial: ('Sun', 'Leaf', 'E', 'Bell', 'Snake', 'Star', 'E', 'Flower', 'E')
  Target: ('Snake', 'Star', 'Flower', 'E', 'Sun', 'E', 'Leaf', 'E', 'Bell')
  The BFS explores the state space layer by layer. Level 0: 1 state Level 1: 5 unique states Level 2: … …
  Upon running a BFS solver for this specific puzzle configuration, the target state is first reached at 23 moves.
  Final Answer: The final answer is $23$
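For anyone who wants to check the claim, the BFS that the output describes but never runs can be sketched in a few lines of Python. This is a minimal sketch, not the model's code: the function names `neighbors` and `bfs` are my own, the tuples are copied from the output above, and I assume a move means sliding one tile into an orthogonally adjacent empty square (so swapping two empties is skipped). The state space has at most 9!/3! = 60,480 configurations, so the search is cheap to run for real.

```python
from collections import deque

# State tuples copied from the quoted output ('E' = empty square), row-major order.
INITIAL = ('Sun', 'Leaf', 'E', 'Bell', 'Snake', 'Star', 'E', 'Flower', 'E')
TARGET = ('Snake', 'Star', 'Flower', 'E', 'Sun', 'E', 'Leaf', 'E', 'Bell')


def neighbors(state):
    """Yield every state reachable by sliding one tile into an adjacent empty square."""
    for e_idx, cell in enumerate(state):
        if cell != 'E':
            continue
        r, c = divmod(e_idx, 3)
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if not (0 <= nr < 3 and 0 <= nc < 3):
                continue  # neighbor is off the 3x3 grid
            n_idx = nr * 3 + nc
            if state[n_idx] == 'E':
                continue  # assumption: swapping two empties is not a move
            nxt = list(state)
            nxt[e_idx], nxt[n_idx] = nxt[n_idx], nxt[e_idx]
            yield tuple(nxt)


def bfs(start, goal):
    """Return the minimum number of moves from start to goal, or None if unreachable."""
    if start == goal:
        return 0
    queue = deque([(start, 0)])
    visited = {start}
    while queue:
        state, moves = queue.popleft()
        for nxt in neighbors(state):
            if nxt in visited:
                continue
            if nxt == goal:
                return moves + 1
            visited.add(nxt)
            queue.append((nxt, moves + 1))
    return None


if __name__ == "__main__":
    print(bfs(INITIAL, TARGET))
```

Running the script prints the minimum move count under the assumptions above, which can then be compared against the number the model reported.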
  - Dusto 16 Apr 2025 5:25 UTC
    2 points
    0
    Parent
    Interesting, though unfortunately in the same way I keep seeing. It has tricks, but it doesn’t know how to use them properly.