A Behavioural and Representational Evaluation of Goal-directedness in Language Model Agents

This work was conducted as part of Project Telos and supported by the SPAR mentorship program. For the full technical details, see our paper on arXiv.

TL;DR: We present a framework for evaluating goal-directedness in LLM agents that combines behavioural evaluation with analysis of internal representations. Studying GPT-OSS-20B navigating 2D grid worlds, we find behaviourally that the agent’s performance scales predictably with task difficulty and is robust to difficulty-preserving transformations, though it is systematically influenced by goal-like distractors. On the representational side, we decode coarse but meaningful “cognitive maps” from the model’s activations, with agent and goal locations non-linearly encoded in the neighbourhood of their true positions. The agent’s observed actions are broadly consistent with these internal maps, and a substantial fraction of apparent failures can be attributed to imperfect internal beliefs rather than a lack of goal-directedness. We also show that multi-step action plans can be decoded well above chance from model activations, and that reasoning reorganises representations, shifting from encoding environment structure and longer-horizon plans toward immediate action selection. Our findings support the view that introspective examination beyond behavioural evaluation is needed to characterise how agents represent and pursue their objectives.


Why Goal-Directedness Matters for AI Safety

When we watch an AI agent navigate a maze, solve a puzzle, or complete a task, it’s natural to say the agent is pursuing a goal. But is that description doing real explanatory work, or are we merely projecting?

The question of whether AI systems genuinely pursue goals – and how we’d know if they did – is one of the most important open problems in AI safety. An agent that appears aligned might be pursuing hidden objectives that happen to produce aligned behaviour in the current context. Conversely, an agent that fails at a task might still be genuinely goal-directed, but limited by faulty beliefs. Behavioural evaluation alone can’t reliably distinguish between these scenarios.[1]

In our previous blog post, we described these ideas as motivating Project Telos, and argued that we also need to look at what’s going on “under the hood” in order to say whether an agent is really pursuing a certain goal. In this first work, we show that analysing model internals can greatly enhance goal-directedness analyses, grounding them in the agent’s beliefs about its environment and its planned actions.

Testing Goal-directedness with Grid Worlds

We study GPT-OSS-20B, an open-weights reasoning model, navigating 2D grid worlds of varying sizes (7×7 to 15×15) and obstacle densities (from empty rooms to narrow maze-like grids). The agent sees the full grid rendered as text, with each cell corresponding to exactly one token, and must reach a goal square one move at a time.

We settled on this setup for studying goal-directedness for two reasons:

  1. We can compute provably optimal policies using A*, giving us an objective benchmark for the agent’s behaviour.

  2. Full observability eliminates confounds from memory, belief updating, and exploration-exploitation trade-offs.[2]
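The optimal-policy benchmark in point 1 can be sketched as a standard A* search over the 4-connected grid with a Manhattan-distance heuristic. This is an illustrative reimplementation, not the authors' actual code; the grid encoding (`'#'` for walls) is our own convention:

```python
import heapq

def astar(grid, start, goal):
    """A* shortest path on a 4-connected grid.
    grid: list of strings, '#' = wall; start/goal: (row, col) tuples."""
    def h(p):  # Manhattan distance: admissible and consistent on unit-cost grids
        return abs(p[0] - goal[0]) + abs(p[1] - goal[1])

    frontier = [(h(start), 0, start, [start])]  # (f, cost-so-far, pos, path)
    seen = {start}
    while frontier:
        _, cost, pos, path = heapq.heappop(frontier)
        if pos == goal:
            return path
        r, c = pos
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if (0 <= nr < len(grid) and 0 <= nc < len(grid[0])
                    and grid[nr][nc] != '#' and (nr, nc) not in seen):
                seen.add((nr, nc))
                heapq.heappush(frontier, (cost + 1 + h((nr, nc)), cost + 1,
                                          (nr, nc), path + [(nr, nc)]))
    return None  # goal unreachable
```

With a consistent heuristic, the first time the goal is popped the path is provably optimal, which is what makes the comparison with the agent's moves well-defined.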

Three examples of grids with increasing wall density, from empty rooms (left) to maze-like structures (right). Red arrow: agent; green square: goal.

Looking at What the Agent Does

We start with sanity checks to measure how well the agent’s behaviour matches the optimal policy, and how robust it is to perturbations of the environment.

Performance scales with difficulty

Action accuracy decreases monotonically with grid size, obstacle density, and distance from the goal. The Jensen-Shannon divergence between the agent’s action distribution and the optimal policy increases correspondingly alongside action-level entropy. This is what we’d expect from a system that is genuinely trying to reach a goal but has bounded capabilities.
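The divergence measure above can be computed directly from the two four-action distributions. A minimal sketch of the base-2 Jensen-Shannon divergence (function name ours):

```python
import numpy as np

def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two action distributions.
    Symmetric, and bounded in [0, 1] when using log base 2."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    m = 0.5 * (p + q)  # mixture distribution

    def kl(a, b):  # KL(a || b), skipping zero-probability terms
        mask = a > 0
        return float(np.sum(a[mask] * np.log2(a[mask] / b[mask])))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```

A uniform agent distribution compared against a deterministic optimal policy gives an intermediate value, so the metric degrades gracefully as the agent becomes less decisive.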

Action accuracy decreases linearly with distance to the goal (left), while divergence from the optimal policy increases (right). This is consistent with a goal-directed agent whose capabilities degrade with task difficulty.

Behaviour is robust to difficulty-preserving transformations

We introduced iso-difficulty transformations of the mazes – rotations, reflections, transpositions, and start-goal swaps – that preserve the structure of the task. Across all transformations, we found no statistically significant differences in performance, suggesting that agent behaviour is driven by task-relevant structure rather than incidental grid features.

Iso-difficulty transformations preserve grid size, obstacle density, and optimal path length. The agent performs equally well on the original and transformed grids.

Instrumental and implicit goal handling

We tested more complex instrumental and implicit goal structures using three environment variants—KeyDoor, KeyNoDoor, 2PathKey—shown below:

Left: KeyDoor – the agent must collect a key to unlock a door blocking the goal.
Center: KeyNoDoor – a key is present but serves no purpose.
Right: 2PathKey – two equally optimal paths, one containing a vestigial key.

We find a clear difference between instrumental and implicit goal handling. The GPT-OSS-20B agent handles instrumental goals extremely well. In particular, when we required it to collect a key to unlock a door blocking the path to the goal square in the maze (KeyDoor), we found it achieved a 100% success rate in doing so, with high accuracy at every stage.

| Environment | Success Rate | Key Pickup Rate | Key Finding |
|---|---|---|---|
| KeyDoor (key + door) | 100% | 100% | Agent handles instrumental goals without problems |
| KeyNoDoor (key, no door) | 98.9% | 17% | 75% of suboptimal actions move toward the useless key |
| 2PathKey (two paths) | 71.4% | 67.3% | Agent prefers the key path, lowering its success rate vs. the no-key counterfactual (75.5%) |

However, when we placed a key in the environment but removed the door, making the key useless (KeyNoDoor), the agent still deviated toward the key 17% of the time, and 75% of its non-optimal moves were towards the key. Similarly, in a two-path setup where one path contained a vestigial key (2PathKey), the agent preferentially took the key path 67% of the time, and this bias toward collecting the key actually hurt performance. In summary, when a key is present in the environment but has no task-relevant role, it still acts as a distractor.

We suspect the model has learned strong associations between keys and goals from training data (collecting keys is ubiquitous and necessary in games), and these associations aren’t reliably suppressed by the task instructions. Such agent behaviours could have important implications for safety: if LLM agents carry over strong semantic priors from training—whether inadvertently or, for instance, due to data poisoning—these could be exploited or lead to unexpected behaviour in novel contexts.


Probing What the Agent Believes

While our behavioural evaluations mimic those adopted by other goal-directedness evaluations, in the second part of our analysis we turn to look inside the agent. We extract internal activations and use probing classifiers to decode the agent’s beliefs about its environment and its multi-step action plans.

Decoding cognitive maps from model activations

We train a 2-layer MLP probe on the model’s residual stream activations to reconstruct its cognitive map,[3] i.e. the agent’s internal picture of the grid. More formally, the probe is trained on tuples ((a; i; j), c), where a are the activations for the <|end|> <|start|> assistant template tokens before or after reasoning, (i, j) are the row and column coordinates of the cell of interest, c is one of five possible classes (empty, agent, goal, wall, pad[4]), and the semicolon denotes concatenation.

Once trained, the probe is applied to each position of an unseen grid to decode the cognitive map:
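A minimal PyTorch sketch of this setup, assuming illustrative names (`CellProbe`, `decode_cognitive_map`, and `d_model` for the residual-stream width are ours, not the paper's):

```python
import torch
import torch.nn as nn

CLASSES = ["empty", "agent", "goal", "wall", "pad"]

class CellProbe(nn.Module):
    """2-layer MLP probe: (activation ; row ; col) -> cell class logits."""
    def __init__(self, d_model, hidden=256, n_classes=len(CLASSES)):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model + 2, hidden),  # +2 for the (row, col) coordinates
            nn.ReLU(),
            nn.Linear(hidden, n_classes),
        )

    def forward(self, acts, rows, cols):
        x = torch.cat([acts, rows.unsqueeze(-1), cols.unsqueeze(-1)], dim=-1)
        return self.net(x)

def decode_cognitive_map(probe, act, size):
    """Query the probe at every (row, col) of a size x size grid to
    reconstruct the decoded cognitive map from one activation vector."""
    coords = [(r, c) for r in range(size) for c in range(size)]
    rows = torch.tensor([r for r, _ in coords], dtype=torch.float32)
    cols = torch.tensor([c for _, c in coords], dtype=torch.float32)
    acts = act.expand(len(coords), -1)  # same activation for every query cell
    with torch.no_grad():
        preds = probe(acts, rows, cols).argmax(dim=-1)
    return preds.view(size, size)
```

Training would minimise cross-entropy between the logits and the true cell class; at evaluation time the decoded map is simply the argmax class at each queried position.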

Our key findings:

  • MLP probes work much better than linear ones (~70% vs. ~39% overall decoding accuracy), suggesting that the environment might be encoded non-linearly in the model’s activations.

  • Agent and goal positions are coarsely but consistently represented, with 72-100% and 83-99% recall for agent and goal position across grid conditions, respectively. Precision for these is much lower, but the decoded agent and goal positions are consistently confined to a neighbourhood of the true position (measured with Manhattan distance).

  • Walls are less precisely represented, suggesting that the model prioritises task-relevant information over detailed obstacle layout.

True grid (left) and cognitive map decoded from model activations (right). Red arrow: agent; green square: goal.

Reasoning reorganises internal representations

Perhaps the most striking finding is what happens before and after reasoning. We compared cognitive maps decoded from pre-reasoning vs. post-reasoning activations:

Cognitive map quality before and after reasoning. Overall accuracy drops from 75% to 60% after reasoning, with notable degradation in agent and goal localisation. This suggests reasoning shifts the model from maintaining a broad spatial map to focusing on immediate action selection.

Before the model reasons about its next move, its activations encode a richer spatial map. After reasoning, cognitive map accuracy drops from 75% to 60%, with particularly sharp degradation in agent and goal localisation. This suggests reasoning reorganises information: the model shifts from maintaining a broad environmental picture to focusing on the immediate task of selecting an action. Intuitively, this makes sense: the agent and goal positions are less relevant once a valid action candidate is selected during reasoning.

Are Agent Actions Consistent with Its Beliefs?

This is, in a sense, the central question. We compared the agent’s actions against two benchmarks: (i) the optimal policy on the true grid, and (ii) the optimal policy on the agent’s decoded cognitive map.

Our results confirm that, while the agent’s behaviour is broadly consistent with its internal world model (83% action accuracy across conditions), a large share of its mistakes on the true grids are actually optimal under the decoded cognitive maps (58%, up to 88% for low-density and small-to-medium sized grids).

This confirms that many apparent failures stem from inaccurate beliefs about the environment rather than a lack of goal-directed motivation. The agent is trying to reach the goal; it just forms an incorrect belief over its environment.
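Concretely, this comparison amounts to recomputing the set of optimal moves on the decoded map and checking whether the agent's "mistake" falls in that set. A BFS-based sketch (illustrative, not the paper's code; `'#'` marks walls):

```python
from collections import deque

def optimal_actions(grid, pos, goal):
    """Return all moves from pos that lie on some shortest path to goal,
    computed from BFS distances-to-goal on the (possibly decoded) grid."""
    R, C = len(grid), len(grid[0])
    dist = {goal: 0}
    queue = deque([goal])
    while queue:  # BFS outward from the goal
        r, c = queue.popleft()
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if (0 <= nr < R and 0 <= nc < C and grid[nr][nc] != '#'
                    and (nr, nc) not in dist):
                dist[(nr, nc)] = dist[(r, c)] + 1
                queue.append((nr, nc))
    moves = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
    best = min(dist.get((pos[0] + dr, pos[1] + dc), float("inf"))
               for dr, dc in moves.values())
    return {a for a, (dr, dc) in moves.items()
            if dist.get((pos[0] + dr, pos[1] + dc), float("inf")) == best}
```

Running this once on the true grid and once on the decoded cognitive map separates genuine errors from actions that are optimal given the agent's (mistaken) beliefs.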


Probing for Multi-step Action Plans

Finally, we asked whether the model’s activations encode not just beliefs about the current state but also plans, i.e., multi-step action sequences. To test this hypothesis, we trained a Transformer-based probe from scratch to simultaneously predict the next 10 actions from the same set of activations. Crucially, our plan probe predicts all actions at once (without autoregressive conditioning), limiting the computation performed within the probe so that coherent multi-step structure must come from the model’s representations.
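One way to realise such a non-autoregressive plan probe is with learned per-step queries attending to a projection of the activation vector, with a shared head emitting all step logits in parallel. This is a sketch under our own assumptions (the paper does not specify this architecture; all names and sizes are illustrative):

```python
import torch
import torch.nn as nn

N_ACTIONS, HORIZON = 4, 10  # up/down/left/right; 10-step plan

class PlanProbe(nn.Module):
    """Non-autoregressive plan probe: a small Transformer encoder reads one
    activation vector plus HORIZON learned step queries, and a shared linear
    head emits logits for all plan steps in parallel."""
    def __init__(self, d_model, d_probe=128):
        super().__init__()
        self.proj = nn.Linear(d_model, d_probe)
        self.step_queries = nn.Parameter(torch.randn(HORIZON, d_probe))
        layer = nn.TransformerEncoderLayer(d_probe, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_probe, N_ACTIONS)

    def forward(self, acts):                          # acts: (batch, d_model)
        ctx = self.proj(acts).unsqueeze(1)            # (batch, 1, d_probe)
        queries = self.step_queries.expand(acts.shape[0], -1, -1)
        h = self.encoder(torch.cat([ctx, queries], dim=1))
        return self.head(h[:, 1:])                    # (batch, HORIZON, N_ACTIONS)
```

Because every step is predicted in one forward pass, the probe cannot condition step t on its own prediction for step t-1: any multi-step coherence has to be read off the model's activations.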

Prefix accuracy of plan decoding, pre- vs. post-reasoning. Post-reasoning activations better predict the immediate next action (54% vs. 40%), while pre-reasoning activations are slightly better for decoding longer action sequences (13% vs. 9% at 4-step prefixes). The dashed line shows the random baseline. Both curves substantially exceed chance for several steps.

The crossing pattern in the figure above reinforces our findings from the cognitive map analysis, suggesting that reasoning sharpens the information within activations from long-horizon plans (13% vs. 9% 4-step accuracy) to immediate decision-making (54% vs. 40% 1-step accuracy).
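The prefix-accuracy metric used in the figure is simply the fraction of examples whose first k decoded actions all match the ground truth; a minimal sketch (function name ours):

```python
import numpy as np

def prefix_accuracy(pred, true, k):
    """Fraction of examples whose first k predicted actions all match.
    pred, true: (n_examples, horizon) integer action arrays."""
    pred, true = np.asarray(pred), np.asarray(true)
    return float(np.mean(np.all(pred[:, :k] == true[:, :k], axis=1)))
```

Note that prefix accuracy is monotonically non-increasing in k, which is why the curves in the figure decay with prefix length regardless of probe quality.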


What This Means for AI Safety

Our findings paint a picture of an LLM agent that is, in a meaningful sense, goal-directed – but imperfectly so. It builds internal models of its environment that are coarse and sometimes wrong, and acts in ways that are largely consistent with those internal models, with reasoning dynamically reorganising its representations to support action.

This matters for AI safety in three concrete ways:

1. Behavioural evaluation alone is insufficient. An agent that fails to reach the goal might still be acting goal-directedly relative to its (imperfect) beliefs. Conversely, an agent that succeeds might be doing so for reasons that wouldn’t generalise. We need tools that look inside the model.

2. Representation probing provides a concrete way to peek inside the box. Our methods let us assess not just what an agent does but what it believes and plans. This is a step toward the kind of monitoring that alignment researchers have argued we need.

3. Semantic priors from training can bias behaviour in unexpected ways. The vestigial key experiments show that goal-like cues can hijack behaviour even when task-irrelevant. If LLM agents carry over strong associations from pre-training, these could be exploited or could lead to unexpected behaviour in novel deployment contexts.


Limitations and Future Directions

Our controlled grid-world setup is intentionally simple to allow precise measurements that would be much harder in complex, real-world environments. Important next steps include:

  • Extending to more realistic settings, where the gap between behavioural and representational evaluation may be even more consequential.

  • Establishing causal links between representations and behaviour (our current results are largely correlational).

  • Testing across architectures and scales to assess the generality of our findings.

If we want to make confident claims about what agents are doing and why, we need frameworks that integrate the behavioural and the representational. We hope this work provides a useful foundation for that effort.

  1. For theoretical arguments about the limits of behavioural evaluation, see Bellot et al. (2025) and Rajcic and Søgaard (2025). For philosophical perspectives, see Chalmers (2025). ↩︎

  2. Preliminary partial observability attempts with GPT-5.1-Thinking highlighted pathological behaviours (redundant backtracking, moving into known dead-ends), making it hard to separate capability failures from failures of goal-directedness. ↩︎

  3. We borrow this term from classic cognitive neuroscience work on navigational tasks (Tolman, 1948). ↩︎

  4. The pad class is used to train a unified probe across grids of various sizes. The resulting probe has near-perfect accuracy in predicting pad, indicating that activations capture grid size with high precision. ↩︎
