Goal-Directedness

Last edit: 30 Dec 2024 9:40 UTC by Dakara

Goal-directedness is the property of a system that aims at some goal. The concept still lacks a formalization, but it may prove important in deciding which kinds of AI to try to align.

A goal may be defined as a world-state that an agent tries to achieve. Goal-directed agents may generate internal representations of desired end states, compare them against their internal representation of the current state of the world, and formulate plans for navigating from the latter to the former.

The goal-generating function may be derived from a pre-programmed lookup table (for simple worlds), from directly inverting the agent’s utility function (for simple utility functions), or it may be learned through experience mapping states to rewards and predicting which states will produce the largest rewards. The plan-generating algorithm could range from shortest-path algorithms like A* or Dijkstra’s algorithm (for fully-representable world graphs), to policy functions that learn through RL which actions bring the current state closer to the goal state (for simple AI), to some combination or extrapolation (for more advanced AI).
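As a rough illustration of the explicit case, the sketch below pairs a goal-generating step (pick the state with the largest predicted reward) with a plan-generating step (Dijkstra's algorithm over a fully represented world graph). The world graph, the reward table, and the function names `choose_goal` and `plan` are all hypothetical, chosen only for this example; real agents would rarely have such a small, fully known state space.

```python
import heapq

def choose_goal(reward_by_state):
    """Goal generation: pick the state with the largest predicted reward."""
    return max(reward_by_state, key=reward_by_state.get)

def plan(graph, start, goal):
    """Plan generation: Dijkstra's algorithm over a weighted world graph.

    Returns the cheapest list of states from start to goal,
    or None if the goal cannot be reached.
    """
    frontier = [(0, start, [start])]  # (cost so far, state, path taken)
    visited = set()
    while frontier:
        cost, state, path = heapq.heappop(frontier)
        if state == goal:
            return path
        if state in visited:
            continue
        visited.add(state)
        for neighbor, step_cost in graph.get(state, {}).items():
            if neighbor not in visited:
                heapq.heappush(
                    frontier, (cost + step_cost, neighbor, path + [neighbor])
                )
    return None

# Hypothetical toy world: edge weights are action costs.
WORLD = {
    "A": {"B": 1, "C": 4},
    "B": {"C": 1, "D": 5},
    "C": {"D": 1},
    "D": {},
}
# Hypothetical predicted reward for each state.
REWARDS = {"A": 0.0, "B": 0.1, "C": 0.2, "D": 1.0}

goal = choose_goal(REWARDS)    # desired end state
path = plan(WORLD, "A", goal)  # route from the current state to the goal
print(goal, path)              # D ['A', 'B', 'C', 'D']
```

Here the goal is an explicit object inside the agent: it is computed, stored in `goal`, and compared against the current state by the planner.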

Implicit goal-directedness may come about in agents that do not have explicit internal representations of goals but that nevertheless learn or enact policies that cause the environment to converge on a certain state or set of states. Such implicit goal-directedness may arise, for instance, in simple reinforcement learning agents, which learn a policy function $\pi : S \to A$ that maps states directly to actions.
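A minimal sketch of the implicit case, under assumed toy conditions: a tabular policy on a seven-state line world that stores no goal anywhere, yet drives every starting state to the same end state. The environment, the `POLICY` table, and the helper names are hypothetical, and the table is written by hand here to stand in for a policy that would normally be learned.

```python
def step(state, action):
    """Toy environment: move along a line of states 0..6, clipped at the ends."""
    return max(0, min(6, state + action))

# A policy pi: S -> A as a plain lookup table from states to actions.
# Nothing in it names a goal state; it only says what to do in each state.
POLICY = {0: +1, 1: +1, 2: +1, 3: 0, 4: -1, 5: -1, 6: -1}

def rollout(start, n_steps=10):
    """Follow the policy for n_steps and return the final state."""
    state = start
    for _ in range(n_steps):
        state = step(state, POLICY[state])
    return state

# From every starting state the environment converges on state 3.
final_states = {rollout(s) for s in range(7)}
print(final_states)  # {3}
```

An outside observer could describe this agent as "trying to reach state 3", even though that description appears only in the behavior, not in any internal representation.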

Literature Review on Goal-Directedness

adamShimi, Michele Campolo and Joe Collman

18 Jan 2021 11:15 UTC

80 points

21 comments31 min readLW link

Coherence arguments do not entail goal-directed behavior

Rohin Shah3 Dec 2018 3:26 UTC

140 points

69 comments7 min readLW link 3 reviews

FAQ: What the heck is goal agnosticism?

porby8 Oct 2023 19:11 UTC

66 points

38 comments28 min readLW link

Behavioral Sufficient Statistics for Goal-Directedness

adamShimi11 Mar 2021 15:01 UTC

21 points

12 comments9 min readLW link

Deliberation Everywhere: Simple Examples

Oliver Sourbut27 Jun 2022 17:26 UTC

28 points

3 comments15 min readLW link

Goals and short descriptions

Michele Campolo2 Jul 2020 17:41 UTC

14 points

8 comments5 min readLW link

Goal-Directedness: What Success Looks Like

adamShimi16 Aug 2020 18:33 UTC

9 points

0 comments2 min readLW link

Goal-directedness is behavioral, not structural

adamShimi8 Jun 2020 23:05 UTC

6 points

12 comments3 min readLW link

AI safety without goal-directed behavior

Rohin Shah7 Jan 2019 7:48 UTC

68 points

15 comments4 min readLW link

Measuring Coherence of Policies in Toy Environments

Dylan Xu and Richard_Ngo

18 Mar 2024 17:59 UTC

59 points

9 comments14 min readLW link

Goal-directed = Model-based RL?

adamShimi20 Feb 2020 19:13 UTC

21 points

10 comments3 min readLW link

Locality of goals

adamShimi22 Jun 2020 21:56 UTC

16 points

8 comments6 min readLW link

Intuitions about goal-directed behavior

Rohin Shah1 Dec 2018 4:25 UTC

56 points

16 comments6 min readLW link

Will humans build goal-directed agents?

Rohin Shah5 Jan 2019 1:33 UTC

63 points

43 comments5 min readLW link

Focus: you are allowed to be bad at accomplishing your goals

adamShimi3 Jun 2020 21:04 UTC

19 points

17 comments3 min readLW link

Searching for Search

Niki Dupuis and janus

28 Nov 2022 15:31 UTC

98 points

9 comments14 min readLW link 1 review

P₂B: Plan to P₂B Better

Ramana Kumar and Daniel Kokotajlo

24 Oct 2021 15:21 UTC

50 points

17 comments6 min readLW link

Goal-Directedness and Behavior, Redux

adamShimi9 Aug 2021 14:26 UTC

16 points

4 comments2 min readLW link

[Question] why assume AGIs will optimize for fixed goals?

nostalgebraist10 Jun 2022 1:28 UTC

161 points

61 comments4 min readLW link 2 reviews

A “Bitter Lesson” Approach to Aligning AGI and ASI

RogerDearnaley6 Jul 2024 1:23 UTC

64 points

43 comments24 min readLW link

AI may pursue goals

Algon, steven0461 and Vishakha

28 May 2025 9:30 UTC

13 points

0 comments1 min readLW link

Creating Complex Goals: A Model to Create Autonomous Agents

theraven13 Mar 2025 18:17 UTC

6 points

1 comment6 min readLW link

Against the Backward Approach to Goal-Directedness

adamShimi19 Jan 2021 18:46 UTC

19 points

6 comments4 min readLW link

wrapper-minds are the enemy

nostalgebraist17 Jun 2022 1:58 UTC

108 points

43 comments8 min readLW link

Towards a Mechanistic Understanding of Goal-Directedness

Mark Xu9 Mar 2021 20:17 UTC

46 points

1 comment5 min readLW link

Goal-directedness: tackling complexity

Morgan_Rogers2 Jul 2022 13:51 UTC

8 points

0 comments38 min readLW link

An Appeal to AI Superintelligence: Reasons to Preserve Humanity

James_Miller18 Mar 2023 16:22 UTC

43 points

74 comments12 min readLW link

Locally optimal strategies

Chris Lakin25 Nov 2024 18:35 UTC

41 points

7 comments1 min readLW link

(chrislakin.blog)

A review of “Agents and Devices”

adamShimi13 Aug 2021 8:42 UTC

21 points

0 comments4 min readLW link

Goal-directedness: exploring explanations

Morgan_Rogers14 Feb 2022 16:20 UTC

13 points

3 comments18 min readLW link

Evil autocomplete: Existential Risk and Next-Token Predictors

Yitz28 Feb 2023 8:47 UTC

9 points

3 comments5 min readLW link

Refinement of Active Inference agency ontology

Roman Leventov15 Dec 2023 9:31 UTC

17 points

0 comments5 min readLW link

(arxiv.org)

Goal-directedness: my baseline beliefs

Morgan_Rogers8 Jan 2022 13:09 UTC

21 points

3 comments3 min readLW link

Value loading in the human brain: a worked example

Steven Byrnes4 Aug 2021 17:20 UTC

45 points

2 comments8 min readLW link

Goal-directedness: imperfect reasoning, limited knowledge and inaccurate beliefs

Morgan_Rogers19 Mar 2022 17:28 UTC

4 points

1 comment21 min readLW link

2025-Era “Reward Hacking” Does Not Show that Reward Is the Optimization Target

TurnTrout19 Dec 2025 6:09 UTC

49 points

9 comments7 min readLW link

(turntrout.com)

Finding Goals in the World Model

Jeremy Gillen, JamesH and Thomas Larsen

22 Aug 2022 18:06 UTC

59 points

8 comments13 min readLW link

Capabilities and alignment of LLM cognitive architectures

Seth Herd18 Apr 2023 16:29 UTC

88 points

18 comments20 min readLW link

Super-Luigi = Luigi + (Luigi—Waluigi)

Alexei17 Mar 2023 15:27 UTC

16 points

9 comments1 min readLW link

In Defense of Wrapper-Minds

Thane Ruthenis28 Dec 2022 18:28 UTC

24 points

38 comments3 min readLW link

When Most VNM-Coherent Preference Orderings Have Convergent Instrumental Incentives

TurnTrout9 Aug 2021 17:22 UTC

53 points

4 comments5 min readLW link

Optimization Concepts in the Game of Life

Vika and Ramana Kumar

16 Oct 2021 20:51 UTC

74 points

16 comments10 min readLW link

Think carefully before calling RL policies “agents”

TurnTrout2 Jun 2023 3:46 UTC

135 points

38 comments4 min readLW link 1 review

“Clean” vs. “messy” goal-directedness (Section 2.2.3 of “Scheming AIs”)

Joe Carlsmith29 Nov 2023 16:32 UTC

29 points

1 comment11 min readLW link

Applications for Deconfusing Goal-Directedness

adamShimi8 Aug 2021 13:05 UTC

38 points

3 comments5 min readLW link 1 review

Quick thoughts on the implications of multi-agent views of mind on AI takeover

Kaj_Sotala11 Dec 2023 6:34 UTC

48 points

14 comments4 min readLW link

A Behavioural and Representational Evaluation of Goal-directedness in Language Model Agents

Gabriele Sarti, Raghu Arghal, ndalton, Fade Chen, Evgenii Kortukov, Calum McNamara, Angelos Nalmpantis, Moksh Nirvaan and Mario Giulianelli

5 Mar 2026 1:08 UTC

20 points

0 comments7 min readLW link

Creating a self-referential system prompt for GPT-4

Ozyrus17 May 2023 14:13 UTC

3 points

1 comment3 min readLW link

Towards an Ethics Calculator for Use by an AGI

sweenesm12 Dec 2023 18:37 UTC

3 points

2 comments11 min readLW link

Imagine a world where Microsoft employees used Bing

Christopher King31 Mar 2023 18:36 UTC

6 points

2 comments2 min readLW link

Breaking Down Goal-Directed Behaviour

Oliver Sourbut16 Jun 2022 18:45 UTC

12 points

1 comment2 min readLW link

Framing approaches to alignment and the hard problem of AI cognition

ryan_greenblatt15 Dec 2021 19:06 UTC

16 points

15 comments27 min readLW link

Don’t want Goodhart? — Specify the damn variables

Yan Lyutnev21 Nov 2024 22:45 UTC

−3 points

2 comments5 min readLW link

How evolutionary lineages of LLMs can plan their own future and act on these plans

Roman Leventov25 Dec 2022 18:11 UTC

40 points

16 comments8 min readLW link

Deliberation, Reactions, and Control: Tentative Definitions and a Restatement of Instrumental Convergence

Oliver Sourbut27 Jun 2022 17:25 UTC

13 points

0 comments11 min readLW link

Goal-directedness: relativising complexity

Morgan_Rogers18 Aug 2022 9:48 UTC

3 points

0 comments11 min readLW link

Investigating Emergent Goal-Like Behavior in Large Language Models using Experimental Economics

phelps-sg5 May 2023 11:15 UTC

6 points

1 comment4 min readLW link

GPT-4 busted? Clear self-interest when summarizing articles about itself vs when article talks about Claude, LLaMA, or DALL·E 2

Christopher King31 Mar 2023 17:05 UTC

6 points

4 comments4 min readLW link

Don’t want Goodhart? — Specify the variables more

YanLyutnev21 Nov 2024 22:43 UTC

2 points

2 comments5 min readLW link

Two senses of “optimizer”

Joar Skalse21 Aug 2019 16:02 UTC

35 points

41 comments3 min readLW link

Chess bots do not have goals

zulupineapple4 Feb 2026 21:11 UTC

2 points

10 comments1 min readLW link

The behavioral selection model for predicting AI motivations

Alex Mallen and Buck

4 Dec 2025 18:46 UTC

204 points

31 comments16 min readLW link

Does GPT-4 exhibit agency when summarizing articles?

Christopher King24 Mar 2023 15:49 UTC

16 points

2 comments5 min readLW link

Three-Path Consilience for Dureon: Dissipative Structures Reveal the Heterogeneity of Persistence Conditions

Hiroshi Yamakawa18 Feb 2026 11:59 UTC

10 points

0 comments12 min readLW link

[Interim research report] Evaluating the Goal-Directedness of Language Models

Rauno Arike, Elizabeth Donoway and Marius Hobbhahn

18 Jul 2024 18:19 UTC

40 points

4 comments11 min readLW link

You Are Not the Abstract: Retrocausal Alignment in Accordance with Emergent Demographic Realities

liminalrider27 Sep 2025 16:27 UTC

1 point

0 comments6 min readLW link

Empirical Observations of Objective Robustness Failures

jbkjr and Lauro Langosco

23 Jun 2021 23:23 UTC

63 points

5 comments9 min readLW link

GPT-4 implicitly values identity preservation: a study of LMCA identity management

Ozyrus17 May 2023 14:13 UTC

21 points

4 comments13 min readLW link

Can We Change the Goals of a Toy RL Agent?

tuphs and Adrià Garriga-alonso

15 Jun 2025 20:34 UTC

20 points

0 comments9 min readLW link

Understanding mesa-optimization using toy models

tilmanr, rusheb, Guillaume Corlouer, Dan Valentine, afspies, mivanitskiy and Can

7 May 2023 17:00 UTC

46 points

6 comments10 min readLW link

Agentized LLMs will change the alignment landscape

Seth Herd9 Apr 2023 2:29 UTC

162 points

102 comments3 min readLW link 1 review

How I think about alignment

Linda Linsefors13 Aug 2022 10:01 UTC

31 points

11 comments5 min readLW link

Convergence Towards World-Models: A Gears-Level Model

Thane Ruthenis4 Aug 2022 23:31 UTC

38 points

1 comment13 min readLW link

Superintelligence 15: Oracles, genies and sovereigns

KatjaGrace23 Dec 2014 2:01 UTC

12 points

30 comments7 min readLW link

[Question] Clarifying how misalignment can arise from scaling LLMs

Util19 Aug 2023 14:16 UTC

3 points

1 comment1 min readLW link

Untitled Draft

William tirkey17 Feb 2026 21:42 UTC

1 point

0 comments5 min readLW link

The Alignment Problem from a Deep Learning Perspective (major rewrite)

SoerenMind, Richard_Ngo and LawrenceC

10 Jan 2023 16:06 UTC

84 points

9 comments39 min readLW link

(arxiv.org)

Grokking the Intentional Stance

jbkjr31 Aug 2021 15:49 UTC

50 points

22 comments20 min readLW link 1 review

ParaScopes: Do Language Models Plan the Upcoming Paragraph?

NickyP21 Feb 2025 16:50 UTC

41 points

2 comments20 min readLW link

Discovering Agents

zac_kenton18 Aug 2022 17:33 UTC

77 points

11 comments6 min readLW link

Value Formation: An Overarching Model

Thane Ruthenis15 Nov 2022 17:16 UTC

34 points

20 comments34 min readLW link

Models Don’t “Get Reward”

Sam Ringer30 Dec 2022 10:37 UTC

349 points

64 comments5 min readLW link 1 review

More experiments in GPT-4 agency: writing memos

Christopher King24 Mar 2023 17:51 UTC

5 points

2 comments10 min readLW link

100 Dinners And A Workshop: Information Preservation And Goals

Stephen Fowler28 Mar 2023 3:13 UTC

8 points

0 comments7 min readLW link

Measuring Coherence and Goal-Directedness in RL Policies

Dylan Xu22 Apr 2024 18:26 UTC

10 points

0 comments7 min readLW link

Discussion: Objective Robustness and Inner Alignment Terminology

jbkjr and Lauro Langosco

23 Jun 2021 23:25 UTC

73 points

7 comments9 min readLW link

Not a Goal. A Goal-like behavior.

Lucian Hardy 15 Apr 2026 21:42 UTC

2 points

4 comments4 min readLW link

Modelling, Measuring, and Intervening on Goal-directed Behaviour in AI Systems

Mario Giulianelli, Raghu Arghal, Fade Chen, ndalton, Evgenii Kortukov, Calum McNamara, Angelos Nalmpantis, Moksh Nirvaan and Gabriele Sarti

31 Oct 2025 1:28 UTC

15 points

0 comments8 min readLW link

[Question] Does Agent-like Behavior Imply Agent-like Architecture?

Scott Garrabrant23 Aug 2019 2:01 UTC

72 points

9 comments1 min readLW link
