Goal-Directedness

Tag. Last edit: 4 Jan 2023 3:03 UTC by Daniel_Eth

Goal-directedness is the property of a system that aims at some goal. The concept still lacks a formalization, but it may prove important in deciding which kinds of AI to try to align.

A goal may be defined as a world-state that an agent tries to achieve. Goal-directed agents may generate internal representations of desired end states, compare them against their internal representation of the current state of the world, and formulate plans for navigating from the latter to the former.

The goal-generating function may be derived from a pre-programmed lookup table (for simple worlds), from directly inverting the agent’s utility function (for simple utility functions), or it may be learned through experience mapping states to rewards and predicting which states will produce the largest rewards. The plan-generating algorithm could range from shortest-path algorithms like A* or Dijkstra’s algorithm (for fully-representable world graphs), to policy functions that learn through RL which actions bring the current state closer to the goal state (for simple AI), to some combination or extrapolation (for more advanced AI).
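As a rough illustration of the explicit case, the sketch below pairs a goal-generating step with a plan-generating step. It assumes a tiny, fully representable world graph and a utility table; `WORLD`, `UTILITY`, `generate_goal`, and `plan` are hypothetical names invented for this example. The goal is taken to be the utility-maximizing state, and the plan is the cheapest path to it found with Dijkstra's algorithm.

```python
import heapq

# Hypothetical world graph: state -> {neighbor: step cost}.
WORLD = {
    "A": {"B": 1, "C": 4},
    "B": {"C": 1, "D": 5},
    "C": {"D": 1},
    "D": {"E": 3},
    "E": {},
}

# Hypothetical utility function over states.
UTILITY = {"A": 0.0, "B": 0.2, "C": 0.1, "D": 1.0, "E": 0.5}


def generate_goal(utility):
    """Goal generation for a simple utility function: pick the argmax state."""
    return max(utility, key=utility.get)


def plan(world, start, goal):
    """Dijkstra's algorithm: cheapest path from start to goal, or None."""
    frontier = [(0, start, [start])]  # (cost so far, state, path taken)
    visited = set()
    while frontier:
        cost, state, path = heapq.heappop(frontier)
        if state == goal:
            return path
        if state in visited:
            continue
        visited.add(state)
        for nxt, step_cost in world[state].items():
            if nxt not in visited:
                heapq.heappush(frontier, (cost + step_cost, nxt, path + [nxt]))
    return None  # goal unreachable from start


goal = generate_goal(UTILITY)   # desired end state
path = plan(WORLD, "A", goal)   # route from the current state "A" to the goal
print(goal, path)
```

Here the goal and the plan are both explicit data structures that can be inspected, which is what distinguishes this kind of agent from the implicit case.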

Implicit goal-directedness may come about in agents that do not have explicit internal representations of goals but that nevertheless learn or enact policies that cause the environment to converge on a certain state or set of states. Such implicit goal-directedness may arise, for instance, in simple reinforcement learning agents, which learn a policy function $\pi : S \to A$ that maps states directly to actions.
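A minimal sketch of the implicit case, under stated assumptions: a hypothetical five-state corridor with a reward at the right end, and tabular Q-learning updates applied in a fixed sweep over all state-action pairs (a stand-in for exploration, chosen so the run is reproducible). The names `step`, `train`, `policy`, and `rollout` are invented for this example. The agent stores only a table of action values, and nothing in it names a goal state.

```python
# Hypothetical corridor: states 0..4; state 4 is terminal and rewarding.
N_STATES = 5
ACTIONS = (-1, +1)         # step left, step right
ALPHA, GAMMA = 0.5, 0.9    # learning rate and discount (illustrative values)


def step(state, action):
    """Deterministic environment: move, clip to the corridor, reward 1 on reaching state 4."""
    nxt = min(max(state + action, 0), N_STATES - 1)
    reward = 1.0 if nxt == N_STATES - 1 else 0.0
    return nxt, reward


def train(sweeps=200):
    """Apply the Q-learning update to every non-terminal (state, action) pair, repeatedly."""
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(sweeps):
        for s in range(N_STATES - 1):
            for a_idx, a in enumerate(ACTIONS):
                nxt, r = step(s, a)
                target = r + GAMMA * max(q[nxt])
                q[s][a_idx] += ALPHA * (target - q[s][a_idx])
    return q


def policy(q, state):
    """pi : S -> A as a direct lookup: the action with the highest learned value."""
    best = max(range(len(ACTIONS)), key=lambda i: q[state][i])
    return ACTIONS[best]


def rollout(q, start, max_steps=20):
    """Follow the learned policy from start and return the state where it ends up."""
    state = start
    for _ in range(max_steps):
        if state == N_STATES - 1:
            break
        state, _reward = step(state, policy(q, state))
    return state


q = train()
print([policy(q, s) for s in range(N_STATES - 1)])  # learned action in each non-terminal state
print([rollout(q, s) for s in range(N_STATES)])     # final state reached from each start
```

Every rollout ends in the same state even though the policy never represents that state as a target, which is the sense in which the goal-directedness is implicit.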

Literature Review on Goal-Directedness

adamShimi, Michele Campolo and Joe_Collman

18 Jan 2021 11:15 UTC

80 points

21 comments, 31 min read, LW link

Coherence arguments do not entail goal-directed behavior

Rohin Shah

3 Dec 2018 3:26 UTC

129 points

69 comments, 7 min read, LW link, 3 reviews

FAQ: What the heck is goal agnosticism?

porby

8 Oct 2023 19:11 UTC

66 points

36 comments, 28 min read, LW link

Behavioral Sufficient Statistics for Goal-Directedness

adamShimi

11 Mar 2021 15:01 UTC

21 points

12 comments, 9 min read, LW link

Goal-directed = Model-based RL?

adamShimi

20 Feb 2020 19:13 UTC

21 points

10 comments, 3 min read, LW link

Will humans build goal-directed agents?

Rohin Shah

5 Jan 2019 1:33 UTC

61 points

43 comments, 5 min read, LW link

Intuitions about goal-directed behavior

Rohin Shah

1 Dec 2018 4:25 UTC

54 points

15 comments, 6 min read, LW link

Measuring Coherence of Policies in Toy Environments

dx26 and Richard_Ngo

18 Mar 2024 17:59 UTC

59 points

9 comments, 14 min read, LW link

Goal-directedness is behavioral, not structural

adamShimi

8 Jun 2020 23:05 UTC

6 points

12 comments, 3 min read, LW link

Focus: you are allowed to be bad at accomplishing your goals

adamShimi

3 Jun 2020 21:04 UTC

19 points

17 comments, 3 min read, LW link

AI safety without goal-directed behavior

Rohin Shah

7 Jan 2019 7:48 UTC

68 points

15 comments, 4 min read, LW link

Goal-Directedness: What Success Looks Like

adamShimi

16 Aug 2020 18:33 UTC

9 points

0 comments, 2 min read, LW link

Deliberation Everywhere: Simple Examples

Oliver Sourbut

27 Jun 2022 17:26 UTC

27 points

3 comments, 15 min read, LW link

Goals and short descriptions

Michele Campolo

2 Jul 2020 17:41 UTC

14 points

8 comments, 5 min read, LW link

Locality of goals

adamShimi

22 Jun 2020 21:56 UTC

16 points

8 comments, 6 min read, LW link

Goal-Directedness and Behavior, Redux

adamShimi

9 Aug 2021 14:26 UTC

16 points

4 comments, 2 min read, LW link

Searching for Search

NicholasKees and janus

28 Nov 2022 15:31 UTC

92 points

8 comments, 14 min read, LW link, 1 review

P₂B: Plan to P₂B Better

Ramana Kumar and Daniel Kokotajlo

24 Oct 2021 15:21 UTC

38 points

17 comments, 6 min read, LW link

An Appeal to AI Superintelligence: Reasons to Preserve Humanity

James_Miller

18 Mar 2023 16:22 UTC

39 points

73 comments, 12 min read, LW link

Refinement of Active Inference agency ontology

Roman Leventov

15 Dec 2023 9:31 UTC

16 points

0 comments, 5 min read, LW link

(arxiv.org)

A “Bitter Lesson” Approach to Aligning AGI and ASI

RogerDearnaley

6 Jul 2024 1:23 UTC

52 points

38 comments, 24 min read, LW link

Capabilities and alignment of LLM cognitive architectures

Seth Herd

18 Apr 2023 16:29 UTC

83 points

18 comments, 20 min read, LW link

“Clean” vs. “messy” goal-directedness (Section 2.2.3 of “Scheming AIs”)

Joe Carlsmith

29 Nov 2023 16:32 UTC

29 points

1 comment, 11 min read, LW link

Think carefully before calling RL policies “agents”

TurnTrout

2 Jun 2023 3:46 UTC

127 points

36 comments, 4 min read, LW link

Against the Backward Approach to Goal-Directedness

adamShimi

19 Jan 2021 18:46 UTC

19 points

6 comments, 4 min read, LW link

Towards a Mechanistic Understanding of Goal-Directedness

Mark Xu

9 Mar 2021 20:17 UTC

45 points

1 comment, 5 min read, LW link

Value loading in the human brain: a worked example

Steven Byrnes

4 Aug 2021 17:20 UTC

45 points

2 comments, 8 min read, LW link

When Most VNM-Coherent Preference Orderings Have Convergent Instrumental Incentives

TurnTrout

9 Aug 2021 17:22 UTC

53 points

4 comments, 5 min read, LW link

Applications for Deconfusing Goal-Directedness

adamShimi

8 Aug 2021 13:05 UTC

38 points

3 comments, 5 min read, LW link, 1 review

A review of “Agents and Devices”

adamShimi

13 Aug 2021 8:42 UTC

21 points

0 comments, 4 min read, LW link

Optimization Concepts in the Game of Life

Vika and Ramana Kumar

16 Oct 2021 20:51 UTC

74 points

16 comments, 11 min read, LW link

Goal-directedness: my baseline beliefs

Morgan_Rogers

8 Jan 2022 13:09 UTC

21 points

3 comments, 3 min read, LW link

Goal-directedness: exploring explanations

Morgan_Rogers

14 Feb 2022 16:20 UTC

13 points

3 comments, 18 min read, LW link

Goal-directedness: imperfect reasoning, limited knowledge and inaccurate beliefs

Morgan_Rogers

19 Mar 2022 17:28 UTC

4 points

1 comment, 21 min read, LW link

[Question] why assume AGIs will optimize for fixed goals?

nostalgebraist

10 Jun 2022 1:28 UTC

137 points

55 comments, 4 min read, LW link, 2 reviews

wrapper-minds are the enemy

nostalgebraist

17 Jun 2022 1:58 UTC

100 points

41 comments, 8 min read, LW link

Goal-directedness: tackling complexity

Morgan_Rogers

2 Jul 2022 13:51 UTC

8 points

0 comments, 38 min read, LW link

Finding Goals in the World Model

Jeremy Gillen, JamesH and Thomas Larsen

22 Aug 2022 18:06 UTC

59 points

8 comments, 13 min read, LW link

In Defense of Wrapper-Minds

Thane Ruthenis

28 Dec 2022 18:28 UTC

23 points

38 comments, 3 min read, LW link

Evil autocomplete: Existential Risk and Next-Token Predictors

Yitz

28 Feb 2023 8:47 UTC

9 points

3 comments, 5 min read, LW link

Super-Luigi = Luigi + (Luigi—Waluigi)

Alexei

17 Mar 2023 15:27 UTC

16 points

9 comments, 1 min read, LW link

Quick thoughts on the implications of multi-agent views of mind on AI takeover

Kaj_Sotala

11 Dec 2023 6:34 UTC

41 points

14 comments, 4 min read, LW link

Measuring Coherence and Goal-Directedness in RL Policies

dx26

22 Apr 2024 18:26 UTC

3 points

0 comments, 7 min read, LW link

A thought experiment to help persuade skeptics that power-seeking AI is plausible

jacobcd52

25 Nov 2023 23:26 UTC

1 point

4 comments, 5 min read, LW link

The Waluigi Effect (mega-post)

Cleo Nardo

3 Mar 2023 3:22 UTC

626 points

187 comments, 16 min read, LW link

Understanding mesa-optimization using toy models

tilmanr, rusheb, Guillaume Corlouer, Dan Valentine, afspies, mivanitskiy and Can

7 May 2023 17:00 UTC

42 points

2 comments, 10 min read, LW link

The Alignment Problem from a Deep Learning Perspective (major rewrite)

SoerenMind, Richard_Ngo and LawrenceC

10 Jan 2023 16:06 UTC

84 points

8 comments, 39 min read, LW link

(arxiv.org)

Discussion: Objective Robustness and Inner Alignment Terminology

jbkjr and Lauro Langosco

23 Jun 2021 23:25 UTC

73 points

7 comments, 9 min read, LW link

Empirical Observations of Objective Robustness Failures

jbkjr and Lauro Langosco

23 Jun 2021 23:23 UTC

63 points

5 comments, 9 min read, LW link

Convergence Towards World-Models: A Gears-Level Model

Thane Ruthenis

4 Aug 2022 23:31 UTC

38 points

1 comment, 13 min read, LW link

How I think about alignment

Linda Linsefors

13 Aug 2022 10:01 UTC

31 points

11 comments, 5 min read, LW link

Discovering Agents

zac_kenton

18 Aug 2022 17:33 UTC

73 points

11 comments, 6 min read, LW link

Goal-directedness: relativising complexity

Morgan_Rogers

18 Aug 2022 9:48 UTC

3 points

0 comments, 11 min read, LW link

Grokking the Intentional Stance

jbkjr

31 Aug 2021 15:49 UTC

45 points

22 comments, 20 min read, LW link, 1 review

[Question] Does Agent-like Behavior Imply Agent-like Architecture?

Scott Garrabrant

23 Aug 2019 2:01 UTC

58 points

8 comments, 1 min read, LW link

Framing approaches to alignment and the hard problem of AI cognition

ryan_greenblatt

15 Dec 2021 19:06 UTC

16 points

15 comments, 27 min read, LW link

How evolutionary lineages of LLMs can plan their own future and act on these plans

Roman Leventov

25 Dec 2022 18:11 UTC

39 points

16 comments, 8 min read, LW link

Two senses of “optimizer”

Joar Skalse

21 Aug 2019 16:02 UTC

35 points

41 comments, 3 min read, LW link

Agentized LLMs will change the alignment landscape

Seth Herd

9 Apr 2023 2:29 UTC

153 points

96 comments, 3 min read, LW link

Value Formation: An Overarching Model

Thane Ruthenis

15 Nov 2022 17:16 UTC

34 points

20 comments, 34 min read, LW link

Breaking Down Goal-Directed Behaviour

Oliver Sourbut

16 Jun 2022 18:45 UTC

11 points

1 comment, 2 min read, LW link

Towards an Ethics Calculator for Use by an AGI

sweenesm

12 Dec 2023 18:37 UTC

3 points

2 comments, 11 min read, LW link

Investigating Emergent Goal-Like Behavior in Large Language Models using Experimental Economics

phelps-sg

5 May 2023 11:15 UTC

6 points

1 comment, 4 min read, LW link

GPT-4 implicitly values identity preservation: a study of LMCA identity management

Ozyrus

17 May 2023 14:13 UTC

21 points

4 comments, 13 min read, LW link

Creating a self-referential system prompt for GPT-4

Ozyrus

17 May 2023 14:13 UTC

3 points

1 comment, 3 min read, LW link

Imagine a world where Microsoft employees used Bing

Christopher King

31 Mar 2023 18:36 UTC

6 points

2 comments, 2 min read, LW link

GPT-4 busted? Clear self-interest when summarizing articles about itself vs when article talks about Claude, LLaMA, or DALL·E 2

Christopher King

31 Mar 2023 17:05 UTC

6 points

4 comments, 4 min read, LW link

100 Dinners And A Workshop: Information Preservation And Goals

Stephen Fowler

28 Mar 2023 3:13 UTC

8 points

0 comments, 7 min read, LW link

Does GPT-4 exhibit agency when summarizing articles?

Christopher King

24 Mar 2023 15:49 UTC

16 points

2 comments, 5 min read, LW link

More experiments in GPT-4 agency: writing memos

Christopher King

24 Mar 2023 17:51 UTC

5 points

2 comments, 10 min read, LW link

[Interim research report] Evaluating the Goal-Directedness of Language Models

Rauno Arike, Elizabeth Donoway and Marius Hobbhahn

18 Jul 2024 18:19 UTC

29 points

0 comments, 11 min read, LW link

Psychological issues often have an immediate payoff

Chipmonk

10 Jun 2024 23:39 UTC

23 points

2 comments, 4 min read, LW link

(chrislakin.blog)

Deliberation, Reactions, and Control: Tentative Definitions and a Restatement of Instrumental Convergence

Oliver Sourbut

27 Jun 2022 17:25 UTC

12 points

0 comments, 11 min read, LW link

Superintelligence 15: Oracles, genies and sovereigns

KatjaGrace

23 Dec 2014 2:01 UTC

11 points

30 comments, 7 min read, LW link

[Question] Clarifying how misalignment can arise from scaling LLMs

Util

19 Aug 2023 14:16 UTC

3 points

1 comment, 1 min read, LW link
