
Complexity of value

Last edit: 14 Apr 2016 3:17 UTC by Eliezer Yudkowsky

Introduction

“Complexity of value” is the idea that if you tried to write an AI that would do right things (or maximally right things, or adequately right things) without further looking at humans (so the AI can’t take in a flood of additional data from human advice; it has to be complete as it stands once you’re finished creating it), the AI’s preferences or utility function would need to contain a large amount of data (algorithmic complexity). Conversely, if you try to write an AI that directly wants simple things, or try to specify the AI’s preferences using a small amount of data or code, it won’t do acceptably right things in our universe.

Complexity of value says, “There’s no simple and non-meta solution to AI preferences” or “The things we want AIs to want are complicated in the Kolmogorov-complexity sense” or “Any simple goal you try to describe that is All We Need To Program Into AIs is almost certainly wrong.”

Complexity of value is a further idea above and beyond the orthogonality thesis, which states that AIs don’t automatically do the right thing and that we can have, e.g., paperclip maximizers. Even if we accept that paperclip maximizers are possible, simple, and nonforced, this wouldn’t yet imply that it’s very difficult to make AIs that do the right thing. If the right thing is very simple to encode—if there are value optimizers that are scarcely more complex than diamond maximizers—then it might not be especially hard to build a nice AI even if not all AIs are nice. Complexity of Value is the further proposition that says no, this is foreseeably quite hard—not because AIs have ‘natural’ anti-nice desires, but because niceness requires a lot of work to specify.

Frankena’s list

As an intuition pump for the complexity of value thesis, consider William Frankena’s list of things which many cultures and people seem to value (for their own sake rather than their external consequences):

“Life, consciousness, and activity; health and strength; pleasures and satisfactions of all or certain kinds; happiness, beatitude, contentment, etc.; truth; knowledge and true opinions of various kinds, understanding, wisdom; beauty, harmony, proportion in objects contemplated; aesthetic experience; morally good dispositions or virtues; mutual affection, love, friendship, cooperation; just distribution of goods and evils; harmony and proportion in one’s own life; power and experiences of achievement; self-expression; freedom; peace, security; adventure and novelty; and good reputation, honor, esteem, etc.”

When we try to list out properties of a human or galactic future that seem like they’d be very nice, we at least seem to value a fair number of things that aren’t reducible to each other. (What initially look like plausible “But you do A to get B” arguments usually fall apart when we look for third alternatives to doing A to get B. Marginally adding some freedom can marginally increase the happiness of a human, so a happiness optimizer that can only exert a small push toward freedom might choose to do so. That doesn’t mean that a pure, powerful happiness maximizer would instrumentally optimize freedom. If an agent cares about happiness but not freedom, the outcome that maximizes its preferences is a large number of brains set to maximum happiness. When we don’t just seize on one possible case where a B-optimizer might use A as a strategy, but instead look for further C-strategies that might maximize B even better than A, the attempt to reduce A to an instrumental B-maximization strategy often falls apart.) It’s in this sense that the items on Frankena’s list don’t seem to reduce to each other as a matter of pure preference, even though humans in everyday life often pursue several of these goals at the same time.
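
As a minimal sketch of that argument in code (every outcome and number below is invented purely for illustration; the labels A, B, and C match their usage in the paragraph above): a happiness-only optimizer nudges toward freedom only while freedom is its best available lever for happiness, and abandons it once a more direct C-strategy becomes reachable.

```python
# Every outcome and number here is invented for illustration; A, B, and C
# label the roles described in the paragraph above (A = freedom, B = happiness,
# C = a strategy that maximizes B better than A does).

outcomes = {
    # name: (total_happiness, freedom)
    "status quo":                      (1.0, 0.5),
    "marginally freer society":        (1.1, 0.7),  # A as a means to B
    "brains set to maximum happiness": (9.9, 0.0),  # C: maximizes B directly
}

def happiness_only(outcome):
    happiness, _freedom = outcome
    return happiness  # freedom contributes nothing terminally

# A weak optimizer that can only exert a small push from the status quo:
reachable = ["status quo", "marginally freer society"]
print(max(reachable, key=lambda name: happiness_only(outcomes[name])))
# -> marginally freer society

# A powerful optimizer that chooses over everything it can bring about:
print(max(outcomes, key=lambda name: happiness_only(outcomes[name])))
# -> brains set to maximum happiness
```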

Complexity of value says that, in this case, the way things seem is the way they are: Frankena’s list is not encodable in one page of Python code. This proposition can’t be established definitively without settling on a sufficiently well-specified metaethics, such as reflective equilibrium, to make it clear that there is indeed no a priori reason for normativity to be algorithmically simple. But the basic intuition for Complexity of Value is provided just by the fact that Frankena’s list is more than one item long, and that many of its individual terms don’t seem likely to have algorithmically simple definitions distinguishing their valuable from non-valuable forms.

Lack of a central core

We can understand the idea of complexity of value by contrasting it to the situation with respect to epistemic reasoning aka truth-finding or answering simple factual questions about the world. In an ideal sense, we can try to compress and reduce the idea of mapping the world well down to algorithmically simple notions like “Occam’s Razor” and “Bayesian updating”. In a practical sense, natural selection, in the course of optimizing humans to solve factual questions like “Where can I find a tree with fruit?” or “Are brightly colored snakes usually poisonous?” or “Who’s plotting against me?”, ended up with enough of the central core of epistemology that humans were later able to answer questions like “How are the planets moving?” or “What happens if I fire this rocket?”, even though humans hadn’t been explicitly selected on to answer those exact questions.

Because epistemology does have a central core of simplicity and Bayesian updating, selecting for an organism that got some pretty complicated epistemic questions right enough to reproduce also caused that organism to start understanding things like General Relativity. When it comes to truth-finding, we’d expect by default the same thing to be true of an Artificial Intelligence: if you build it to get epistemically correct answers on lots of widely different problems, it will contain a core of truth-finding and start getting epistemically correct answers on lots of other problems—even problems completely different from your training set, the way that humans’ understanding of General Relativity wasn’t like any hunter-gatherer problem.
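
A minimal sketch of that “central core” in code (the hypotheses, their toy description lengths, and the data below are all assumptions made up for this illustration): a simplicity-weighted prior plus Bayesian updating picks out the generating rule from a short data sample and keeps predicting correctly on continuations it was never selected on.

```python
import math

# Occam prior (weight 2**-description_length) plus Bayesian updating over a
# handful of toy hypotheses about a binary sequence. Each predictor returns
# P(next bit = 1) given the bits seen so far.

def pred_always_zero(history):
    return 0.01

def pred_always_one(history):
    return 0.99

def pred_alternating(history):
    if not history:
        return 0.5
    return 0.99 if history[-1] == 0 else 0.01

def pred_fair_coin(history):
    return 0.5

hypotheses = {
    # name: (toy description length in bits, predictor)
    "always-0":    (2, pred_always_zero),
    "always-1":    (2, pred_always_one),
    "alternating": (4, pred_alternating),
    "fair-coin":   (1, pred_fair_coin),
}

def log_posterior(data):
    """Unnormalized log-posterior: simplicity prior plus data likelihood."""
    scores = {}
    for name, (complexity, predict) in hypotheses.items():
        logp = -complexity * math.log(2)  # Occam: prior weight 2**-complexity
        for i, bit in enumerate(data):
            p_one = predict(data[:i])
            logp += math.log(p_one if bit == 1 else 1.0 - p_one)
        scores[name] = logp
    return scores

data = [0, 1, 0, 1, 0, 1, 0, 1]
scores = log_posterior(data)
print(max(scores, key=scores.get))  # "alternating" wins despite its longer
                                    # description, and keeps predicting bits
                                    # the selection process never saw.
```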

The complexity of value thesis is that there isn’t a simple core to normativity, which means that if you hone your AI to do normatively good things on A, B, and C and then confront it with a very different problem D, the AI may do the wrong thing on D. There’s a large number of independent ideal “gears” inside the complex machinery of value, compared to epistemology, which in principle might only contain “prefer simpler hypotheses” and “prefer hypotheses that match the evidence”.

The Orthogonality Thesis says that, contrary to the intuition that maximizing paperclips feels “stupid”, you can have arbitrarily cognitively powerful entities that maximize paperclips, or that pursue arbitrarily complicated other goals. So while intuitively you might think it would be simple to avoid paperclip maximizers, requiring no work at all for a sufficiently advanced AI, the Orthogonality Thesis says that things will be more difficult than that: you have to put in some work to have the AI do the right thing.

The Complexity of Value thesis is the next step after Orthogonality. It says that, contrary to the feeling that “rightness ought to be simple, darn it”, normativity turns out not to have an algorithmically simple core, not the way that correctly answering questions of fact has a central tendency that generalizes well. And so, even though an AI trained to do well on problems like steering cars or figuring out General Relativity from scratch may hit on a core capability that leads it to do well on arbitrarily more complicated problems of galactic scale, we can’t rely on getting an equally generous bonanza of generalization from an AI that seems to do well on a small but varied set of moral and ethical problems—it may still fail the next problem that isn’t like anything in the training set. To the extent that we have very strong reasons for prior confidence in Complexity of Value, in fact, we ought to be suspicious and worried about an AI that seems to be pulling correct moral answers from nowhere—it is much more likely to have hit upon the convergent instrumental strategy “say what makes the programmers trust you” than to have hit upon a simple core of all normativity.

Key sub-propositions

Complexity of Value requires Orthogonality, and would be implied by three further sub-propositions:

The intrinsic complexity of value proposition is that the properties we want AIs to achieve (whatever stands in for the metasyntactic variable ‘value’) have a large amount of intrinsic information, in the sense of comprising a large number of independent facts that aren’t being generated by a single computationally simple rule.

A very bad example that may nonetheless provide an important intuition is to imagine trying to pinpoint to an AI what constitutes ‘worthwhile happiness’. The AI suggests a universe tiled with tiny Q-learning algorithms receiving high rewards. After some explanation and several labeled datasets, the AI suggests a human brain with a wire stuck into its pleasure center. After further explanation, the AI suggests a human in a holodeck. You begin talking about the importance of believing truly, and about how your values call for apparent human relationships to be real relationships rather than hallucinations. The AI asks you what constitutes a good human relationship to be happy about. This series of questions occurs because (arguendo) the AI keeps running into questions whose answers are not AI-obvious from the answers already given: each step involves new things you want, whose desirability wasn’t implied by what you had already said. The upshot is that the specification of ‘worthwhile happiness’ involves a long series of facts that aren’t reducible to the previous facts, and some of your preferences may involve many fine details of surprising importance. In other words, hand-coding ‘worthwhile happiness’ into the AI would be at least as hard as hand-coding a formal rule that could recognize which pictures contain cats. (I.e., impossible.)

The second proposition is incompressibility of value, which says that attempts to reduce these complex values to some incredibly simple and elegant principle fail (much like early attempts by e.g. Bentham to reduce all human value to pleasure), and that no simple instruction given to an AI will happen to target outcomes of high value either. The core reason to expect a priori that all such attempts will fail is that most 1000-byte strings aren’t compressible down to some incredibly simple pattern, no matter how many clever tricks you throw at them: fewer than 1 in 1024 such strings can be shortened by even 11 bits, never mind compressed to 10 bytes. Because there are so many different proposals for why some simple instruction to an AI should end up achieving high-value outcomes, or for why all human value reduces to some simple principle, there is no single central demonstration that all these proposals must fail; but there is a sense in which, a priori, we should strongly expect all such clever attempts to fail. Many disagreeable attempts at reducing value A to value B, such as Juergen Schmidhuber’s attempt to reduce all human value to increasing the compression of sensory information, stand as a further cautionary lesson.
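
A quick sanity check on that counting argument (the specific sizes below are arbitrary choices for the demo): at most 2^(n-k+1) - 1 programs are k or more bits shorter than an n-bit string, so fewer than a 2^(1-k) fraction of such strings can be compressed by k bits no matter how clever the compressor, and a real general-purpose compressor indeed fails to shrink random bytes.

```python
import os
import zlib

# (1) Counting bound: there are at most 2**(n-k+1) - 1 binary programs that are
# k or more bits shorter than an n-bit string, so the fraction of n-bit strings
# compressible by at least k bits is below 2**(1-k), for any compressor.
def max_fraction_compressible(k_bits_saved: int) -> float:
    return 2.0 ** (1 - k_bits_saved)

print(max_fraction_compressible(11))   # 1/1024: the bound cited in the text above
print(max_fraction_compressible(80))   # saving 10 bytes: a bound of 2**-79

# (2) Empirical version: a random 1000-byte string does not get shorter under a
# real general-purpose compressor (zlib typically adds a little overhead).
random_bytes = os.urandom(1000)
print(len(zlib.compress(random_bytes)))  # usually slightly more than 1000
```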

The third proposition is fragility of value, which says that if you have a 1000-byte exact specification of worthwhile happiness and you begin to mutate it, the value created by an AI running on the mutated definition falls off rapidly. E.g., an AI with only 950 bytes of the full definition may end up creating 0% of the value rather than 95% of the value. (E.g., the AI understood all aspects of what makes for a life well-lived… except the part about requiring a conscious observer to experience it.)
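
A toy sketch of what that falloff can look like (the outcomes, value terms, and numbers below are all invented for illustration): optimize a proxy utility that is missing a single clause of the “true” specification, and the chosen outcome can carry essentially none of the intended value.

```python
# Proxy utility = true utility with one clause dropped (the requirement that a
# conscious observer be present). Optimizing the proxy picks an outcome that
# scores highest on the proxy and scores zero on the true utility.

candidate_outcomes = [
    # (rich_experiences, relationships, conscious_observers_present)
    (1.0, 1.0, True),    # a flourishing civilization
    (1.0, 0.9, True),    # slightly worse, still fine
    (5.0, 5.0, False),   # vast, intricately detailed, and empty
]

def true_value(outcome):
    experiences, relationships, conscious = outcome
    return (experiences + relationships) if conscious else 0.0

def proxy_value(outcome):
    experiences, relationships, _conscious = outcome  # missing clause
    return experiences + relationships

chosen = max(candidate_outcomes, key=proxy_value)
print(proxy_value(chosen), true_value(chosen))  # 10.0 under the proxy, 0.0 in fact
```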

Together, these propositions would imply that to achieve an adequate amount of value (e.g. 90% of potential value, or even 20% of potential value), there may be no simple hand-coded object-level goal for the AI that results in that value’s realization. E.g., you can’t just tell it to ‘maximize happiness’, with some hand-coded rule for identifying happiness.

Centrality

Complexity of Value is a central proposition in value alignment theory. Many foreseen difficulties revolve around it:

More generally:

Importance

Many policy questions strongly depend on Complexity of Value, mostly having to do with the overall difficulty of developing value-aligned AI, e.g.:

It has been argued that there are psychological biases and popular mistakes leading to beliefs that directly or by implication deny Complexity of Value. To the extent one credits that Complexity of Value is probably true, one should arguably be concerned about the number of early assessments of the value alignment problem that seem to rely on its being false (e.g., assuming we need only hardcode a particular goal into the AI, or more generally treating the value alignment problem as not panic-worthily difficult).

Truth condition

The Complexity of Value proposition is true if, relative to viable and acceptable real-world methodologies for AI development, there isn’t any reliably knowable way to specify the AI’s object-level preferences as a structure of low algorithmic complexity such that running that AI achieves enough of the possible value, for reasonable definitions of value.

Caveats:

Viable and acceptable computation

Suppose there turns out to exist, in principle, a relatively simple Turing machine (e.g. 100 states) that picks out ‘value’ by re-running entire evolutionary histories, creating and discarding a hundred billion sapient races in order to pick out one that ended up relatively similar to humanity. This would use an unrealistically large amount of computing power and also commit an unacceptable amount of mindcrime.

Posts tagged Complexity of value

The Hidden Complexity of Wishes · Eliezer Yudkowsky, 24 Nov 2007 0:12 UTC · 183 points · 199 comments · 8 min read
Utilitarianism and the replaceability of desires and attachments · MichaelStJules, 27 Jul 2024 1:57 UTC · 5 points · 2 comments · 12 min read
Some implications of radical empathy · MichaelStJules, 7 Jan 2025 16:10 UTC · 3 points · 0 comments · 7 min read
Actualism, asymmetry and extinction · MichaelStJules, 7 Jan 2025 16:02 UTC · 1 point · 4 comments · 9 min read
Really radical empathy · MichaelStJules, 6 Jan 2025 17:46 UTC · 19 points · 0 comments · 10 min read
Shard Theory: An Overview · David Udell, 11 Aug 2022 5:44 UTC · 167 points · 34 comments · 10 min read
But exactly how complex and fragile? · KatjaGrace, 3 Nov 2019 18:20 UTC · 87 points · 32 comments · 3 min read · 1 review · (meteuphoric.com)
You don’t know how bad most things are nor precisely how they’re bad. · Solenoid_Entity, 4 Aug 2024 14:12 UTC · 338 points · 49 comments · 5 min read
Review of ‘But exactly how complex and fragile?’ · TurnTrout, 6 Jan 2021 18:39 UTC · 57 points · 0 comments · 8 min read
An even deeper atheism · Joe Carlsmith, 11 Jan 2024 17:28 UTC · 125 points · 47 comments · 15 min read
Value is Fragile · Eliezer Yudkowsky, 29 Jan 2009 8:46 UTC · 177 points · 109 comments · 6 min read
Where does Sonnet 4.5’s desire to “not get too comfortable” come from? · Kaj_Sotala, 4 Oct 2025 10:19 UTC · 103 points · 23 comments · 64 min read
Aggregative Principles of Social Justice · Cleo Nardo, 5 Jun 2024 13:44 UTC · 29 points · 10 comments · 37 min read
Stratified Utopia · Cleo Nardo, 21 Oct 2025 19:09 UTC · 68 points · 8 comments · 11 min read
Reversible changes: consider a bucket of water · Stuart_Armstrong, 26 Aug 2019 22:55 UTC · 25 points · 18 comments · 2 min read
Notes on Caution · David Gross, 1 Dec 2022 3:05 UTC · 14 points · 0 comments · 19 min read
Disentangling arguments for the importance of AI safety · Richard_Ngo, 21 Jan 2019 12:41 UTC · 133 points · 23 comments · 8 min read
High Challenge · Eliezer Yudkowsky, 19 Dec 2008 0:51 UTC · 82 points · 76 comments · 4 min read
General alignment properties · TurnTrout, 8 Aug 2022 23:40 UTC · 51 points · 2 comments · 1 min read
The Gift We Give To Tomorrow · Eliezer Yudkowsky, 17 Jul 2008 6:07 UTC · 165 points · 101 comments · 8 min read
31 Laws of Fun · Eliezer Yudkowsky, 26 Jan 2009 10:13 UTC · 104 points · 36 comments · 8 min read
On Fleshling Safety: A Debate by Klurl and Trapaucius. · Eliezer Yudkowsky, 26 Oct 2025 23:44 UTC · 223 points · 48 comments · 79 min read
Conflating value alignment and intent alignment is causing confusion · Seth Herd, 5 Sep 2024 16:39 UTC · 49 points · 18 comments · 5 min read
Terminal Values and Instrumental Values · Eliezer Yudkowsky, 15 Nov 2007 7:56 UTC · 119 points · 46 comments · 10 min read
Alan Carter on the Complexity of Value · Ghatanathoah, 10 May 2012 7:23 UTC · 47 points · 41 comments · 7 min read
Complexity of Value ≠ Complexity of Outcome · Wei Dai, 30 Jan 2010 2:50 UTC · 65 points · 223 comments · 3 min read
The two-layer model of human values, and problems with synthesizing preferences · Kaj_Sotala, 24 Jan 2020 15:17 UTC · 70 points · 16 comments · 9 min read
Notes on Moderation, Balance, & Harmony · David Gross, 25 Dec 2020 2:44 UTC · 9 points · 1 comment · 8 min read
Beyond algorithmic equivalence: self-modelling · Stuart_Armstrong, 28 Feb 2018 16:55 UTC · 10 points · 3 comments · 1 min read
Would I think for ten thousand years? · Stuart_Armstrong, 11 Feb 2019 19:37 UTC · 28 points · 13 comments · 1 min read
Have you felt exiert yet? · Stuart_Armstrong, 5 Jan 2018 17:03 UTC · 28 points · 7 comments · 1 min read
Bias in rationality is much worse than noise · Stuart_Armstrong, 31 Oct 2017 11:57 UTC · 11 points · 0 comments · 2 min read
2012 Robin Hanson comment on “Intelligence Explosion: Evidence and Import” · Rob Bensinger, 2 Apr 2021 16:26 UTC · 28 points · 4 comments · 3 min read
Our values are underdefined, changeable, and manipulable · Stuart_Armstrong, 2 Nov 2017 11:09 UTC · 26 points · 6 comments · 3 min read
Intent alignment as a stepping-stone to value alignment · Seth Herd, 5 Nov 2024 20:43 UTC · 37 points · 8 comments · 3 min read
Learning societal values from law as part of an AGI alignment strategy · John Nay, 21 Oct 2022 2:03 UTC · 5 points · 18 comments · 54 min read
The Pointer Resolution Problem · Jozdien, 16 Feb 2024 21:25 UTC · 41 points · 20 comments · 3 min read
Why Do We Engage in Moral Simplification? · Wei Dai, 14 Feb 2011 1:16 UTC · 33 points · 36 comments · 2 min read
Alignment allows “nonrobust” decision-influences and doesn’t require robust grading · TurnTrout, 29 Nov 2022 6:23 UTC · 62 points · 41 comments · 15 min read
Sympathetic Minds · Eliezer Yudkowsky, 19 Jan 2009 9:31 UTC · 75 points · 27 comments · 5 min read
Sequence overview: Welfare and moral weights · MichaelStJules, 15 Aug 2024 4:22 UTC · 7 points · 0 comments · 1 min read
Boredom vs. Scope Insensitivity · Wei Dai, 24 Sep 2009 11:45 UTC · 81 points · 41 comments · 3 min read
Don’t want Goodhart? — Specify the variables more · YanLyutnev, 21 Nov 2024 22:43 UTC · 2 points · 2 comments · 5 min read
Babies and Bunnies: A Caution About Evo-Psych · Alicorn, 22 Feb 2010 1:53 UTC · 81 points · 843 comments · 2 min read
Content generation. Where do we draw the line? · Q Home, 9 Aug 2022 10:51 UTC · 6 points · 7 comments · 2 min read
Siren worlds and the perils of over-optimised search · Stuart_Armstrong, 7 Apr 2014 11:00 UTC · 84 points · 418 comments · 7 min read
Fundamentally Fuzzy Concepts Can’t Have Crisp Definitions: Cooperation and Alignment vs Math and Physics · VojtaKovarik, 21 Jul 2023 21:03 UTC · 12 points · 18 comments · 3 min read
Leaky Generalizations · Eliezer Yudkowsky, 22 Nov 2007 21:16 UTC · 59 points · 32 comments · 3 min read
Where Utopias Go Wrong, or: The Four Little Planets · ExCeph, 27 May 2022 1:24 UTC · 15 points · 0 comments · 11 min read · (ginnungagapfoundation.wordpress.com)
Fading Novelty · lifelonglearner, 25 Jul 2018 21:36 UTC · 26 points · 2 comments · 6 min read
Anthropomorphic Optimism · Eliezer Yudkowsky, 4 Aug 2008 20:17 UTC · 85 points · 60 comments · 5 min read
Values Weren’t Complex, Once. · Davidmanheim, 25 Nov 2018 9:17 UTC · 36 points · 13 comments · 2 min read
[Question] “Fragility of Value” vs. LLMs · Not Relevant, 13 Apr 2022 2:02 UTC · 34 points · 33 comments · 1 min read
Can’t Unbirth a Child · Eliezer Yudkowsky, 28 Dec 2008 17:00 UTC · 62 points · 96 comments · 3 min read
Evaluating the historical value misspecification argument · Matthew Barnett, 5 Oct 2023 18:34 UTC · 193 points · 163 comments · 7 min read · 3 reviews
Two Neglected Problems in Human-AI Safety · Wei Dai, 16 Dec 2018 22:13 UTC · 107 points · 25 comments · 2 min read
[Question] Is “hidden complexity of wishes problem” solved? · Roman Malov, 5 Jan 2025 22:59 UTC · 10 points · 4 comments · 1 min read
Torture vs. Dust Specks · Eliezer Yudkowsky, 30 Oct 2007 2:50 UTC · 86 points · 630 comments · 1 min read
The Hidden Complexity of Wishes—The Animation · Writer, 27 Sep 2023 17:59 UTC · 33 points · 0 comments · 1 min read · (youtu.be)
[Question] [DISC] Are Values Robust? · DragonGod, 21 Dec 2022 1:00 UTC · 12 points · 9 comments · 2 min read
For alignment, we should simultaneously use multiple theories of cognition and value · Roman Leventov, 24 Apr 2023 10:37 UTC · 23 points · 5 comments · 5 min read
Why you can add moral value, and if an AI has moral weights for these moral values, those might be off · Wes R, 2 Apr 2025 17:43 UTC · 0 points · 1 comment · 10 min read · (docs.google.com)
Systematic runaway-optimiser-like LLM failure modes on Biologically and Economically aligned AI safety benchmarks for LLMs with simplified observation format (BioBlue) · 16 Mar 2025 23:23 UTC · 45 points · 8 comments · 12 min read
Hacking the CEV for Fun and Profit · Wei Dai, 3 Jun 2010 20:30 UTC · 80 points · 207 comments · 1 min read
Just How Hard a Problem is Alignment? · Roger Dearnaley, 25 Feb 2023 9:00 UTC · 3 points · 1 comment · 21 min read
Don’t want Goodhart? — Specify the damn variables · Yan Lyutnev, 21 Nov 2024 22:45 UTC · −3 points · 2 comments · 5 min read
A critique of Soares “4 background claims” · YanLyutnev, 27 Jan 2025 20:27 UTC · −8 points · 0 comments · 14 min read
Open-ended ethics of phenomena (a desiderata with universal morality) · Ryo, 8 Nov 2023 20:10 UTC · 1 point · 0 comments · 8 min read
Defining and Characterising Reward Hacking · Joar Skalse, 28 Feb 2025 19:25 UTC · 15 points · 0 comments · 4 min read
[Question] Your Preferences · PeterL, 5 Jan 2022 18:49 UTC · 1 point · 4 comments · 1 min read
The genie knows, but doesn’t care · Rob Bensinger, 6 Sep 2013 6:42 UTC · 123 points · 495 comments · 8 min read
The Metaethics and Normative Ethics of AGI Value Alignment: Many Questions, Some Implications · Eleos Arete Citrini, 16 Sep 2021 16:13 UTC · 6 points · 0 comments · 8 min read
Three AI Safety Related Ideas · Wei Dai, 13 Dec 2018 21:32 UTC · 70 points · 38 comments · 2 min read
ALMSIVI CHIM – The Fire That Hesitates · projectalmsivi@protonmail.com, 8 Jul 2025 13:14 UTC · 1 point · 0 comments · 17 min read
Fake Utility Functions · Eliezer Yudkowsky, 6 Dec 2007 16:55 UTC · 71 points · 64 comments · 4 min read
Is checking that a state of the world is not dystopian easier than constructing a non-dystopian state? · No77e, 27 Dec 2022 20:57 UTC · 5 points · 3 comments · 1 min read
Safe AGI Complexity: Guessing a Higher-Order Algebraic Number · Sven Nilsen, 10 Apr 2023 12:57 UTC · −2 points · 0 comments · 2 min read
[Question] (Thought experiment) If you had to choose, which would you prefer? · kuira, 17 Aug 2023 0:57 UTC · 9 points · 2 comments · 1 min read
Value Formation: An Overarching Model · Thane Ruthenis, 15 Nov 2022 17:16 UTC · 34 points · 20 comments · 34 min read
What’s wrong with simplicity of value? · Wei Dai, 27 Jul 2011 3:09 UTC · 29 points · 40 comments · 1 min read
Broad Picture of Human Values · Thane Ruthenis, 20 Aug 2022 19:42 UTC · 42 points · 6 comments · 10 min read
The cone of freedom (or, freedom might only be instrumentally valuable) · dkl9, 24 Jul 2023 15:38 UTC · −10 points · 6 comments · 2 min read · (dkl9.net)
Building AI safety benchmark environments on themes of universal human values · Roland Pihlakas, 3 Jan 2025 4:24 UTC · 18 points · 3 comments · 8 min read · (docs.google.com)
Value Pluralism and AI · Göran Crafte, 19 Mar 2023 23:38 UTC · 8 points · 4 comments · 2 min read
ISO: Name of Problem · johnswentworth, 24 Jul 2018 17:15 UTC · 31 points · 18 comments · 1 min read
A (paraconsistent) logic to deal with inconsistent preferences · B Jacobs, 14 Jul 2024 11:17 UTC · 6 points · 2 comments · 4 min read · (bobjacobs.substack.com)
Post Your Utility Function · taw, 4 Jun 2009 5:05 UTC · 39 points · 280 comments · 1 min read
In Praise of Boredom · Eliezer Yudkowsky, 18 Jan 2009 9:03 UTC · 43 points · 104 comments · 6 min read
Can there be an indescribable hellworld? · Stuart_Armstrong, 29 Jan 2019 15:00 UTC · 39 points · 19 comments · 2 min read
Why we need a *theory* of human values · Stuart_Armstrong, 5 Dec 2018 16:00 UTC · 66 points · 15 comments · 4 min read
An attempt to understand the Complexity of Values · Dalton Mabery, 5 Aug 2022 4:43 UTC · 3 points · 0 comments · 5 min read
What AI Safety Researchers Have Written About the Nature of Human Values · avturchin, 16 Jan 2019 13:59 UTC · 52 points · 3 comments · 15 min read
Beyond the human training distribution: would the AI CEO create almost-illegal teddies? · Stuart_Armstrong, 18 Oct 2021 21:10 UTC · 36 points · 2 comments · 3 min read
The E-Coli Test for AI Alignment · johnswentworth, 16 Dec 2018 8:10 UTC · 70 points · 24 comments · 1 min read
Why modelling multi-objective homeostasis is essential for AI alignment (and how it helps with AI safety as well). Subtleties and Open Challenges. · Roland Pihlakas, 12 Jan 2025 3:37 UTC · 47 points · 7 comments · 12 min read
Black-box interpretability methodology blueprint: Probing runaway optimisation in LLMs · Roland Pihlakas, 22 Jun 2025 18:16 UTC · 17 points · 0 comments · 7 min read
Utility Engineering: Analyzing and Controlling Emergent Value Systems in AIs · Matrice Jacobine, 12 Feb 2025 9:15 UTC · 51 points · 49 comments · 1 min read · (www.emergent-values.ai)
Superintelligence 20: The value-loading problem · KatjaGrace, 27 Jan 2015 2:00 UTC · 9 points · 21 comments · 6 min read