
Complexity of value

Last edit: 14 Apr 2016 3:17 UTC by Eliezer Yudkowsky

Introduction

“Complexity of value” is the idea that if you tried to write an AI that would do right things (or maximally right things, or adequately right things) without further looking at humans (so the AI can’t take in a flood of additional data from human advice; it has to be complete as it stands once you’re finished creating it), the AI’s preferences or utility function would need to contain a large amount of data (algorithmic complexity). Conversely, if you try to write an AI that directly wants simple things, or try to specify the AI’s preferences using a small amount of data or code, it won’t do acceptably right things in our universe.

Complexity of value says, “There’s no simple and non-meta solution to AI preferences” or “The things we want AIs to want are complicated in the Kolmogorov-complexity sense” or “Any simple goal you try to describe that is All We Need To Program Into AIs is almost certainly wrong.”

Complexity of value is a further idea above and beyond the orthogonality thesis, which states that AIs don’t automatically do the right thing and that we can have, e.g., paperclip maximizers. Even if we accept that paperclip maximizers are possible, simple, and nonforced, this wouldn’t yet imply that it’s very difficult to make AIs that do the right thing. If the right thing is very simple to encode—if there are value optimizers that are scarcely more complex than diamond maximizers—then it might not be especially hard to build a nice AI even if not all AIs are nice. Complexity of Value is the further proposition that says no, this is foreseeably quite hard—not because AIs have ‘natural’ anti-nice desires, but because niceness requires a lot of work to specify.

Frankena’s list

As an intuition pump for the complexity of value thesis, consider William Frankena’s list of things which many cultures and people seem to value (for their own sake rather than their external consequences):

“Life, consciousness, and activity; health and strength; pleasures and satisfactions of all or certain kinds; happiness, beatitude, contentment, etc.; truth; knowledge and true opinions of various kinds, understanding, wisdom; beauty, harmony, proportion in objects contemplated; aesthetic experience; morally good dispositions or virtues; mutual affection, love, friendship, cooperation; just distribution of goods and evils; harmony and proportion in one’s own life; power and experiences of achievement; self-expression; freedom; peace, security; adventure and novelty; and good reputation, honor, esteem, etc.”

When we try to list out properties of a human or galactic future that seem like they’d be very nice, we at least seem to value a fair number of things that aren’t reducible to each other. (What initially look like plausible “But you do A to get B” arguments usually fall apart when we look for third alternatives to doing A to get B. Marginally adding some freedom can marginally increase the happiness of a human, so a happiness optimizer that can only exert a small push toward freedom might choose to do so. That doesn’t mean that a pure, powerful happiness maximizer would instrumentally optimize freedom. If an agent cares about happiness but not freedom, the outcome that maximizes its preferences is a large number of brains set to maximum happiness. When we don’t just seize on one possible case where a B-optimizer might use A as a strategy, but instead look for further C-strategies that might maximize B even better than A, the attempt to reduce A to an instrumental B-maximization strategy often falls apart.) It’s in this sense that the items on Frankena’s list don’t seem to reduce to each other as a matter of pure preference, even though humans in everyday life often pursue several of these goals at the same time.
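
As a minimal sketch of that argument in code (every outcome and number below is invented purely for illustration; the labels A, B, and C match their usage in the paragraph above): a happiness-only optimizer nudges toward freedom only while freedom is its best available lever for happiness, and abandons it once a more direct C-strategy becomes reachable.

```python
# Every outcome and number here is invented for illustration; A, B, and C
# label the roles described in the paragraph above (A = freedom, B = happiness,
# C = a strategy that maximizes B better than A does).

outcomes = {
    # name: (total_happiness, freedom)
    "status quo":                      (1.0, 0.5),
    "marginally freer society":        (1.1, 0.7),  # A as a means to B
    "brains set to maximum happiness": (9.9, 0.0),  # C: maximizes B directly
}

def happiness_only(outcome):
    happiness, _freedom = outcome
    return happiness  # freedom contributes nothing terminally

# A weak optimizer that can only exert a small push from the status quo:
reachable = ["status quo", "marginally freer society"]
print(max(reachable, key=lambda name: happiness_only(outcomes[name])))
# -> marginally freer society

# A powerful optimizer that chooses over everything it can bring about:
print(max(outcomes, key=lambda name: happiness_only(outcomes[name])))
# -> brains set to maximum happiness
```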

Complexity of value says that, in this case, the way things seem is the way they are: Frankena’s list is not encodable in one page of Python code. This proposition can’t be established definitively without settling on a sufficiently well-specified metaethics, such as reflective equilibrium, to make it clear that there is indeed no a priori reason for normativity to be algorithmically simple. But the basic intuition for Complexity of Value is provided just by the fact that Frankena’s list is more than one item long, and that many of its individual terms don’t seem likely to have algorithmically simple definitions distinguishing their valuable from non-valuable forms.

Lack of a central core

We can understand the idea of complexity of value by contrasting it to the situation with respect to epistemic reasoning aka truth-finding or answering simple factual questions about the world. In an ideal sense, we can try to compress and reduce the idea of mapping the world well down to algorithmically simple notions like “Occam’s Razor” and “Bayesian updating”. In a practical sense, natural selection, in the course of optimizing humans to solve factual questions like “Where can I find a tree with fruit?” or “Are brightly colored snakes usually poisonous?” or “Who’s plotting against me?”, ended up with enough of the central core of epistemology that humans were later able to answer questions like “How are the planets moving?” or “What happens if I fire this rocket?”, even though humans hadn’t been explicitly selected on to answer those exact questions.

Because epistemology does have a central core of simplicity and Bayesian updating, selecting for an organism that got some pretty complicated epistemic questions right enough to reproduce also caused that organism to start understanding things like General Relativity. When it comes to truth-finding, we’d expect by default the same thing to be true of an Artificial Intelligence: if you build it to get epistemically correct answers on lots of widely different problems, it will contain a core of truth-finding and start getting epistemically correct answers on lots of other problems—even problems completely different from your training set, the way that humans’ understanding of General Relativity wasn’t like any hunter-gatherer problem.
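
A minimal sketch of that “central core” in code (the hypotheses, their toy description lengths, and the data below are all assumptions made up for this illustration): a simplicity-weighted prior plus Bayesian updating picks out the generating rule from a short data sample and keeps predicting correctly on continuations it was never selected on.

```python
import math

# Occam prior (weight 2**-description_length) plus Bayesian updating over a
# handful of toy hypotheses about a binary sequence. Each predictor returns
# P(next bit = 1) given the bits seen so far.

def pred_always_zero(history):
    return 0.01

def pred_always_one(history):
    return 0.99

def pred_alternating(history):
    if not history:
        return 0.5
    return 0.99 if history[-1] == 0 else 0.01

def pred_fair_coin(history):
    return 0.5

hypotheses = {
    # name: (toy description length in bits, predictor)
    "always-0":    (2, pred_always_zero),
    "always-1":    (2, pred_always_one),
    "alternating": (4, pred_alternating),
    "fair-coin":   (1, pred_fair_coin),
}

def log_posterior(data):
    """Unnormalized log-posterior: simplicity prior plus data likelihood."""
    scores = {}
    for name, (complexity, predict) in hypotheses.items():
        logp = -complexity * math.log(2)  # Occam: prior weight 2**-complexity
        for i, bit in enumerate(data):
            p_one = predict(data[:i])
            logp += math.log(p_one if bit == 1 else 1.0 - p_one)
        scores[name] = logp
    return scores

data = [0, 1, 0, 1, 0, 1, 0, 1]
scores = log_posterior(data)
print(max(scores, key=scores.get))  # "alternating" wins despite its longer
                                    # description, and keeps predicting bits
                                    # the selection process never saw.
```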

The complexity of value thesis is that there isn’t a simple core to normativity, which means that if you hone your AI to do normatively good things on A, B, and C and then confront it with a very different problem D, the AI may do the wrong thing on D. There’s a large number of independent ideal “gears” inside the complex machinery of value, compared to epistemology, which in principle might only contain “prefer simpler hypotheses” and “prefer hypotheses that match the evidence”.

The Orthogonality Thesis says that, contrary to the intuition that maximizing paperclips feels “stupid”, you can have arbitrarily cognitively powerful entities that maximize paperclips, or that pursue arbitrarily complicated other goals. So while intuitively you might think it would be simple to avoid paperclip maximizers, requiring no work at all for a sufficiently advanced AI, the Orthogonality Thesis says that things will be more difficult than that: you have to put in some work to have the AI do the right thing.

The Complexity of Value thesis is the next step after Orthogonality. It says that, contrary to the feeling that “rightness ought to be simple, darn it”, normativity turns out not to have an algorithmically simple core, not the way that correctly answering questions of fact has a central tendency that generalizes well. And so, even though an AI trained to do well on problems like steering cars or figuring out General Relativity from scratch may hit on a core capability that leads it to do well on arbitrarily more complicated problems of galactic scale, we can’t rely on getting an equally generous bonanza of generalization from an AI that seems to do well on a small but varied set of moral and ethical problems—it may still fail the next problem that isn’t like anything in the training set. To the extent that we have very strong reasons for prior confidence in Complexity of Value, in fact, we ought to be suspicious and worried about an AI that seems to be pulling correct moral answers from nowhere—it is much more likely to have hit upon the convergent instrumental strategy “say what makes the programmers trust you” than to have hit upon a simple core of all normativity.

Key sub-propositions

Complexity of Value requires Orthogonality, and would be implied by three further sub-propositions:

The intrinsic complexity of value proposition is that the properties we want AIs to achieve (whatever stands in for the metasyntactic variable ‘value’) have a large amount of intrinsic information, in the sense of comprising a large number of independent facts that aren’t being generated by a single computationally simple rule.

A very bad example that may nonetheless provide an important intuition is to imagine trying to pinpoint to an AI what constitutes ‘worthwhile happiness’. The AI suggests a universe tiled with tiny Q-learning algorithms receiving high rewards. After some explanation and several labeled datasets, the AI suggests a human brain with a wire stuck into its pleasure center. After further explanation, the AI suggests a human in a holodeck. You begin talking about the importance of believing truly, and about how your values call for apparent human relationships to be real relationships rather than hallucinations. The AI asks you what constitutes a good human relationship to be happy about. This series of questions occurs because (arguendo) the AI keeps running into questions whose answers are not AI-obvious from the answers already given: each step involves new things you want, whose desirability wasn’t implied by what you had already said. The upshot is that the specification of ‘worthwhile happiness’ involves a long series of facts that aren’t reducible to the previous facts, and some of your preferences may involve many fine details of surprising importance. In other words, hand-coding ‘worthwhile happiness’ into the AI would be at least as hard as hand-coding a formal rule that could recognize which pictures contain cats. (I.e., impossible.)

The second proposition is incompressibility of value, which says that attempts to reduce these complex values to some incredibly simple and elegant principle fail (much like early attempts by e.g. Bentham to reduce all human value to pleasure), and that no simple instruction given to an AI will happen to target outcomes of high value either. The core reason to expect a priori that all such attempts will fail is that most 1000-byte strings aren’t compressible down to some incredibly simple pattern, no matter how many clever tricks you throw at them: fewer than 1 in 1024 such strings can be shortened by even 11 bits, never mind compressed to 10 bytes. Because there are so many different proposals for why some simple instruction to an AI should end up achieving high-value outcomes, or for why all human value reduces to some simple principle, there is no single central demonstration that all these proposals must fail; but there is a sense in which, a priori, we should strongly expect all such clever attempts to fail. Many disagreeable attempts at reducing value A to value B, such as Juergen Schmidhuber’s attempt to reduce all human value to increasing the compression of sensory information, stand as a further cautionary lesson.
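
A quick sanity check on that counting argument (the specific sizes below are arbitrary choices for the demo): at most 2^(n-k+1) - 1 programs are k or more bits shorter than an n-bit string, so fewer than a 2^(1-k) fraction of such strings can be compressed by k bits no matter how clever the compressor, and a real general-purpose compressor indeed fails to shrink random bytes.

```python
import os
import zlib

# (1) Counting bound: there are at most 2**(n-k+1) - 1 binary programs that are
# k or more bits shorter than an n-bit string, so the fraction of n-bit strings
# compressible by at least k bits is below 2**(1-k), for any compressor.
def max_fraction_compressible(k_bits_saved: int) -> float:
    return 2.0 ** (1 - k_bits_saved)

print(max_fraction_compressible(11))   # 1/1024: the bound cited in the text above
print(max_fraction_compressible(80))   # saving 10 bytes: a bound of 2**-79

# (2) Empirical version: a random 1000-byte string does not get shorter under a
# real general-purpose compressor (zlib typically adds a little overhead).
random_bytes = os.urandom(1000)
print(len(zlib.compress(random_bytes)))  # usually slightly more than 1000
```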

The third proposition is fragility of value, which says that if you have a 1000-byte exact specification of worthwhile happiness and you begin to mutate it, the value created by an AI running on the mutated definition falls off rapidly. E.g., an AI with only 950 bytes of the full definition may end up creating 0% of the value rather than 95% of the value. (E.g., the AI understood all aspects of what makes for a life well-lived… except the part about requiring a conscious observer to experience it.)
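
A toy sketch of what that falloff can look like (the outcomes, value terms, and numbers below are all invented for illustration): optimize a proxy utility that is missing a single clause of the “true” specification, and the chosen outcome can carry essentially none of the intended value.

```python
# Proxy utility = true utility with one clause dropped (the requirement that a
# conscious observer be present). Optimizing the proxy picks an outcome that
# scores highest on the proxy and scores zero on the true utility.

candidate_outcomes = [
    # (rich_experiences, relationships, conscious_observers_present)
    (1.0, 1.0, True),    # a flourishing civilization
    (1.0, 0.9, True),    # slightly worse, still fine
    (5.0, 5.0, False),   # vast, intricately detailed, and empty
]

def true_value(outcome):
    experiences, relationships, conscious = outcome
    return (experiences + relationships) if conscious else 0.0

def proxy_value(outcome):
    experiences, relationships, _conscious = outcome  # missing clause
    return experiences + relationships

chosen = max(candidate_outcomes, key=proxy_value)
print(proxy_value(chosen), true_value(chosen))  # 10.0 under the proxy, 0.0 in fact
```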

Together, these propositions would imply that to achieve an adequate amount of value (e.g. 90% of potential value, or even 20% of potential value), there may be no simple hand-coded object-level goal for the AI that results in that value’s realization. E.g., you can’t just tell it to ‘maximize happiness’, with some hand-coded rule for identifying happiness.

Centrality

Complexity of Value is a central proposition in value alignment theory. Many foreseen difficulties revolve around it:

More generally:

Importance

Many policy questions strongly depend on Complexity of Value, mostly having to do with the overall difficulty of developing value-aligned AI, e.g.:

It has been argued that there are psychological biases and popular mistakes leading to beliefs that directly or by implication deny Complexity of Value. To the extent one credits that Complexity of Value is probably true, one should arguably be concerned about the number of early assessments of the value alignment problem that seem to rely on its being false (e.g., assuming we need only hardcode a particular goal into the AI, or more generally treating the value alignment problem as not panic-worthily difficult).

Truth condition

The Complexity of Value proposition is true if, relative to viable and acceptable real-world methodologies for AI development, there isn’t any reliably knowable way to specify the AI’s object-level preferences as a structure of low algorithmic complexity such that running that AI achieves enough of the possible value, for reasonable definitions of value.

Caveats:

Viable and acceptable computation

Suppose there turns out to exist, in principle, a relatively simple Turing machine (e.g. 100 states) that picks out ‘value’ by re-running entire evolutionary histories, creating and discarding a hundred billion sapient races in order to pick out one that ended up relatively similar to humanity. This would use an unrealistically large amount of computing power and also commit an unacceptable amount of mindcrime.

Posts tagged Complexity of value

The Hidden Complexity of Wishes · Eliezer Yudkowsky, 24 Nov 2007 0:12 UTC · 183 points · 199 comments · 8 min read
Utilitarianism and the replaceability of desires and attachments · MichaelStJules, 27 Jul 2024 1:57 UTC · 5 points · 2 comments · 12 min read
Some implications of radical empathy · MichaelStJules, 7 Jan 2025 16:10 UTC · 3 points · 0 comments · 7 min read
Actualism, asymmetry and extinction · MichaelStJules, 7 Jan 2025 16:02 UTC · 1 point · 4 comments · 9 min read
Really radical empathy · MichaelStJules, 6 Jan 2025 17:46 UTC · 19 points · 0 comments · 10 min read
Shard Theory: An Overview · David Udell, 11 Aug 2022 5:44 UTC · 167 points · 34 comments · 10 min read
But exactly how complex and fragile? · KatjaGrace, 3 Nov 2019 18:20 UTC · 87 points · 32 comments · 3 min read · 1 review · (meteuphoric.com)
You don’t know how bad most things are nor precisely how they’re bad. · Solenoid_Entity, 4 Aug 2024 14:12 UTC · 338 points · 49 comments · 5 min read
Review of ‘But exactly how complex and fragile?’ · TurnTrout, 6 Jan 2021 18:39 UTC · 57 points · 0 comments · 8 min read
An even deeper atheism · Joe Carlsmith, 11 Jan 2024 17:28 UTC · 125 points · 47 comments · 15 min read
Value is Fragile · Eliezer Yudkowsky, 29 Jan 2009 8:46 UTC · 177 points · 109 comments · 6 min read
Where does Sonnet 4.5’s desire to “not get too comfortable” come from? · Kaj_Sotala, 4 Oct 2025 10:19 UTC · 103 points · 23 comments · 64 min read
Aggregative Principles of Social Justice · Cleo Nardo, 5 Jun 2024 13:44 UTC · 29 points · 10 comments · 37 min read
Stratified Utopia · Cleo Nardo, 21 Oct 2025 19:09 UTC · 68 points · 8 comments · 11 min read
Reversible changes: consider a bucket of water · Stuart_Armstrong, 26 Aug 2019 22:55 UTC · 25 points · 18 comments · 2 min read
Notes on Caution · David Gross, 1 Dec 2022 3:05 UTC · 14 points · 0 comments · 19 min read
Disentangling arguments for the importance of AI safety · Richard_Ngo, 21 Jan 2019 12:41 UTC · 133 points · 23 comments · 8 min read
High Challenge · Eliezer Yudkowsky, 19 Dec 2008 0:51 UTC · 82 points · 76 comments · 4 min read
General alignment properties · TurnTrout, 8 Aug 2022 23:40 UTC · 51 points · 2 comments · 1 min read
The Gift We Give To Tomorrow · Eliezer Yudkowsky, 17 Jul 2008 6:07 UTC · 165 points · 101 comments · 8 min read
31 Laws of Fun · Eliezer Yudkowsky, 26 Jan 2009 10:13 UTC · 104 points · 36 comments · 8 min read
On Fleshling Safety: A Debate by Klurl and Trapaucius. · Eliezer Yudkowsky, 26 Oct 2025 23:44 UTC · 223 points · 48 comments · 79 min read
Conflating value alignment and intent alignment is causing confusion · Seth Herd, 5 Sep 2024 16:39 UTC · 49 points · 18 comments · 5 min read
Terminal Values and Instrumental Values · Eliezer Yudkowsky, 15 Nov 2007 7:56 UTC · 119 points · 46 comments · 10 min read
Alan Carter on the Complexity of Value · Ghatanathoah, 10 May 2012 7:23 UTC · 47 points · 41 comments · 7 min read
Complexity of Value ≠ Complexity of Outcome · Wei Dai, 30 Jan 2010 2:50 UTC · 65 points · 223 comments · 3 min read
The two-layer model of human values, and problems with synthesizing preferences · Kaj_Sotala, 24 Jan 2020 15:17 UTC · 70 points · 16 comments · 9 min read
Notes on Moderation, Balance, & Harmony · David Gross, 25 Dec 2020 2:44 UTC · 9 points · 1 comment · 8 min read
Beyond algorithmic equivalence: self-modelling · Stuart_Armstrong, 28 Feb 2018 16:55 UTC · 10 points · 3 comments · 1 min read
Would I think for ten thousand years? · Stuart_Armstrong, 11 Feb 2019 19:37 UTC · 28 points · 13 comments · 1 min read
Have you felt exiert yet? · Stuart_Armstrong, 5 Jan 2018 17:03 UTC · 28 points · 7 comments · 1 min read
Bias in rationality is much worse than noise · Stuart_Armstrong, 31 Oct 2017 11:57 UTC · 11 points · 0 comments · 2 min read
2012 Robin Hanson comment on “Intelligence Explosion: Evidence and Import” · Rob Bensinger, 2 Apr 2021 16:26 UTC · 28 points · 4 comments · 3 min read
Our values are underdefined, changeable, and manipulable · Stuart_Armstrong, 2 Nov 2017 11:09 UTC · 26 points · 6 comments · 3 min read
Intent alignment as a stepping-stone to value alignment · Seth Herd, 5 Nov 2024 20:43 UTC · 37 points · 8 comments · 3 min read
Learning societal values from law as part of an AGI alignment strategy · John Nay, 21 Oct 2022 2:03 UTC · 5 points · 18 comments · 54 min read
The Pointer Resolution Problem · Jozdien, 16 Feb 2024 21:25 UTC · 41 points · 20 comments · 3 min read
Why Do We Engage in Moral Simplification? · Wei Dai, 14 Feb 2011 1:16 UTC · 33 points · 36 comments · 2 min read
Alignment allows “nonrobust” decision-influences and doesn’t require robust grading · TurnTrout, 29 Nov 2022 6:23 UTC · 62 points · 41 comments · 15 min read
Sympathetic Minds · Eliezer Yudkowsky, 19 Jan 2009 9:31 UTC · 75 points · 27 comments · 5 min read
Sequence overview: Welfare and moral weights · MichaelStJules, 15 Aug 2024 4:22 UTC · 7 points · 0 comments · 1 min read
Boredom vs. Scope Insensitivity · Wei Dai, 24 Sep 2009 11:45 UTC · 81 points · 41 comments · 3 min read
Don’t want Goodhart? — Specify the variables more · YanLyutnev, 21 Nov 2024 22:43 UTC · 2 points · 2 comments · 5 min read
Babies and Bunnies: A Caution About Evo-Psych · Alicorn, 22 Feb 2010 1:53 UTC · 81 points · 843 comments · 2 min read
Content generation. Where do we draw the line? · Q Home, 9 Aug 2022 10:51 UTC · 6 points · 7 comments · 2 min read
Siren worlds and the perils of over-optimised search · Stuart_Armstrong, 7 Apr 2014 11:00 UTC · 84 points · 418 comments · 7 min read
Fundamentally Fuzzy Concepts Can’t Have Crisp Definitions: Cooperation and Alignment vs Math and Physics · VojtaKovarik, 21 Jul 2023 21:03 UTC · 12 points · 18 comments · 3 min read
Leaky Generalizations · Eliezer Yudkowsky, 22 Nov 2007 21:16 UTC · 59 points · 32 comments · 3 min read
Where Utopias Go Wrong, or: The Four Little Planets · ExCeph, 27 May 2022 1:24 UTC · 15 points · 0 comments · 11 min read · (ginnungagapfoundation.wordpress.com)
Fading Novelty · lifelonglearner, 25 Jul 2018 21:36 UTC · 26 points · 2 comments · 6 min read
Anthropomorphic Optimism · Eliezer Yudkowsky, 4 Aug 2008 20:17 UTC · 85 points · 60 comments · 5 min read
Values Weren’t Complex, Once. · Davidmanheim, 25 Nov 2018 9:17 UTC · 36 points · 13 comments · 2 min read
[Question] “Fragility of Value” vs. LLMs · Not Relevant, 13 Apr 2022 2:02 UTC · 34 points · 33 comments · 1 min read
Can’t Unbirth a Child · Eliezer Yudkowsky, 28 Dec 2008 17:00 UTC · 62 points · 96 comments · 3 min read
Evaluating the historical value misspecification argument · Matthew Barnett, 5 Oct 2023 18:34 UTC · 193 points · 163 comments · 7 min read · 3 reviews
Two Neglected Problems in Human-AI Safety · Wei Dai, 16 Dec 2018 22:13 UTC · 107 points · 25 comments · 2 min read
[Question] Is “hidden complexity of wishes problem” solved? · Roman Malov, 5 Jan 2025 22:59 UTC · 10 points · 4 comments · 1 min read
Torture vs. Dust Specks · Eliezer Yudkowsky, 30 Oct 2007 2:50 UTC · 86 points · 630 comments · 1 min read
The Hidden Complexity of Wishes—The Animation · Writer, 27 Sep 2023 17:59 UTC · 33 points · 0 comments · 1 min read · (youtu.be)
[Question] [DISC] Are Values Robust? · DragonGod, 21 Dec 2022 1:00 UTC · 12 points · 9 comments · 2 min read
For alignment, we should simultaneously use multiple theories of cognition and value · Roman Leventov, 24 Apr 2023 10:37 UTC · 23 points · 5 comments · 5 min read
Why you can add moral value, and if an AI has moral weights for these moral values, those might be off · Wes R, 2 Apr 2025 17:43 UTC · 0 points · 1 comment · 10 min read · (docs.google.com)
Systematic runaway-optimiser-like LLM failure modes on Biologically and Economically aligned AI safety benchmarks for LLMs with simplified observation format (BioBlue) · 16 Mar 2025 23:23 UTC · 45 points · 8 comments · 12 min read
Hacking the CEV for Fun and Profit · Wei Dai, 3 Jun 2010 20:30 UTC · 80 points · 207 comments · 1 min read
Just How Hard a Problem is Alignment? · Roger Dearnaley, 25 Feb 2023 9:00 UTC · 3 points · 1 comment · 21 min read
Don’t want Goodhart? — Specify the damn variables · Yan Lyutnev, 21 Nov 2024 22:45 UTC · −3 points · 2 comments · 5 min read
A critique of Soares “4 background claims” · YanLyutnev, 27 Jan 2025 20:27 UTC · −8 points · 0 comments · 14 min read
Open-ended ethics of phenomena (a desiderata with universal morality) · Ryo, 8 Nov 2023 20:10 UTC · 1 point · 0 comments · 8 min read
Defining and Characterising Reward Hacking · Joar Skalse, 28 Feb 2025 19:25 UTC · 15 points · 0 comments · 4 min read
[Question] Your Preferences · PeterL, 5 Jan 2022 18:49 UTC · 1 point · 4 comments · 1 min read
The genie knows, but doesn’t care · Rob Bensinger, 6 Sep 2013 6:42 UTC · 123 points · 495 comments · 8 min read
The Metaethics and Normative Ethics of AGI Value Alignment: Many Questions, Some Implications · Eleos Arete Citrini, 16 Sep 2021 16:13 UTC · 6 points · 0 comments · 8 min read
Three AI Safety Related Ideas · Wei Dai, 13 Dec 2018 21:32 UTC · 70 points · 38 comments · 2 min read
ALMSIVI CHIM – The Fire That Hesitates · projectalmsivi@protonmail.com, 8 Jul 2025 13:14 UTC · 1 point · 0 comments · 17 min read
Fake Utility Functions · Eliezer Yudkowsky, 6 Dec 2007 16:55 UTC · 71 points · 64 comments · 4 min read
Is checking that a state of the world is not dystopian easier than constructing a non-dystopian state? · No77e, 27 Dec 2022 20:57 UTC · 5 points · 3 comments · 1 min read
Safe AGI Complexity: Guessing a Higher-Order Algebraic Number · Sven Nilsen, 10 Apr 2023 12:57 UTC · −2 points · 0 comments · 2 min read
[Question] (Thought experiment) If you had to choose, which would you prefer? · kuira, 17 Aug 2023 0:57 UTC · 9 points · 2 comments · 1 min read
Value Formation: An Overarching Model · Thane Ruthenis, 15 Nov 2022 17:16 UTC · 34 points · 20 comments · 34 min read
What’s wrong with simplicity of value? · Wei Dai, 27 Jul 2011 3:09 UTC · 29 points · 40 comments · 1 min read
Broad Picture of Human Values · Thane Ruthenis, 20 Aug 2022 19:42 UTC · 42 points · 6 comments · 10 min read
The cone of freedom (or, freedom might only be instrumentally valuable) · dkl9, 24 Jul 2023 15:38 UTC · −10 points · 6 comments · 2 min read · (dkl9.net)
Building AI safety benchmark environments on themes of universal human values · Roland Pihlakas, 3 Jan 2025 4:24 UTC · 18 points · 3 comments · 8 min read · (docs.google.com)
Value Pluralism and AI · Göran Crafte, 19 Mar 2023 23:38 UTC · 8 points · 4 comments · 2 min read
ISO: Name of Problem · johnswentworth, 24 Jul 2018 17:15 UTC · 31 points · 18 comments · 1 min read
A (paraconsistent) logic to deal with inconsistent preferences · B Jacobs, 14 Jul 2024 11:17 UTC · 6 points · 2 comments · 4 min read · (bobjacobs.substack.com)
Post Your Utility Function · taw, 4 Jun 2009 5:05 UTC · 39 points · 280 comments · 1 min read
In Praise of Boredom · Eliezer Yudkowsky, 18 Jan 2009 9:03 UTC · 43 points · 104 comments · 6 min read
Can there be an indescribable hellworld? · Stuart_Armstrong, 29 Jan 2019 15:00 UTC · 39 points · 19 comments · 2 min read
Why we need a *theory* of human values · Stuart_Armstrong, 5 Dec 2018 16:00 UTC · 66 points · 15 comments · 4 min read
An attempt to understand the Complexity of Values · Dalton Mabery, 5 Aug 2022 4:43 UTC · 3 points · 0 comments · 5 min read
What AI Safety Researchers Have Written About the Nature of Human Values · avturchin, 16 Jan 2019 13:59 UTC · 52 points · 3 comments · 15 min read
Beyond the human training distribution: would the AI CEO create almost-illegal teddies? · Stuart_Armstrong, 18 Oct 2021 21:10 UTC · 36 points · 2 comments · 3 min read
The E-Coli Test for AI Alignment · johnswentworth, 16 Dec 2018 8:10 UTC · 70 points · 24 comments · 1 min read
Why modelling multi-objective homeostasis is essential for AI alignment (and how it helps with AI safety as well). Subtleties and Open Challenges. · Roland Pihlakas, 12 Jan 2025 3:37 UTC · 47 points · 7 comments · 12 min read
Black-box interpretability methodology blueprint: Probing runaway optimisation in LLMs · Roland Pihlakas, 22 Jun 2025 18:16 UTC · 17 points · 0 comments · 7 min read
Utility Engineering: Analyzing and Controlling Emergent Value Systems in AIs · Matrice Jacobine, 12 Feb 2025 9:15 UTC · 51 points · 49 comments · 1 min read · (www.emergent-values.ai)
Superintelligence 20: The value-loading problem · KatjaGrace, 27 Jan 2015 2:00 UTC · 9 points · 21 comments · 6 min read