Impact Regularization

Tag. Last edit: 30 Dec 2024 9:57 UTC by Dakara

Impact regularizers penalize an AI for affecting us too much. To reduce the risk posed by a powerful AI, you might want it to try to accomplish its goals with as little impact on the world as possible. For example, suppose you reward the AI for crossing a room. To maximize time-discounted total reward, the optimal policy sprints to the other side and makes a huge mess along the way, because nothing in the reward tells it to avoid side effects.

How do you rigorously define “low impact” in a way that a computer can understand? How do you measure impact? These questions matter for both prosaic and future AI systems: objective specification is hard, and we don’t want AI systems to rampantly disrupt their environment. In the limit of goal-directed intelligence, theorems suggest that seeking power tends to be optimal; we don’t want highly capable AI systems to permanently wrest control of the future from us.

Currently, impact regularization research focuses on two approaches:

Relative reachability: the AI preserves its ability to reach many kinds of world-states. The hope is that by staying able to reach many goal states, the AI stays able to reach the correct goal state.
Attainable utility preservation: the AI preserves its ability to achieve one or more auxiliary goals. The hope is that by penalizing gaining or losing control over the future, the AI doesn’t take control away from us.
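
The two approaches above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes the reachability scores and auxiliary Q-values have already been estimated, and the function names, the `lam` weight, and all the numbers are hypothetical. The relative reachability penalty counts only reductions in reachability relative to a baseline, and the attainable utility penalty is the average absolute change in auxiliary value relative to doing nothing.

```python
from typing import Sequence


def relative_reachability_penalty(
    reach_now: Sequence[float], reach_baseline: Sequence[float]
) -> float:
    """Average loss of reachability over states, compared with a baseline.

    Each entry is an (assumed, precomputed) reachability score in [0, 1] for
    one world-state. Only decreases are penalized; gains are ignored.
    """
    if len(reach_now) != len(reach_baseline) or not reach_now:
        raise ValueError("need two non-empty sequences of equal length")
    losses = [max(b - n, 0.0) for n, b in zip(reach_now, reach_baseline)]
    return sum(losses) / len(losses)


def aup_penalty(q_action: Sequence[float], q_noop: Sequence[float]) -> float:
    """Average absolute change in attainable auxiliary value versus inaction.

    q_action[i] and q_noop[i] are (assumed, precomputed) Q-values for the i-th
    auxiliary goal after taking the action and after doing nothing.
    Both gaining and losing ability are penalized.
    """
    if len(q_action) != len(q_noop) or not q_action:
        raise ValueError("need two non-empty sequences of equal length")
    return sum(abs(a - n) for a, n in zip(q_action, q_noop)) / len(q_action)


def penalized_reward(
    task_reward: float,
    q_action: Sequence[float],
    q_noop: Sequence[float],
    lam: float = 1.0,
) -> float:
    """Task reward minus lam times the attainable utility penalty."""
    return task_reward - lam * aup_penalty(q_action, q_noop)


# Toy numbers: walking carefully leaves the auxiliary values unchanged,
# while sprinting earns more task reward but destroys one auxiliary option.
careful = penalized_reward(1.0, q_action=[0.5, 0.75], q_noop=[0.5, 0.75], lam=2.0)
messy = penalized_reward(1.25, q_action=[0.0, 0.75], q_noop=[0.5, 0.75], lam=2.0)
print(careful, messy)  # 1.0 0.75
```

With these made-up values the messy policy has the higher raw task reward but the lower penalized reward, which is the behavior an impact regularizer is meant to produce. How to choose the baseline, the auxiliary goals, and the penalty weight is exactly what the posts listed below discuss.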

For a review of earlier work, see A Survey of Early Impact Measures.

Sequences on impact regularization:

Reframing Impact: we’re impacted when we become more or less able to achieve our goals. Seemingly, goal-directed AI systems are only incentivized to catastrophically impact us in order to gain power to achieve their own goals. To avoid catastrophic impact, what if we penalize the AI for gaining power?
Subagents and Impact Measures explores how subagents can circumvent current impact measure formalizations.

Related tags: Instrumental Convergence, Corrigibility, Mild Optimization.

Reframing Impact

TurnTrout

20 Sep 2019 19:03 UTC

98 points

16 comments, 1 min read, LW link, 1 review

Attainable Utility Preservation: Concepts

TurnTrout

17 Feb 2020 5:20 UTC

38 points

20 comments, 1 min read, LW link

Tradeoff between desirable properties for baseline choices in impact measures

Vika

4 Jul 2020 11:56 UTC

37 points

24 comments, 5 min read, LW link

Towards a New Impact Measure

TurnTrout

18 Sep 2018 17:21 UTC

103 points

159 comments, 33 min read, LW link, 2 reviews

Impact measurement and value-neutrality verification

evhub

15 Oct 2019 0:06 UTC

31 points

13 comments, 6 min read, LW link

[Question] Best reasons for pessimism about impact of impact measures?

TurnTrout

10 Apr 2019 17:22 UTC

60 points

55 comments, 3 min read, LW link

World State is the Wrong Abstraction for Impact

TurnTrout

1 Oct 2019 21:03 UTC

68 points

19 comments, 2 min read, LW link

The Catastrophic Convergence Conjecture

TurnTrout

14 Feb 2020 21:16 UTC

45 points

16 comments, 8 min read, LW link

Attainable Utility Landscape: How The World Is Changed

TurnTrout

10 Feb 2020 0:58 UTC

52 points

7 comments, 6 min read, LW link

Designing agent incentives to avoid side effects

Vika and TurnTrout

11 Mar 2019 20:55 UTC

29 points

0 comments, 2 min read, LW link

(medium.com)

Conclusion to ‘Reframing Impact’

TurnTrout

28 Feb 2020 16:05 UTC

47 points

18 comments, 2 min read, LW link

Deducing Impact

TurnTrout

24 Sep 2019 21:14 UTC

72 points

28 comments, 1 min read, LW link

Reasons for Excitement about Impact of Impact Measure Research

TurnTrout

27 Feb 2020 21:42 UTC

33 points

8 comments, 4 min read, LW link

Worrying about the Vase: Whitelisting

TurnTrout

16 Jun 2018 2:17 UTC

73 points

26 comments, 11 min read, LW link

How Low Should Fruit Hang Before We Pick It?

TurnTrout

25 Feb 2020 2:08 UTC

28 points

9 comments, 12 min read, LW link

The Gears of Impact

TurnTrout

7 Oct 2019 14:44 UTC

54 points

16 comments, 1 min read, LW link

Attainable Utility Theory: Why Things Matter

TurnTrout

27 Sep 2019 16:48 UTC

73 points

24 comments, 1 min read, LW link

Attainable Utility Preservation: Empirical Results

TurnTrout and nealeratzlaff

22 Feb 2020 0:38 UTC

66 points

8 comments, 10 min read, LW link, 1 review

Attainable Utility Preservation: Scaling to Superhuman

TurnTrout

27 Feb 2020 0:52 UTC

28 points

22 comments, 8 min read, LW link

Value Impact

TurnTrout

23 Sep 2019 0:47 UTC

70 points

10 comments, 1 min read, LW link

AXRP Episode 11 - Attainable Utility and Power with Alex Turner

DanielFilan

25 Sep 2021 21:10 UTC

19 points

5 comments, 53 min read, LW link

Learning preferences by looking at the world

Rohin Shah

12 Feb 2019 22:25 UTC

43 points

10 comments, 7 min read, LW link

(bair.berkeley.edu)

Requirements for a STEM-capable AGI Value Learner (my Case for Less Doom)

RogerDearnaley

25 May 2023 9:26 UTC

35 points

3 comments, 15 min read, LW link

Four Ways An Impact Measure Could Help Alignment

Matthew Barnett

8 Aug 2019 0:10 UTC

21 points

1 comment, 9 min read, LW link

Impact Measure Desiderata

TurnTrout

2 Sep 2018 22:21 UTC

36 points

41 comments, 5 min read, LW link

A Survey of Early Impact Measures

Matthew Barnett

6 Aug 2019 1:22 UTC

29 points

0 comments, 8 min read, LW link

[Question] Could there be “natural impact regularization” or “impact regularization by default”?

tailcalled

1 Dec 2023 22:01 UTC

28 points

6 comments, 1 min read, LW link

Appendix: mathematics of indexical impact measures

Stuart_Armstrong

17 Feb 2020 13:22 UTC

12 points

0 comments, 4 min read, LW link

AXRP Episode 7 - Side Effects with Victoria Krakovna

DanielFilan

14 May 2021 3:50 UTC

34 points

6 comments, 43 min read, LW link

Alex Turner’s Research, Comprehensive Information Gathering

adamShimi

23 Jun 2021 9:44 UTC

15 points

3 comments, 3 min read, LW link

Avoiding Side Effects in Complex Environments

TurnTrout and nealeratzlaff

12 Dec 2020 0:34 UTC

62 points

12 comments, 2 min read, LW link

(avoiding-side-effects.github.io)

Reversible changes: consider a bucket of water

Stuart_Armstrong

26 Aug 2019 22:55 UTC

25 points

18 comments, 2 min read, LW link

A Critique of Non-Obstruction

Joe Collman

3 Feb 2021 8:45 UTC

13 points

9 comments, 4 min read, LW link

Appendix: how a subagent could get powerful

Stuart_Armstrong

28 Jan 2020 15:28 UTC

53 points

14 comments, 4 min read, LW link

Test Cases for Impact Regularisation Methods

DanielFilan

6 Feb 2019 21:50 UTC

72 points

5 comments, 13 min read, LW link

(danielfilan.com)

Overcoming Clinginess in Impact Measures

TurnTrout

30 Jun 2018 22:51 UTC

30 points

9 comments, 7 min read, LW link

Subagents and impact measures, full and fully illustrated

Stuart_Armstrong

24 Feb 2020 13:12 UTC

31 points

14 comments, 17 min read, LW link

Understanding Recent Impact Measures

Matthew Barnett

7 Aug 2019 4:57 UTC

16 points

6 comments, 7 min read, LW link

Why is the impact penalty time-inconsistent?

Stuart_Armstrong

9 Jul 2020 17:26 UTC

16 points

1 comment, 2 min read, LW link

AI Alignment 2018-19 Review

Rohin Shah

28 Jan 2020 2:19 UTC

126 points

6 comments, 35 min read, LW link

Dynamic inconsistency of the inaction and initial state baseline

Stuart_Armstrong

7 Jul 2020 12:02 UTC

30 points

8 comments, 2 min read, LW link

Penalizing Impact via Attainable Utility Preservation

TurnTrout

28 Dec 2018 21:46 UTC

20 points

0 comments, 3 min read, LW link

(arxiv.org)

Hedonic Loops and Taming RL

beren

19 Jul 2023 15:12 UTC

20 points

14 comments, 9 min read, LW link

Announcement: AI alignment prize round 4 winners

cousin_it

20 Jan 2019 14:46 UTC

74 points

41 comments, 1 min read, LW link

SAAP: Is Deliberate Structural Inefficiency the Inevitable Cost of AGI Alignment?

Articus19

30 Nov 2025 17:45 UTC

1 point

0 comments, 1 min read, LW link

[AN #68]: The attainable utility theory of impact

Rohin Shah

14 Oct 2019 17:00 UTC

17 points

0 comments, 8 min read, LW link

(mailchi.mp)

Yudkowsky on AGI ethics

Rob Bensinger

19 Oct 2017 23:13 UTC

73 points

5 comments, 2 min read, LW link

Asymptotically Unambitious AGI

michaelcohen

10 Apr 2020 12:31 UTC

50 points

217 comments, 2 min read, LW link

Research agenda for training aligned AIs using concave utility functions following the principles of homeostasis and diminishing returns

Roland Pihlakas

28 Dec 2025 21:53 UTC

14 points

0 comments, 8 min read, LW link

A Block-Based Regularization Proposal for Neural Networks

Otto.Dev

19 Apr 2025 18:56 UTC

−8 points

0 comments, 1 min read, LW link

SAAP: A Normative AGI Architecture for Safety using Dual-Process Control and Human Sovereignty

Articus19

30 Nov 2025 17:58 UTC

1 point

0 comments, 1 min read, LW link

[Question] “Do Nothing” utility function, 3½ years later?

niplav

20 Jul 2020 11:09 UTC

5 points

3 comments, 1 min read, LW link

Simplified preferences needed; simplified preferences sufficient

Stuart_Armstrong

5 Mar 2019 19:39 UTC

33 points

6 comments, 3 min read, LW link

Open Problems in Negative Side Effect Minimization

Fabian Schimpf and Lukas Fluri

6 May 2022 9:37 UTC

12 points

6 comments, 17 min read, LW link

Why modelling multi-objective homeostasis is essential for AI alignment (and how it helps with AI safety as well). Subtleties and open challenges.

Roland Pihlakas

12 Jan 2025 3:37 UTC

48 points

7 comments, 15 min read, LW link
