Inverse Reinforcement Learning

Tag · Last edit: 30 Dec 2024 9:58 UTC by Dakara

Inverse Reinforcement Learning (IRL) is a machine learning technique in which an AI system infers the preferences or objectives of an agent, typically a human, by observing that agent's behavior. In standard Reinforcement Learning (RL), an agent is given a reward function and learns to act so as to maximize it; IRL runs in the opposite direction, inferring the underlying reward function from demonstrated behavior.

In other words, IRL aims to recover an agent's motivations and goals from the actions it takes in various situations. Once the AI system has inferred a reward function, it can use that function to make decisions that align with the preferences or objectives of the observed agent.
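As a rough illustration of the idea, the sketch below infers which of two candidate reward functions best explains a few observed state-action pairs. Everything in it is an assumption made for the example rather than something taken from the posts listed here: the five-state corridor, the two hand-written candidate rewards, the Boltzmann-rational demonstrator with rationality parameter `BETA`, and the uniform prior over candidates. It is closer in spirit to Bayesian IRL over a tiny hypothesis set than to any particular published algorithm.

```python
import math

# Toy setting (all assumed for illustration): a corridor of 5 states,
# two actions (left = -1, right = +1), deterministic moves, walls at both ends.
N_STATES = 5
ACTIONS = (-1, +1)
GAMMA = 0.9   # discount factor
BETA = 5.0    # rationality of the demonstrator (higher = closer to optimal)


def step(state, action):
    """Deterministic transition that clips at the corridor ends."""
    return min(max(state + action, 0), N_STATES - 1)


def q_values(reward, iters=100):
    """Value iteration; the reward is received on arriving in the next state."""
    v = [0.0] * N_STATES
    q = [[0.0 for _ in ACTIONS] for _ in range(N_STATES)]
    for _ in range(iters):
        q = [[reward[step(s, a)] + GAMMA * v[step(s, a)] for a in ACTIONS]
             for s in range(N_STATES)]
        v = [max(row) for row in q]
    return q


def log_likelihood(demos, reward):
    """Log-probability of the demonstrations under a Boltzmann-rational agent."""
    q = q_values(reward)
    total = 0.0
    for s, a in demos:
        logits = [BETA * x for x in q[s]]
        m = max(logits)
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        total += logits[ACTIONS.index(a)] - log_z
    return total


def posterior(demos, candidates):
    """Posterior over candidate rewards, assuming a uniform prior."""
    logs = {name: log_likelihood(demos, r) for name, r in candidates.items()}
    m = max(logs.values())
    weights = {name: math.exp(v - m) for name, v in logs.items()}
    z = sum(weights.values())
    return {name: w / z for name, w in weights.items()}


# Two hypothetical reward functions: the goal is at the left or the right end.
candidates = {
    "goal_left": [1.0, 0.0, 0.0, 0.0, 0.0],
    "goal_right": [0.0, 0.0, 0.0, 0.0, 1.0],
}

# Observed behavior: the agent keeps moving right.
demos = [(1, +1), (2, +1), (3, +1)]

post = posterior(demos, candidates)
print(post)  # nearly all of the probability mass lands on "goal_right"
```

Restricting the search to a handful of candidate rewards sidesteps the fact that many different reward functions can explain the same behavior; with a richer hypothesis space the posterior would typically stay spread over many of them.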

IRL is particularly relevant to AI alignment because it offers a potential approach to aligning AI systems with human values. By learning from human demonstrations, AI systems could be designed to better understand and respect the preferences, intentions, and values of the humans they interact with or serve.

Thoughts on “Human-Compatible”

TurnTrout · 10 Oct 2019 5:24 UTC

64 points

34 comments · 5 min read · LW link

Model Mis-specification and Inverse Reinforcement Learning

Owain_Evans and jsteinhardt

9 Nov 2018 15:33 UTC

34 points

3 comments · 16 min read · LW link

Our take on CHAI’s research agenda in under 1500 words

Alex Flint · 17 Jun 2020 12:24 UTC

113 points

18 comments · 5 min read · LW link

Learning biases and rewards simultaneously

Rohin Shah · 6 Jul 2019 1:45 UTC

41 points

3 comments · 4 min read · LW link

A Survey of Foundational Methods in Inverse Reinforcement Learning

adamk · 1 Sep 2022 18:21 UTC

28 points

0 comments · 12 min read · LW link

Delegative Inverse Reinforcement Learning

Vanessa Kosoy · 12 Jul 2017 12:18 UTC

15 points

13 comments · 16 min read · LW link

Value Learning Needs a Low-Dimensional Bottleneck

Gunnar_Zarncke · 23 Jan 2026 2:12 UTC

24 points

7 comments · 1 min read · LW link

Biased reward-learning in CIRL

Stuart_Armstrong · 5 Jan 2018 18:12 UTC

8 points

3 comments · 7 min read · LW link

AXRP Episode 8 - Assistance Games with Dylan Hadfield-Menell

DanielFilan · 8 Jun 2021 23:20 UTC

22 points

1 comment · 72 min read · LW link

IRL 1/8: Inverse Reinforcement Learning and the problem of degeneracy

RAISE · 4 Mar 2019 13:11 UTC

20 points

2 comments · 1 min read · LW link

(app.grasple.com)

CIRL Wireheading

tom4everitt · 8 Aug 2017 6:33 UTC

3 points

4 comments · 2 min read · LW link

Problems integrating decision theory and inverse reinforcement learning

agilecaveman · 8 May 2018 5:11 UTC

7 points

2 comments · 3 min read · LW link

[Question] Can coherent extrapolated volition be estimated with Inverse Reinforcement Learning?

Jade Bishop · 15 Apr 2019 3:23 UTC

12 points

5 comments · 3 min read · LW link

[Question] Is CIRL a promising agenda?

Chris_Leong · 23 Jun 2022 17:12 UTC

28 points

17 comments · 1 min read · LW link

Inverse reinforcement learning on self, pre-ontology-change

Stuart_Armstrong · 18 Nov 2015 13:23 UTC

0 points

2 comments · 1 min read · LW link

My take on Michael Littman on “The HCI of HAI”

Alex Flint · 2 Apr 2021 19:51 UTC

59 points

4 comments · 7 min read · LW link

Cooperative Inverse Reinforcement Learning vs. Irrational Human Preferences

orthonormal · 18 Jun 2016 0:55 UTC

17 points

2 comments · 3 min read · LW link

Book Review: Human Compatible

Scott Alexander · 31 Jan 2020 5:20 UTC

78 points

6 comments · 16 min read · LW link

(slatestarcodex.com)

(C)IRL is not solely a learning process

Stuart_Armstrong · 15 Sep 2016 8:35 UTC

1 point

29 comments · 3 min read · LW link

Humans can be assigned any values whatsoever...

Stuart_Armstrong · 24 Oct 2017 12:03 UTC

3 points

1 comment · 4 min read · LW link

Book review: Human Compatible

PeterMcCluskey · 19 Jan 2020 3:32 UTC

37 points

2 comments · 5 min read · LW link

(www.bayesianinvestor.com)

Unsupervised Agent Discovery

Gunnar_Zarncke · 22 Dec 2025 22:01 UTC

32 points

0 comments · 6 min read · LW link

AXRP Episode 2 - Learning Human Biases with Rohin Shah

DanielFilan · 29 Dec 2020 20:43 UTC

13 points

0 comments · 35 min read · LW link

[Linkpost] Concept Alignment as a Prerequisite for Value Alignment

Bogdan Ionut Cirstea · 4 Nov 2023 17:34 UTC

27 points

0 comments · 1 min read · LW link

(arxiv.org)

# Emotion Is Structure: Toward Recursive Alignment Through Human–AI Co-Creation

thesignalthatcouldntbeheard · 3 Aug 2025 5:19 UTC

1 point

0 comments · 3 min read · LW link

The Theoretical Reward Learning Research Agenda: Introduction and Motivation

Joar Skalse · 28 Feb 2025 19:20 UTC

31 points

5 comments · 14 min read · LW link

Why do we need RLHF? Imitation, Inverse RL, and the role of reward

Ran W · 3 Feb 2024 4:00 UTC

16 points

0 comments · 5 min read · LW link

ACI#9: What is Intelligence

Akira Pyinya · 9 Dec 2024 21:54 UTC

3 points

0 comments · 8 min read · LW link

Humans can be assigned any values whatsoever...

Stuart_Armstrong · 13 Oct 2017 11:29 UTC

16 points

6 comments · 4 min read · LW link

Hardcode the AGI to need our approval indefinitely?

MichaelStJules · 11 Nov 2021 7:04 UTC

2 points

2 comments · 1 min read · LW link

Human-AI Collaboration

Rohin Shah · 22 Oct 2019 6:32 UTC

42 points

7 comments · 2 min read · LW link

(bair.berkeley.edu)

Machines vs Memes Part 3: Imitation and Memes

ceru23 · 1 Jun 2022 13:36 UTC

7 points

0 comments · 7 min read · LW link

Defining and Characterising Reward Hacking

Joar Skalse · 28 Feb 2025 19:25 UTC

15 points

0 comments · 4 min read · LW link

Misspecification in Inverse Reinforcement Learning

Joar Skalse · 28 Feb 2025 19:24 UTC

19 points

0 comments · 11 min read · LW link

RAISE is launching their MVP

null · 26 Feb 2019 11:45 UTC

67 points

1 comment · 1 min read · LW link

How to Contribute to Theoretical Reward Learning Research

Joar Skalse · 28 Feb 2025 19:27 UTC

17 points

1 comment · 21 min read · LW link

On Learning, Longing, and All the Things We Cannot Name

Hanyuan (Blake) Jiang · 24 Feb 2026 7:39 UTC

1 point

0 comments · 24 min read · LW link

Agents That Learn From Human Behavior Can’t Learn Human Values That Humans Haven’t Learned Yet

steven0461 · 11 Jul 2018 2:59 UTC

29 points

11 comments · 1 min read · LW link

Data for IRL: What is needed to learn human values?

Jan Wehner · 3 Oct 2022 9:23 UTC

18 points

6 comments · 12 min read · LW link

Partial Identifiability in Reward Learning

Joar Skalse · 28 Feb 2025 19:23 UTC

16 points

0 comments · 12 min read · LW link

Other Papers About the Theory of Reward Learning

Joar Skalse · 28 Feb 2025 19:26 UTC

16 points

0 comments · 5 min read · LW link

Misspecification in Inverse Reinforcement Learning—Part II

Joar Skalse · 28 Feb 2025 19:24 UTC

9 points

0 comments · 7 min read · LW link
