
Reinforcement Learning

Last edit: 26 Nov 2021 14:17 UTC by Multicore

Within the field of machine learning, reinforcement learning is the study of how an agent should choose its actions within an environment in order to maximize some notion of reward. Strongly inspired by work in behavioral psychology, it is essentially a trial-and-error approach to finding the best strategy.

Related: Inverse Reinforcement Learning, Machine learning, Friendly AI, Game Theory, Prediction

Consider an agent that receives an input informing it of the environment’s state. Based only on that information, the agent must decide which action to take from a set of options. The chosen action changes the state of the environment, producing a new input, and so on; at each step the environment also presents the agent with a reward reflecting the consequences of its actions. The agent’s goal is to find the strategy that maximizes expected reward over time, based on previous experience.
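This observe-act-reward loop can be sketched in a few lines of code. This is a minimal illustration rather than any particular library's API; the `CoinFlipEnv` toy environment and the `run_episode` helper are hypothetical names invented for the example.

```python
import random

class CoinFlipEnv:
    """Toy environment with a single state and two actions:
    action 1 pays reward 1 with probability 0.8, action 0 with probability 0.2."""
    def __init__(self, seed=0):
        self.rng = random.Random(seed)

    def step(self, action):
        p = 0.8 if action == 1 else 0.2
        reward = 1.0 if self.rng.random() < p else 0.0
        return 0, reward  # (next state, reward); only one state here

def run_episode(env, policy, steps=100):
    """The agent-environment loop: observe state, act, receive reward, repeat."""
    state, total = 0, 0.0
    for _ in range(steps):
        action = policy(state)          # decide based only on the current state
        state, reward = env.step(action)  # the action changes the environment
        total += reward
    return total

good = run_episode(CoinFlipEnv(seed=0), lambda s: 1)  # always pick the better action
bad = run_episode(CoinFlipEnv(seed=0), lambda s: 0)   # always pick the worse action
```

A real policy would, of course, be learned from the reward signal rather than fixed in advance; the sketch only shows the shape of the interaction.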

Exploration and Optimization

Since randomly selecting actions leads to poor performance, one of the biggest problems in reinforcement learning is exploring the available set of actions so as to avoid getting stuck in sub-optimal choices and move on to better ones.

This is the problem of exploration, best illustrated by the most studied reinforcement learning problem: the k-armed bandit. In it, an agent must decide which sequence of levers to pull in a gambling room, with no information about each machine’s probability of paying out beyond the rewards it actually receives. The problem revolves around deciding which lever is optimal and what criteria define it as such.
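One simple and widely used strategy for this problem is epsilon-greedy: with probability epsilon pull a random lever (explore), otherwise pull the lever with the best estimated payoff so far (exploit). A minimal sketch, assuming Bernoulli (win/lose) payouts; the function name and parameters are chosen for the example:

```python
import random

def epsilon_greedy_bandit(true_probs, steps=10000, epsilon=0.1, seed=0):
    """Epsilon-greedy agent for the k-armed Bernoulli bandit.
    Maintains a running-average reward estimate Q[a] for each lever."""
    rng = random.Random(seed)
    k = len(true_probs)
    Q = [0.0] * k   # estimated value of each lever
    N = [0] * k     # number of times each lever was pulled
    for _ in range(steps):
        if rng.random() < epsilon:              # explore: random lever
            a = rng.randrange(k)
        else:                                   # exploit: best current estimate
            a = max(range(k), key=lambda i: Q[i])
        reward = 1.0 if rng.random() < true_probs[a] else 0.0
        N[a] += 1
        Q[a] += (reward - Q[a]) / N[a]          # incremental average update
    return Q, N

Q, N = epsilon_greedy_bandit([0.2, 0.5, 0.8])
```

With enough pulls, the estimates converge toward the true payout probabilities, and the agent spends most of its pulls on the best lever while still occasionally sampling the others.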

In parallel with exploration, it is still necessary to choose the criteria that make one action better than another. The study of this question has produced several methods, from brute-force search to approaches based on temporal differences in the received reward. Despite this, and despite the strong results reinforcement learning methods have obtained on small problems, the field suffers from a lack of scalability and has difficulty with larger, closer-to-human scenarios.
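To make the temporal-difference idea concrete, here is a minimal tabular Q-learning sketch (one standard TD method, not the only one) on a hypothetical five-state corridor where only reaching the right end pays a reward; all names and parameter values are illustrative:

```python
import random

def q_learning_chain(n_states=5, episodes=500, alpha=0.5, gamma=0.9,
                     epsilon=0.1, seed=0):
    """Tabular Q-learning on a chain MDP: states 0..n-1, actions left(0)/right(1);
    only reaching the rightmost state gives reward 1, ending the episode."""
    rng = random.Random(seed)
    Q = [[0.0, 0.0] for _ in range(n_states)]
    for _ in range(episodes):
        s = 0
        while s != n_states - 1:
            if rng.random() < epsilon:
                a = rng.randrange(2)                       # explore
            else:
                a = max((1, 0), key=lambda x: Q[s][x])     # greedy, ties -> right
            s2 = max(0, s - 1) if a == 0 else s + 1
            r = 1.0 if s2 == n_states - 1 else 0.0
            # temporal-difference update: nudge Q(s,a) toward r + gamma * max_a' Q(s',a')
            Q[s][a] += alpha * (r + gamma * max(Q[s2]) - Q[s][a])
            s = s2
    return Q

Q = q_learning_chain()
policy = [max((1, 0), key=lambda a: Q[s][a]) for s in range(5)]
```

The agent never sees the reward structure directly; the difference between successive value estimates propagates the distant reward backwards through the chain, so states closer to the goal end up with higher values.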

Further Reading & References

See Also

Reward is not the optimization target

TurnTrout · 25 Jul 2022 0:03 UTC
245 points
96 comments · 10 min read · LW link

Draft papers for REALab and Decoupled Approval on tampering

28 Oct 2020 16:01 UTC
47 points
2 comments · 1 min read · LW link

EfficientZero: How It Works

1a3orn · 26 Nov 2021 15:17 UTC
271 points
42 comments · 29 min read · LW link

Remaking EfficientZero (as best I can)

Hoagy · 4 Jul 2022 11:03 UTC
33 points
9 comments · 22 min read · LW link

Book Review: Reinforcement Learning by Sutton and Barto

billmei · 20 Oct 2020 19:40 UTC
52 points
3 comments · 10 min read · LW link

Jitters No Evidence of Stupidity in RL

1a3orn · 16 Sep 2021 22:43 UTC
82 points
18 comments · 3 min read · LW link

Reinforcement Learning in the Iterated Amplification Framework

William_S · 9 Feb 2019 0:56 UTC
25 points
12 comments · 4 min read · LW link

Reinforcement learning with imperceptible rewards

Vanessa Kosoy · 7 Apr 2019 10:27 UTC
26 points
1 comment · 29 min read · LW link

Reinforcement Learning: A Non-Standard Introduction (Part 1)

royf · 29 Jul 2012 0:13 UTC
33 points
19 comments · 2 min read · LW link

Reinforcement, Preference and Utility

royf · 8 Aug 2012 6:23 UTC
13 points
5 comments · 3 min read · LW link

Reinforcement Learning: A Non-Standard Introduction (Part 2)

royf · 2 Aug 2012 8:17 UTC
16 points
7 comments · 3 min read · LW link

Applying reinforcement learning theory to reduce felt temporal distance

Kaj_Sotala · 26 Jan 2014 9:17 UTC
19 points
6 comments · 3 min read · LW link

Imitative Reinforcement Learning as an AGI Approach

TIMUR ZEKI VURAL · 21 May 2018 14:47 UTC
1 point
1 comment · 1 min read · LW link

Delegative Reinforcement Learning with a Merely Sane Advisor

Vanessa Kosoy · 5 Oct 2017 14:15 UTC
1 point
2 comments · 14 min read · LW link

Inverse reinforcement learning on self, pre-ontology-change

Stuart_Armstrong · 18 Nov 2015 13:23 UTC
0 points
0 comments · 1 min read · LW link

Clarification: Behaviourism & Reinforcement

Zaine · 10 Oct 2012 5:30 UTC
13 points
30 comments · 2 min read · LW link

Is Global Reinforcement Learning (RL) a Fantasy?

[deleted] · 31 Oct 2016 1:49 UTC
5 points
50 comments · 12 min read · LW link

Delegative Inverse Reinforcement Learning

Vanessa Kosoy · 12 Jul 2017 12:18 UTC
15 points
0 comments · 16 min read · LW link

Vector-Valued Reinforcement Learning

orthonormal · 1 Nov 2016 0:21 UTC
2 points
0 comments · 4 min read · LW link

Cooperative Inverse Reinforcement Learning vs. Irrational Human Preferences

orthonormal · 18 Jun 2016 0:55 UTC
14 points
0 comments · 3 min read · LW link

Evolution as Backstop for Reinforcement Learning: multi-level paradigms

gwern · 12 Jan 2019 17:45 UTC
19 points
0 comments · 1 min read · LW link
(www.gwern.net)

IRL 1/8: Inverse Reinforcement Learning and the problem of degeneracy

RAISE · 4 Mar 2019 13:11 UTC
20 points
2 comments · 1 min read · LW link
(app.grasple.com)

Keeping up with deep reinforcement learning research: /r/reinforcementlearning

gwern · 16 May 2017 19:12 UTC
6 points
2 comments · 1 min read · LW link
(www.reddit.com)

psychology and applications of reinforcement learning: where do I learn more?

jsalvatier · 26 Jun 2011 20:56 UTC
5 points
1 comment · 1 min read · LW link

Reward/value learning for reinforcement learning

Stuart_Armstrong · 2 Jun 2017 16:34 UTC
0 points
0 comments · 2 min read · LW link

Model Mis-specification and Inverse Reinforcement Learning

9 Nov 2018 15:33 UTC
33 points
3 comments · 16 min read · LW link

“Human-level control through deep reinforcement learning”—computer learns 49 different games

skeptical_lurker · 26 Feb 2015 6:21 UTC
19 points
19 comments · 1 min read · LW link

Making a Difference Tempore: Insights from ‘Reinforcement Learning: An Introduction’

TurnTrout · 5 Jul 2018 0:34 UTC
33 points
6 comments · 8 min read · LW link

Sufficiently Advanced Language Models Can Do Reinforcement Learning

Zachary Robertson · 2 Aug 2020 15:32 UTC
21 points
7 comments · 7 min read · LW link

Problems integrating decision theory and inverse reinforcement learning

agilecaveman · 8 May 2018 5:11 UTC
7 points
2 comments · 3 min read · LW link

“AIXIjs: A Software Demo for General Reinforcement Learning”, Aslanides 2017

gwern · 29 May 2017 21:09 UTC
7 points
1 comment · 1 min read · LW link
(arxiv.org)

Some work on connecting UDT and Reinforcement Learning

IAFF-User-111 · 17 Dec 2015 23:58 UTC
4 points
0 comments · 1 min read · LW link
(drive.google.com)

[Question] What problem would you like to see Reinforcement Learning applied to?

Julian Schrittwieser · 8 Jul 2020 2:40 UTC
43 points
4 comments · 1 min read · LW link

[Question] What messy problems do you see Deep Reinforcement Learning applicable to?

Riccardo Volpato · 5 Apr 2020 17:43 UTC
5 points
0 comments · 1 min read · LW link

[Question] Can coherent extrapolated volition be estimated with Inverse Reinforcement Learning?

Jade Bishop · 15 Apr 2019 3:23 UTC
12 points
5 comments · 3 min read · LW link

Modeling the capabilities of advanced AI systems as episodic reinforcement learning

jessicata · 19 Aug 2016 2:52 UTC
4 points
0 comments · 5 min read · LW link

FHI is accepting applications for internships in the area of AI Safety and Reinforcement Learning

crmflynn · 7 Nov 2016 16:33 UTC
9 points
0 comments · 1 min read · LW link
(www.fhi.ox.ac.uk)

AXRP Episode 1 - Adversarial Policies with Adam Gleave

DanielFilan · 29 Dec 2020 20:41 UTC
12 points
5 comments · 33 min read · LW link

AXRP Episode 3 - Negotiable Reinforcement Learning with Andrew Critch

DanielFilan · 29 Dec 2020 20:45 UTC
26 points
0 comments · 27 min read · LW link

Multi-dimensional rewards for AGI interpretability and control

Steven Byrnes · 4 Jan 2021 3:08 UTC
19 points
8 comments · 10 min read · LW link

Is RL involved in sensory processing?

Steven Byrnes · 18 Mar 2021 13:57 UTC
21 points
21 comments · 5 min read · LW link

My AGI Threat Model: Misaligned Model-Based RL Agent

Steven Byrnes · 25 Mar 2021 13:45 UTC
66 points
40 comments · 16 min read · LW link

My take on Michael Littman on “The HCI of HAI”

Alex Flint · 2 Apr 2021 19:51 UTC
59 points
4 comments · 7 min read · LW link

Notes from “Don’t Shoot the Dog”

juliawise · 2 Apr 2021 16:34 UTC
212 points
10 comments · 12 min read · LW link

Big picture of phasic dopamine

Steven Byrnes · 8 Jun 2021 13:07 UTC
59 points
18 comments · 36 min read · LW link

Supplement to “Big picture of phasic dopamine”

Steven Byrnes · 8 Jun 2021 13:08 UTC
13 points
2 comments · 9 min read · LW link

Reward Is Not Enough

Steven Byrnes · 16 Jun 2021 13:52 UTC
105 points
18 comments · 10 min read · LW link

A model of decision-making in the brain (the short version)

Steven Byrnes · 18 Jul 2021 14:39 UTC
20 points
0 comments · 3 min read · LW link

DeepMind: Generally capable agents emerge from open-ended play

Daniel Kokotajlo · 27 Jul 2021 14:19 UTC
247 points
53 comments · 2 min read · LW link
(deepmind.com)

Training My Friend to Cook

lsusr · 29 Aug 2021 5:54 UTC
68 points
33 comments · 3 min read · LW link

Multi-Agent Inverse Reinforcement Learning: Suboptimal Demonstrations and Alternative Solution Concepts

sage_bergerson · 7 Sep 2021 16:11 UTC
5 points
0 comments · 1 min read · LW link

My take on Vanessa Kosoy’s take on AGI safety

Steven Byrnes · 30 Sep 2021 12:23 UTC
84 points
10 comments · 31 min read · LW link

Scalar reward is not enough for aligned AGI

Peter Vamplew · 17 Jan 2022 21:02 UTC
15 points
3 comments · 11 min read · LW link

Emotions = Reward Functions

jpyykko · 20 Jan 2022 18:46 UTC
16 points
10 comments · 5 min read · LW link

[Intro to brain-like-AGI safety] 5. The “long-term predictor”, and TD learning

Steven Byrnes · 23 Feb 2022 14:44 UTC
41 points
25 comments · 21 min read · LW link

[Intro to brain-like-AGI safety] 6. Big picture of motivation, decision-making, and RL

Steven Byrnes · 2 Mar 2022 15:26 UTC
41 points
13 comments · 16 min read · LW link

RLHF

Ansh Radhakrishnan · 12 May 2022 21:18 UTC
16 points
5 comments · 5 min read · LW link

Shard Theory: An Overview

David Udell · 11 Aug 2022 5:44 UTC
129 points
34 comments · 10 min read · LW link

AlphaStar: Impressive for RL progress, not for AGI progress

orthonormal · 2 Nov 2019 1:50 UTC
113 points
58 comments · 2 min read · LW link · 1 review

[Question] How is reinforcement learning possible in non-sentient agents?

SomeoneKind · 5 Jan 2021 20:57 UTC
3 points
5 comments · 1 min read · LW link

Which animals can suffer?

Just Learning · 1 Jun 2021 3:42 UTC
7 points
16 comments · 1 min read · LW link

Instrumental Convergence: Power as Rademacher Complexity

Zachary Robertson · 12 Aug 2021 16:02 UTC
6 points
0 comments · 3 min read · LW link

Extraction of human preferences 👨→🤖

arunraja-hub · 24 Aug 2021 16:34 UTC
18 points
2 comments · 5 min read · LW link

A brief review of the reasons multi-objective RL could be important in AI Safety Research

Ben Smith · 29 Sep 2021 17:09 UTC
27 points
7 comments · 10 min read · LW link

Proposal: Scaling laws for RL generalization

axioman · 1 Oct 2021 21:32 UTC
14 points
10 comments · 11 min read · LW link

[Proposal] Method of locating useful subnets in large models

Quintin Pope · 13 Oct 2021 20:52 UTC
9 points
0 comments · 2 min read · LW link

EfficientZero: human ALE sample-efficiency w/MuZero+self-supervised

gwern · 2 Nov 2021 2:32 UTC
134 points
52 comments · 1 min read · LW link
(arxiv.org)

Behavior Cloning is Miscalibrated

leogao · 5 Dec 2021 1:36 UTC
52 points
3 comments · 3 min read · LW link

Demanding and Designing Aligned Cognitive Architectures

Koen.Holtman · 21 Dec 2021 17:32 UTC
8 points
5 comments · 5 min read · LW link

Reinforcement Learning Study Group

Kay Kozaronek · 26 Dec 2021 23:11 UTC
20 points
9 comments · 1 min read · LW link

Question 1: Predicted architecture of AGI learning algorithm(s)

Cameron Berg · 10 Feb 2022 17:22 UTC
12 points
1 comment · 7 min read · LW link

[Question] What is a training “step” vs. “episode” in machine learning?

Evan R. Murphy · 28 Apr 2022 21:53 UTC
9 points
4 comments · 1 min read · LW link

Open Problems in Negative Side Effect Minimization

6 May 2022 9:37 UTC
12 points
7 comments · 17 min read · LW link

RL with KL penalties is better seen as Bayesian inference

25 May 2022 9:23 UTC
90 points
15 comments · 12 min read · LW link

Machines vs Memes Part 1: AI Alignment and Memetics

Harriet Farlow · 31 May 2022 22:03 UTC
16 points
0 comments · 6 min read · LW link

Machines vs Memes Part 3: Imitation and Memes

ceru23 · 1 Jun 2022 13:36 UTC
5 points
0 comments · 7 min read · LW link

[Link] OpenAI: Learning to Play Minecraft with Video PreTraining (VPT)

Aryeh Englander · 23 Jun 2022 16:29 UTC
53 points
3 comments · 1 min read · LW link

Reinforcement Learner Wireheading

Nate Showell · 8 Jul 2022 5:32 UTC
8 points
2 comments · 4 min read · LW link

Reinforcement Learning Goal Misgeneralization: Can we guess what kind of goals are selected by default?

25 Oct 2022 20:48 UTC
9 points
1 comment · 4 min read · LW link

Conditioning, Prompts, and Fine-Tuning

Adam Jermyn · 17 Aug 2022 20:52 UTC
31 points
9 comments · 4 min read · LW link

Deep Q-Networks Explained

Jay Bailey · 13 Sep 2022 12:01 UTC
37 points
4 comments · 22 min read · LW link

Leveraging Legal Informatics to Align AI

John Nay · 18 Sep 2022 20:39 UTC
11 points
0 comments · 3 min read · LW link
(forum.effectivealtruism.org)

Towards deconfusing wireheading and reward maximization

leogao · 21 Sep 2022 0:36 UTC
69 points
7 comments · 4 min read · LW link

Reward IS the Optimization Target

Carn · 28 Sep 2022 17:59 UTC
−1 points
3 comments · 5 min read · LW link

[Question] What Is the Idea Behind (Un-)Supervised Learning and Reinforcement Learning?

Morpheus · 30 Sep 2022 16:48 UTC
9 points
6 comments · 2 min read · LW link

Instrumental convergence in single-agent systems

12 Oct 2022 12:24 UTC
27 points
4 comments · 8 min read · LW link
(www.gladstone.ai)

Misalignment-by-default in multi-agent systems

13 Oct 2022 15:38 UTC
17 points
8 comments · 20 min read · LW link
(www.gladstone.ai)

Instrumental convergence: scale and physical interactions

14 Oct 2022 15:50 UTC
15 points
0 comments · 17 min read · LW link
(www.gladstone.ai)

Learning societal values from law as part of an AGI alignment strategy

John Nay · 21 Oct 2022 2:03 UTC
3 points
18 comments · 54 min read · LW link

POWERplay: An open-source toolchain to study AI power-seeking

Edouard Harris · 24 Oct 2022 20:03 UTC
22 points
0 comments · 1 min read · LW link
(github.com)

AGIs may value intrinsic rewards more than extrinsic ones

catubc · 17 Nov 2022 21:49 UTC
8 points
6 comments · 4 min read · LW link

A Short Dialogue on the Meaning of Reward Functions

19 Nov 2022 21:04 UTC
40 points
0 comments · 3 min read · LW link

Human-level Full-Press Diplomacy (some bare facts).

strawberry calm · 22 Nov 2022 20:59 UTC
50 points
7 comments · 3 min read · LW link

Sets of objectives for a multi-objective RL agent to optimize

23 Nov 2022 6:49 UTC
4 points
0 comments · 8 min read · LW link