Reinforcement Learning

Tag
Last edit: 26 Nov 2021 14:17 UTC by Multicore

Within the field of machine learning, reinforcement learning refers to the study of how an agent should choose its actions within an environment in order to maximize some kind of reward. Strongly inspired by work in behavioral psychology, it is essentially a trial-and-error approach to finding the best strategy.

Related: Inverse Reinforcement Learning, Machine learning, Friendly AI, Game Theory, Prediction

Consider an agent that receives an input describing the environment’s state. Based only on that information, the agent has to choose one action from a set of available actions. That action changes the state of the environment, which produces a new input, and so on; at each step the agent is also given a reward for its action. The agent’s goal is then to find, based on previous experience, the strategy that yields the highest expected reward over time.
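The loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `CorridorEnv`, `run_episode`, `random_policy` and `always_right` are hypothetical names invented for this example, and the environment (a short corridor with a reward at the right end) is a toy, not something taken from the article.

```python
import random


class CorridorEnv:
    """Hypothetical toy environment: cells 0..length-1, reward at the right end."""

    def __init__(self, length=5):
        self.length = length
        self.state = 0

    def reset(self):
        """Put the agent back at the left end and return the initial state."""
        self.state = 0
        return self.state

    def step(self, action):
        """Apply an action (-1 = left, +1 = right); return (state, reward, done)."""
        # The position is clamped so the agent cannot leave the corridor.
        self.state = max(0, min(self.length - 1, self.state + action))
        done = self.state == self.length - 1
        reward = 1.0 if done else 0.0
        return self.state, reward, done


def run_episode(env, policy, max_steps=100):
    """Run one episode of the state -> action -> reward loop; return total reward."""
    state = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        action = policy(state)                  # the agent sees only the current state
        state, reward, done = env.step(action)  # the action changes the environment
        total_reward += reward
        if done:
            break
    return total_reward


def random_policy(state):
    """Ignore the state and move left or right at random."""
    return random.choice([-1, 1])


def always_right(state):
    """A fixed strategy that happens to be optimal in this corridor."""
    return 1


if __name__ == "__main__":
    print("always_right:", run_episode(CorridorEnv(), always_right))
    print("random_policy:", run_episode(CorridorEnv(), random_policy))
```

A policy here is just a function from state to action. Learning, in this framing, means replacing `random_policy` with something that improves from the rewards it has seen.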

Exploration and Optimization

Because selecting actions at random performs poorly, one of the biggest problems in reinforcement learning is how to explore the available set of actions so that the agent does not get stuck with sub-optimal choices and can move on to better ones.

This is the problem of exploration, which is best illustrated by the most studied reinforcement learning problem: the k-armed bandit. In it, an agent has to decide which sequence of levers to pull in a gambling room, with no information about each machine’s probability of winning other than the reward it receives on each pull. The problem revolves around deciding which lever is optimal and what criterion defines it as such.
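One common way to trade off exploration against exploitation in the k-armed bandit is the epsilon-greedy rule: with a small probability pull a random lever, otherwise pull the lever with the best estimated payoff so far. The sketch below assumes Bernoulli machines (each pull wins 1 or 0); the function name `epsilon_greedy_bandit` and its parameters are hypothetical, chosen for this example only.

```python
import random


def epsilon_greedy_bandit(win_probs, steps=5000, epsilon=0.1, seed=0):
    """Play a k-armed Bernoulli bandit with an epsilon-greedy strategy.

    win_probs holds the true win probability of each lever; the agent never
    reads it directly and only observes the reward of the lever it pulled.
    Returns (estimates, counts, total_reward).
    """
    rng = random.Random(seed)
    k = len(win_probs)
    counts = [0] * k        # how many times each lever was pulled
    estimates = [0.0] * k   # running average reward of each lever
    total_reward = 0.0

    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(k)  # explore: try a random lever
        else:
            # exploit: pull the lever with the highest estimate (first one on ties)
            arm = max(range(k), key=lambda a: estimates[a])

        reward = 1.0 if rng.random() < win_probs[arm] else 0.0

        # Incremental update of the sample mean for the pulled lever.
        counts[arm] += 1
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
        total_reward += reward

    return estimates, counts, total_reward


if __name__ == "__main__":
    estimates, counts, total = epsilon_greedy_bandit([0.2, 0.5, 0.8])
    print("estimated win rates:", [round(e, 2) for e in estimates])
    print("pulls per lever:", counts)
    print("total reward:", total)
```

With `epsilon=0` the agent can lock onto whichever lever paid off first, which is the "stuck in sub-optimal choices" failure described above; a larger `epsilon` explores more but spends more pulls on levers already known to be poor.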

Alongside an exploration strategy, it is still necessary to choose the criterion that makes one action optimal compared to another. The study of this property has led to several methods, from brute force to methods that take into account temporal differences in the received reward. Despite this, and despite the great results reinforcement learning methods have obtained on small problems, the approach suffers from a lack of scalability and has difficulty solving larger, close-to-human scenarios.
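As one concrete instance of a temporal-difference method, the sketch below uses tabular Q-learning, a standard algorithm that the article does not name, on a hypothetical five-cell corridor with a reward of 1 at the right end. The function name `q_learning_corridor`, the environment and all parameter values are assumptions made for illustration.

```python
import random


def q_learning_corridor(length=5, episodes=500, alpha=0.5, gamma=0.9,
                        epsilon=0.1, seed=0):
    """Tabular Q-learning on a toy corridor (cells 0..length-1, goal at the right).

    Returns a dict mapping (state, action) to the learned value estimate,
    where action is -1 (left) or +1 (right).
    """
    rng = random.Random(seed)
    actions = (-1, 1)
    goal = length - 1
    q = {(s, a): 0.0 for s in range(length) for a in actions}

    for _ in range(episodes):
        state = 0
        for _ in range(100):  # step cap so every episode terminates
            if rng.random() < epsilon:
                action = rng.choice(actions)  # explore
            else:
                # exploit, breaking ties at random so early episodes still move
                best = max(q[(state, a)] for a in actions)
                action = rng.choice([a for a in actions if q[(state, a)] == best])

            next_state = max(0, min(goal, state + action))
            done = next_state == goal
            reward = 1.0 if done else 0.0

            # Temporal-difference target: reward now plus discounted best
            # estimate of what follows (nothing follows a terminal state).
            if done:
                target = reward
            else:
                target = reward + gamma * max(q[(next_state, a)] for a in actions)

            # Move the old estimate a fraction alpha toward the target.
            q[(state, action)] += alpha * (target - q[(state, action)])

            state = next_state
            if done:
                break

    return q


if __name__ == "__main__":
    q = q_learning_corridor()
    for s in range(4):
        print(s, "left:", round(q[(s, -1)], 3), "right:", round(q[(s, 1)], 3))
```

The table has one entry per state-action pair, which is workable for five cells but grows quickly with the size of the state space. That growth is one way to see the scalability concern raised above.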

Further Reading & References

See Also

Draft papers for REALab and Decoupled Approval on tampering

28 Oct 2020 16:01 UTC
47 points
2 comments1 min readLW link

EfficientZero: How It Works

1a3orn26 Nov 2021 15:17 UTC
258 points
42 comments29 min readLW link

Book Review: Reinforcement Learning by Sutton and Barto

billmei20 Oct 2020 19:40 UTC
52 points
3 comments10 min readLW link

Jitters No Evidence of Stupidity in RL

1a3orn16 Sep 2021 22:43 UTC
82 points
18 comments3 min readLW link

Reinforcement Learning in the Iterated Amplification Framework

William_S9 Feb 2019 0:56 UTC
25 points
12 comments4 min readLW link

Reinforcement learning with imperceptible rewards

Vanessa Kosoy7 Apr 2019 10:27 UTC
24 points
1 comment29 min readLW link

Reinforcement Learning: A Non-Standard Introduction (Part 1)

royf29 Jul 2012 0:13 UTC
33 points
19 comments2 min readLW link

Reinforcement, Preference and Utility

royf8 Aug 2012 6:23 UTC
13 points
5 comments3 min readLW link

Reinforcement Learning: A Non-Standard Introduction (Part 2)

royf2 Aug 2012 8:17 UTC
16 points
7 comments3 min readLW link

Applying reinforcement learning theory to reduce felt temporal distance

Kaj_Sotala26 Jan 2014 9:17 UTC
19 points
6 comments3 min readLW link

Imitative Reinforcement Learning as an AGI Approach

TIMUR ZEKI VURAL21 May 2018 14:47 UTC
1 point
1 comment1 min readLW link

Delegative Reinforcement Learning with a Merely Sane Advisor

Vanessa Kosoy5 Oct 2017 14:15 UTC
1 point
2 comments14 min readLW link

Inverse reinforcement learning on self, pre-ontology-change

Stuart_Armstrong18 Nov 2015 13:23 UTC
0 points
0 comments1 min readLW link

Clarification: Behaviourism & Reinforcement

Zaine10 Oct 2012 5:30 UTC
13 points
30 comments2 min readLW link

Is Global Reinforcement Learning (RL) a Fantasy?

[deleted]31 Oct 2016 1:49 UTC
4 points
50 comments12 min readLW link

Delegative Inverse Reinforcement Learning

Vanessa Kosoy12 Jul 2017 12:18 UTC
15 points
0 comments16 min readLW link

Vector-Valued Reinforcement Learning

orthonormal1 Nov 2016 0:21 UTC
2 points
0 comments4 min readLW link

Cooperative Inverse Reinforcement Learning vs. Irrational Human Preferences

orthonormal18 Jun 2016 0:55 UTC
13 points
0 comments3 min readLW link

Evolution as Backstop for Reinforcement Learning: multi-level paradigms

gwern12 Jan 2019 17:45 UTC
17 points
0 comments1 min readLW link

IRL 1/8: Inverse Reinforcement Learning and the problem of degeneracy

RAISE4 Mar 2019 13:11 UTC
20 points
2 comments1 min readLW link

Keeping up with deep reinforcement learning research: /r/reinforcementlearning

gwern16 May 2017 19:12 UTC
6 points
2 comments1 min readLW link

psychology and applications of reinforcement learning: where do I learn more?

jsalvatier26 Jun 2011 20:56 UTC
5 points
1 comment1 min readLW link

Reward/value learning for reinforcement learning

Stuart_Armstrong2 Jun 2017 16:34 UTC
0 points
0 comments2 min readLW link

Model Mis-specification and Inverse Reinforcement Learning

9 Nov 2018 15:33 UTC
31 points
3 comments16 min readLW link

“Human-level control through deep reinforcement learning”—computer learns 49 different games

skeptical_lurker26 Feb 2015 6:21 UTC
19 points
19 comments1 min readLW link

Making a Difference Tempore: Insights from ‘Reinforcement Learning: An Introduction’

TurnTrout5 Jul 2018 0:34 UTC
33 points
6 comments8 min readLW link

Sufficiently Advanced Language Models Can Do Reinforcement Learning

Zachary Robertson2 Aug 2020 15:32 UTC
21 points
7 comments7 min readLW link

Problems integrating decision theory and inverse reinforcement learning

agilecaveman8 May 2018 5:11 UTC
7 points
2 comments3 min readLW link

“AIXIjs: A Software Demo for General Reinforcement Learning”, Aslanides 2017

gwern29 May 2017 21:09 UTC
7 points
1 comment1 min readLW link

Some work on connecting UDT and Reinforcement Learning

IAFF-User-11117 Dec 2015 23:58 UTC
4 points
0 comments1 min readLW link

[Question] What problem would you like to see Reinforcement Learning applied to?

Julian Schrittwieser8 Jul 2020 2:40 UTC
43 points
4 comments1 min readLW link

[Question] What messy problems do you see Deep Reinforcement Learning applicable to?

Riccardo Volpato5 Apr 2020 17:43 UTC
5 points
0 comments1 min readLW link

[Question] Can coherent extrapolated volition be estimated with Inverse Reinforcement Learning?

Jade Bishop15 Apr 2019 3:23 UTC
12 points
5 comments3 min readLW link

Modeling the capabilities of advanced AI systems as episodic reinforcement learning

jessicata19 Aug 2016 2:52 UTC
4 points
0 comments5 min readLW link

FHI is accepting applications for internships in the area of AI Safety and Reinforcement Learning

crmflynn7 Nov 2016 16:33 UTC
9 points
0 comments1 min readLW link

AXRP Episode 1 - Adversarial Policies with Adam Gleave

DanielFilan29 Dec 2020 20:41 UTC
12 points
5 comments33 min readLW link

AXRP Episode 3 - Negotiable Reinforcement Learning with Andrew Critch

DanielFilan29 Dec 2020 20:45 UTC
26 points
0 comments27 min readLW link

Multi-dimensional rewards for AGI interpretability and control

Steven Byrnes4 Jan 2021 3:08 UTC
11 points
7 comments10 min readLW link

Is RL involved in sensory processing?

Steven Byrnes18 Mar 2021 13:57 UTC
19 points
4 comments5 min readLW link

My AGI Threat Model: Misaligned Model-Based RL Agent

Steven Byrnes25 Mar 2021 13:45 UTC
64 points
40 comments16 min readLW link

My take on Michael Littman on “The HCI of HAI”

Alex Flint2 Apr 2021 19:51 UTC
56 points
4 comments7 min readLW link

Notes from “Don’t Shoot the Dog”

juliawise2 Apr 2021 16:34 UTC
204 points
10 comments12 min readLW link

Big picture of phasic dopamine

Steven Byrnes8 Jun 2021 13:07 UTC
58 points
18 comments36 min readLW link

Supplement to “Big picture of phasic dopamine”

Steven Byrnes8 Jun 2021 13:08 UTC
12 points
2 comments9 min readLW link

Reward Is Not Enough

Steven Byrnes16 Jun 2021 13:52 UTC
92 points
18 comments10 min readLW link

A model of decision-making in the brain (the short version)

Steven Byrnes18 Jul 2021 14:39 UTC
14 points
0 comments3 min readLW link

DeepMind: Generally capable agents emerge from open-ended play

Daniel Kokotajlo27 Jul 2021 14:19 UTC
245 points
53 comments2 min readLW link

Training My Friend to Cook

lsusr29 Aug 2021 5:54 UTC
66 points
33 comments3 min readLW link

Multi-Agent Inverse Reinforcement Learning: Suboptimal Demonstrations and Alternative Solution Concepts

sage_bergerson7 Sep 2021 16:11 UTC
5 points
0 comments1 min readLW link

My take on Vanessa Kosoy’s take on AGI safety

Steven Byrnes30 Sep 2021 12:23 UTC
75 points
10 comments31 min readLW link

Scalar reward is not enough for aligned AGI

Peter Vamplew17 Jan 2022 21:02 UTC
15 points
3 comments11 min readLW link

Emotions = Reward Functions

jpyykko20 Jan 2022 18:46 UTC
16 points
10 comments5 min readLW link

[Intro to brain-like-AGI safety] 5. The “long-term predictor”, and TD learning

Steven Byrnes23 Feb 2022 14:44 UTC
30 points
21 comments22 min readLW link

[Intro to brain-like-AGI safety] 6. Big picture of motivation, decision-making, and RL

Steven Byrnes2 Mar 2022 15:26 UTC
25 points
13 comments16 min readLW link


Ansh Radhakrishnan12 May 2022 21:18 UTC
13 points
5 comments5 min readLW link

AlphaStar: Impressive for RL progress, not for AGI progress

orthonormal2 Nov 2019 1:50 UTC
113 points
58 comments2 min readLW link1 review

[Question] How is reinforcement learning possible in non-sentient agents?

SomeoneKind5 Jan 2021 20:57 UTC
3 points
5 comments1 min readLW link

Which animals can suffer?

Just Learning1 Jun 2021 3:42 UTC
7 points
16 comments1 min readLW link

Instrumental Convergence: Power as Rademacher Complexity

Zachary Robertson12 Aug 2021 16:02 UTC
6 points
0 comments3 min readLW link

Extraction of human preferences 👨→🤖

arunraja-hub24 Aug 2021 16:34 UTC
18 points
2 comments5 min readLW link

A brief review of the reasons multi-objective RL could be important in AI Safety Research

Ben Smith29 Sep 2021 17:09 UTC
27 points
7 comments10 min readLW link

Proposal: Scaling laws for RL generalization

flodorner1 Oct 2021 21:32 UTC
14 points
10 comments11 min readLW link

[Proposal] Method of locating useful subnets in large models

Quintin Pope13 Oct 2021 20:52 UTC
9 points
0 comments2 min readLW link

EfficientZero: human ALE sample-efficiency w/MuZero+self-supervised

gwern2 Nov 2021 2:32 UTC
134 points
52 comments1 min readLW link

Behavior Cloning is Miscalibrated

leogao5 Dec 2021 1:36 UTC
52 points
3 comments3 min readLW link

Demanding and Designing Aligned Cognitive Architectures

Koen.Holtman21 Dec 2021 17:32 UTC
8 points
5 comments5 min readLW link

Reinforcement Learning Study Group

Kay Kozaronek26 Dec 2021 23:11 UTC
20 points
9 comments1 min readLW link

Question 1: Predicted architecture of AGI learning algorithm(s)

Cameron Berg10 Feb 2022 17:22 UTC
9 points
1 comment7 min readLW link

[Question] What is a training “step” vs. “episode” in machine learning?

Evan R. Murphy28 Apr 2022 21:53 UTC
9 points
4 comments1 min readLW link

Open Problems in Negative Side Effect Minimization

6 May 2022 9:37 UTC
12 points
3 comments17 min readLW link

RL with KL penalties is better seen as Bayesian inference

25 May 2022 9:23 UTC
69 points
12 comments12 min readLW link

Machines vs Memes Part 1: AI Alignment and Memetics

Harriet Farlow31 May 2022 22:03 UTC
16 points
0 comments6 min readLW link

Machines vs Memes Part 3: Imitation and Memes

ceru231 Jun 2022 13:36 UTC
5 points
0 comments7 min readLW link

[Link] OpenAI: Learning to Play Minecraft with Video PreTraining (VPT)

Aryeh Englander23 Jun 2022 16:29 UTC
53 points
3 comments1 min readLW link