
Reward Functions


Speedrun ruiner research idea

lukehmiles · 13 Apr 2024 23:42 UTC
4 points
11 comments · 2 min read · LW link

Intrinsic Drives and Extrinsic Misuse: Two Intertwined Risks of AI

jsteinhardt · 31 Oct 2023 5:10 UTC
40 points
0 comments · 12 min read · LW link
(bounded-regret.ghost.io)

VLM-RM: Specifying Rewards with Natural Language

23 Oct 2023 14:11 UTC
20 points
2 comments · 5 min read · LW link
(far.ai)

Some alignment ideas

SelonNerias · 10 Aug 2023 17:51 UTC
1 point
0 comments · 11 min read · LW link

self-improvement-executors are not goal-maximizers

bhauth · 1 Jun 2023 20:46 UTC
14 points
0 comments · 1 min read · LW link

Shutdown-Seeking AI

Simon Goldstein · 31 May 2023 22:19 UTC
48 points
31 comments · 15 min read · LW link

Language Agents Reduce the Risk of Existential Catastrophe

28 May 2023 19:10 UTC
30 points
14 comments · 26 min read · LW link

A Short Dialogue on the Meaning of Reward Functions

19 Nov 2022 21:04 UTC
45 points
0 comments · 3 min read · LW link

Learning societal values from law as part of an AGI alignment strategy

John Nay · 21 Oct 2022 2:03 UTC
5 points
18 comments · 54 min read · LW link

Scaling Laws for Reward Model Overoptimization

20 Oct 2022 0:20 UTC
102 points
13 comments · 1 min read · LW link
(arxiv.org)

Four usages of “loss” in AI

TurnTrout · 2 Oct 2022 0:52 UTC
43 points
18 comments · 4 min read · LW link

Reward IS the Optimization Target

Carn · 28 Sep 2022 17:59 UTC
−2 points
3 comments · 5 min read · LW link

Leveraging Legal Informatics to Align AI

John Nay · 18 Sep 2022 20:39 UTC
11 points
0 comments · 3 min read · LW link
(forum.effectivealtruism.org)

An investigation into when agents may be incentivized to manipulate our beliefs.

Felix Hofstätter · 13 Sep 2022 17:08 UTC
15 points
0 comments · 14 min read · LW link

[Question] Seriously, what goes wrong with “reward the agent when it makes you smile”?

TurnTrout · 11 Aug 2022 22:22 UTC
86 points
42 comments · 2 min read · LW link

Reward is not the optimization target

TurnTrout · 25 Jul 2022 0:03 UTC
348 points
123 comments · 10 min read · LW link · 3 reviews

Reward model hacking as a challenge for reward learning

Erik Jenner · 12 Apr 2022 9:39 UTC
25 points
1 comment · 9 min read · LW link

$100/$50 rewards for good references

Stuart_Armstrong · 3 Dec 2021 16:55 UTC
20 points
5 comments · 1 min read · LW link

Draft papers for REALab and Decoupled Approval on tampering

28 Oct 2020 16:01 UTC
47 points
2 comments · 1 min read · LW link

Probabilities, weights, sums: pretty much the same for reward functions

Stuart_Armstrong · 20 May 2020 15:19 UTC
11 points
1 comment · 2 min read · LW link

Reward functions and updating assumptions can hide a multitude of sins

Stuart_Armstrong · 18 May 2020 15:18 UTC
16 points
2 comments · 9 min read · LW link

Utility ≠ Reward

Vlad Mikulik · 5 Sep 2019 17:28 UTC
121 points
24 comments · 1 min read · LW link · 2 reviews

Thoughts on reward engineering

paulfchristiano · 24 Jan 2019 20:15 UTC
30 points
30 comments · 11 min read · LW link

The reward engineering problem

paulfchristiano · 16 Jan 2019 18:47 UTC
26 points
3 comments · 7 min read · LW link

Reward function learning: the value function

Stuart_Armstrong · 24 Apr 2018 16:29 UTC
10 points
0 comments · 11 min read · LW link

Reward function learning: the learning process

Stuart_Armstrong · 24 Apr 2018 12:56 UTC
6 points
11 comments · 8 min read · LW link

Utility versus Reward function: partial equivalence

Stuart_Armstrong · 13 Apr 2018 14:58 UTC
18 points
5 comments · 5 min read · LW link

Intuitive examples of reward function learning?

Stuart_Armstrong · 6 Mar 2018 16:54 UTC
7 points
3 comments · 2 min read · LW link

Why we want unbiased learning processes

Stuart_Armstrong · 20 Feb 2018 14:48 UTC
13 points
3 comments · 3 min read · LW link