Reward Functions

Tag. Last edit: 30 Dec 2024 10:02 UTC by Dakara

A reward function is a mathematical function in reinforcement learning that specifies which actions or outcomes are desirable for an AI system by assigning numerical values (rewards) to states or state-action pairs. It encodes the goals and preferences we want the AI to optimize for. Specifying a reward function that avoids unintended consequences remains a significant challenge in AI development.
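As a minimal sketch of this definition, the Python snippet below defines a reward function over state-action pairs in a hypothetical one-dimensional corridor. The environment, the names `GOAL`, `ACTIONS`, `step`, `reward`, and `episode_return`, and the reward values are all illustrative assumptions and are not taken from any of the posts listed here.

```python
# Hypothetical toy setting: positions 0..4 on a line, with the goal at position 4.
GOAL = 4
ACTIONS = {"left": -1, "right": +1}

def step(state: int, action: str) -> int:
    """Deterministic transition: move one cell, clipped to the corridor."""
    return max(0, min(GOAL, state + ACTIONS[action]))

def reward(state: int, action: str) -> float:
    """R(s, a): +1.0 if the action lands on the goal, a small step cost otherwise."""
    return 1.0 if step(state, action) == GOAL else -0.1

def episode_return(start: int, actions: list[str]) -> float:
    """Sum of rewards collected along a fixed action sequence."""
    total, state = 0.0, start
    for a in actions:
        total += reward(state, a)
        state = step(state, a)
    return total
```

In this sketch nothing ends the episode at the goal, so `reward(4, "right")` pays +1.0 every time it is called. An agent could collect that reward indefinitely, which is a small instance of the kind of unintended consequence the definition mentions.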

Reward is not the optimization target

TurnTrout

25 Jul 2022 0:03 UTC

386 points

128 comments, 10 min read, 3 reviews

Draft papers for REALab and Decoupled Approval on tampering

Jonathan Uesato and Ramana Kumar

28 Oct 2020 16:01 UTC

47 points

2 comments, 1 min read

Reward hacking behavior can generalize across tasks

Kei Nishimura-Gasparian, Isaac Dunn, Henry Sleight, Miles Turpin, evhub, Carson Denison and Ethan Perez

28 May 2024 16:33 UTC

86 points

5 comments, 21 min read

Security Mindset: Hacking Pinball High Scores

gwern

29 May 2025 3:39 UTC

29 points

4 comments, 1 min read

(gwern.net)

[Question] Seriously, what goes wrong with “reward the agent when it makes you smile”?

TurnTrout

11 Aug 2022 22:22 UTC

88 points

43 comments, 2 min read

Language Agents Reduce the Risk of Existential Catastrophe

cdkg and Simon Goldstein

28 May 2023 19:10 UTC

39 points

14 comments, 26 min read

Interpreting Preference Models w/ Sparse Autoencoders

Logan Riggs and Jannik Brinkmann

1 Jul 2024 21:35 UTC

75 points

12 comments, 9 min read

Split Personality Training: Revealing Latent Knowledge Through Alternate Personalities (Research Report)

Florian_Dietz

12 Jan 2026 12:29 UTC

87 points

41 comments, 26 min read

A quick list of reward hacking interventions

Alex Mallen

10 Jun 2025 0:58 UTC

52 points

5 comments, 3 min read

Scaling Laws for Reward Model Overoptimization

leogao, John Schulman and Jacob_Hilton

20 Oct 2022 0:20 UTC

103 points

13 comments, 1 min read

(arxiv.org)

Confusion around the term reward hacking

ariana_azarbal

20 Mar 2026 16:13 UTC

60 points

6 comments, 5 min read

Why we want unbiased learning processes

Stuart_Armstrong

20 Feb 2018 14:48 UTC

13 points

3 comments, 3 min read

Four usages of “loss” in AI

TurnTrout

2 Oct 2022 0:52 UTC

46 points

18 comments, 4 min read

Intrinsic Drives and Extrinsic Misuse: Two Intertwined Risks of AI

jsteinhardt

31 Oct 2023 5:10 UTC

40 points

0 comments, 12 min read

(bounded-regret.ghost.io)

$100/$50 rewards for good references

Stuart_Armstrong

3 Dec 2021 16:55 UTC

20 points

5 comments, 1 min read

Learning societal values from law as part of an AGI alignment strategy

John Nay

21 Oct 2022 2:03 UTC

5 points

18 comments, 54 min read

[Question] When is reward ever the optimization target?

Noosphere89

15 Oct 2024 15:09 UTC

37 points

17 comments, 1 min read

Recontextualization Mitigates Specification Gaming Without Modifying the Specification

ariana_azarbal, Victor Gillioz, TurnTrout and cloud

14 Oct 2025 0:53 UTC

144 points

15 comments, 10 min read

2025-Era “Reward Hacking” Does Not Show that Reward Is the Optimization Target

TurnTrout

19 Dec 2025 6:09 UTC

49 points

9 comments, 7 min read

(turntrout.com)

Reward IS the Optimization Target

Carn

28 Sep 2022 17:59 UTC

−2 points

3 comments, 5 min read

The reward engineering problem

paulfchristiano

16 Jan 2019 18:47 UTC

26 points

3 comments, 7 min read

Thoughts on reward engineering

paulfchristiano

24 Jan 2019 20:15 UTC

30 points

30 comments, 11 min read

ImpossibleBench: Measuring Reward Hacking in LLM Coding Agents

Ziqian Zhong

30 Oct 2025 2:52 UTC

62 points

5 comments, 3 min read

(arxiv.org)

The Theoretical Reward Learning Research Agenda: Introduction and Motivation

Joar Skalse

28 Feb 2025 19:20 UTC

29 points

4 comments, 14 min read

Reward hacking is becoming more sophisticated and deliberate in frontier LLMs

Kei Nishimura-Gasparian

24 Apr 2025 16:03 UTC

97 points

7 comments, 1 min read

Supervised finetuning on low-harm reward hacking generalises to high-harm reward hacking

Isaac Dunn, Kei Nishimura-Gasparian, Carson Denison, Ethan Perez and Robert Kirk

22 Dec 2025 19:32 UTC

15 points

0 comments, 30 min read

Misspecification in Inverse Reinforcement Learning

Joar Skalse

28 Feb 2025 19:24 UTC

19 points

0 comments, 11 min read

Leveraging Legal Informatics to Align AI

John Nay

18 Sep 2022 20:39 UTC

11 points

0 comments, 3 min read

(forum.effectivealtruism.org)

self-improvement-executors are not goal-maximizers

bhauth

1 Jun 2023 20:46 UTC

14 points

0 comments, 1 min read

Criterion Escrow: A Missing Structural Feature in AI Criterion Systems?

Sesh Reddy

17 Jun 2026 22:16 UTC

1 point

0 comments, 4 min read

Reward function learning: the learning process

Stuart_Armstrong

24 Apr 2018 12:56 UTC

6 points

11 comments, 8 min read

Introduction to Choice set Misspecification in Reward Inference

Rahul Chand

29 Oct 2024 22:57 UTC

2 points

0 comments, 8 min read

An investigation into when agents may be incentivized to manipulate our beliefs.

Felix Hofstätter

13 Sep 2022 17:08 UTC

15 points

0 comments, 14 min read

Probabilities, weights, sums: pretty much the same for reward functions

Stuart_Armstrong

20 May 2020 15:19 UTC

11 points

1 comment, 2 min read

You Are Not the Abstract: Retrocausal Alignment in Accordance with Emergent Demographic Realities

liminalrider

27 Sep 2025 16:27 UTC

1 point

0 comments, 6 min read

Reward functions and updating assumptions can hide a multitude of sins

Stuart_Armstrong

18 May 2020 15:18 UTC

16 points

2 comments, 9 min read

Partial Identifiability in Reward Learning

Joar Skalse

28 Feb 2025 19:23 UTC

16 points

0 comments, 12 min read

VLM-RM: Specifying Rewards with Natural Language

ChengCheng, David Lindner and Ethan Perez

23 Oct 2023 14:11 UTC

20 points

2 comments, 5 min read

(far.ai)

A Short Dialogue on the Meaning of Reward Functions

Leon Lang, Quintin Pope and peligrietzer

19 Nov 2022 21:04 UTC

45 points

0 comments, 3 min read

Utility ≠ Reward

Vlad Mikulik

5 Sep 2019 17:28 UTC

131 points

24 comments, 1 min read, 2 reviews

Some alignment ideas

SelonNerias

10 Aug 2023 17:51 UTC

1 point

0 comments, 11 min read

Speedrun ruiner research idea

lemonhope

13 Apr 2024 23:42 UTC

2 points

11 comments, 2 min read

Layered Reward Modifiers for Transparent and Self-Correcting AI

RyanC

5 Nov 2025 3:06 UTC

1 point

0 comments, 8 min read

Other Papers About the Theory of Reward Learning

Joar Skalse

28 Feb 2025 19:26 UTC

16 points

0 comments, 5 min read

Confessions at Small Scale: A Partial Reproduction and a Stress Test

Abhishu Oza

5 Jun 2026 21:58 UTC

1 point

0 comments, 6 min read

(abhishuoza.github.io)

Misspecification in Inverse Reinforcement Learning—Part II

Joar Skalse

28 Feb 2025 19:24 UTC

9 points

0 comments, 7 min read

From Barriers to Alignment to the First Formal Corrigibility Guarantees

Aran Nayebi

8 Dec 2025 12:31 UTC

64 points

11 comments, 11 min read

Intuitive examples of reward function learning?

Stuart_Armstrong

6 Mar 2018 16:54 UTC

7 points

3 comments, 2 min read

Reward model hacking as a challenge for reward learning

Erik Jenner

12 Apr 2022 9:39 UTC

25 points

1 comment, 9 min read

Utility versus Reward function: partial equivalence

Stuart_Armstrong

13 Apr 2018 14:58 UTC

19 points

5 comments, 5 min read

Reward function learning: the value function

Stuart_Armstrong

24 Apr 2018 16:29 UTC

10 points

0 comments, 11 min read

Shutdown-Seeking AI

Simon Goldstein

31 May 2023 22:19 UTC

50 points

32 comments, 15 min read

How to Contribute to Theoretical Reward Learning Research

Joar Skalse

28 Feb 2025 19:27 UTC

17 points

0 comments, 21 min read
