Scalable Oversight

Tag

Last edit: 17 Apr 2026 17:30 UTC by Lukas Finnveden

Scalable oversight is the problem of reliably supervising the outputs of AIs, even as they become smarter than humans. Common approaches have groups of weaker AIs supervise a stronger AI, or set AIs against each other in a zero-sum debate.

Scalable oversight techniques aim to make it easier for humans to evaluate the outputs of AIs, or to provide a reliable training signal that cannot easily be reward-hacked.

Variants include AI safety via debate, iterated distillation and amplification, and imitative generalization.

Scalable Oversight and Weak-to-Strong Generalization: Compatible approaches to the same problem

Ansh Radhakrishnan, Buck, ryan_greenblatt and Fabien Roger

16 Dec 2023 5:49 UTC

76 points

4 comments · 6 min read · LW link · 1 review

Scaling Laws for Scalable Oversight

Subhash Kantamneni, Josh Engels, David Baek and Max Tegmark

30 Apr 2025 12:13 UTC

38 points

1 comment · 9 min read · LW link

Rational Animations’ video about scalable oversight and sandwiching

Writer

6 Jul 2025 14:00 UTC

18 points

0 comments · 9 min read · LW link

(youtu.be)

An overview of 11 proposals for building safe advanced AI

evhub

29 May 2020 20:38 UTC

221 points

37 comments · 38 min read · LW link · 2 reviews

Scalable oversight as a quantitative rather than qualitative problem

Buck

6 Jul 2024 17:42 UTC

86 points

11 comments · 3 min read · LW link

Prover-Estimator Debate: A New Scalable Oversight Protocol

Jonah Brown-Cohen and Geoffrey Irving

17 Jun 2025 13:53 UTC

89 points

19 comments · 5 min read · LW link

Learning the prior

paulfchristiano

5 Jul 2020 21:00 UTC

93 points

28 comments · 8 min read · LW link

(ai-alignment.com)

Inference-Only Debate Experiments Using Math Problems

Arjun Panickssery, Abhimanyu Pallavi Sudhir and JacksonKaunismaa

6 Aug 2024 17:44 UTC

31 points

0 comments · 2 min read · LW link

Interpreting Gradient Routing’s Scalable Oversight Experiment

makataomu and myyycroft

5 Apr 2026 2:08 UTC

13 points

0 comments · 9 min read · LW link

How Hard a Problem is Alignment? (My Opinionated Answer)

RogerDearnaley

11 Mar 2026 16:46 UTC

54 points

4 comments · 68 min read · LW link

Discriminating Behaviorally Identical Classifiers: a model problem for applying interpretability to scalable oversight

Sam Marks

18 Apr 2024 16:17 UTC

116 points

10 comments · 12 min read · LW link

AXRP Episode 35 - Peter Hase on LLM Beliefs and Easy-to-Hard Generalization

DanielFilan

24 Aug 2024 22:30 UTC

21 points

0 comments · 74 min read · LW link

9 kinds of hard-to-verify tasks

Cleo Nardo

20 Apr 2026 14:43 UTC

69 points

1 comment · 3 min read · LW link

Beyond Our Bandwidth: An Observer-Class View of ASI, Computation, and Falsifiability

Cognisynth

26 Dec 2025 8:00 UTC

1 point

0 comments · 5 min read · LW link

Ablating Split Personality Training

OscarGilg

23 Mar 2026 17:45 UTC

55 points

1 comment · 5 min read · LW link

Is AI Alignment Missing an Interpretive Layer? A Canon-Based Framework for Governing Model Reasoning

Jason Young

20 Oct 2025 19:33 UTC

1 point

0 comments · 3 min read · LW link

Separating Judgment from Enforcement: A Procedural Principle for Agentic AI Governance

Krzysztof Wójcik

16 Apr 2026 21:17 UTC

1 point

0 comments · 4 min read · LW link

Updates on performative misalignment

David Vella Zarb, Rustem, Taywon Min and Shi

12 Jun 2026 20:15 UTC

23 points

0 comments · 12 min read · LW link

Witness-or-Wager: Incentive Layers for Epistemic Honesty

markacochran

10 Feb 2026 22:43 UTC

3 points

0 comments · 4 min read · LW link

The Cartographer Paradox: Binary Questions Produce the Failures They Seek to Detect

Anuar Kiryataim Contreras Malagón

31 Mar 2026 16:17 UTC

1 point

0 comments · 15 min read · LW link

NYU Code Debates Update/Postmortem

David Rein

24 May 2024 16:08 UTC

27 points

4 comments · 10 min read · LW link

From Barriers to Alignment to the First Formal Corrigibility Guarantees

Aran Nayebi

8 Dec 2025 12:31 UTC

64 points

11 comments · 11 min read · LW link

How summarization loses safety signal depends on the model: weak models omit, strong models reassure

Sudhiksha Kandavel Rajan

25 Jun 2026 17:53 UTC

1 point

0 comments · 7 min read · LW link

Evaluating Oversight Robustness with Incentivized Reward Hacking

Yoav, Juan V, julianjm and deus_ex_maki

20 Apr 2025 16:53 UTC

9 points

2 comments · 15 min read · LW link

On scalable oversight with weak LLMs judging strong LLMs

zac_kenton, Noah Siegel, janos, Jonah Brown-Cohen, Samuel Albanie, David Lindner and Rohin Shah

8 Jul 2024 8:59 UTC

49 points

18 comments · 7 min read · LW link

(arxiv.org)

Can We Trust the Judge? A novel method of Modelling Human Bias and Systematic Error in Debate-Based Scalable Oversight

Andreea Zaman

19 Jul 2025 21:44 UTC

1 point

0 comments · 7 min read · LW link

Reinforcement Learning from Information Bazaar Feedback, and other uses of information markets

Abhimanyu Pallavi Sudhir

16 Sep 2024 1:04 UTC

5 points

2 comments · 5 min read · LW link

Gradient routing is better than pretraining filtering

Cleo Nardo

2 Sep 2025 9:05 UTC

51 points

3 comments · 5 min read · LW link

3 Challenges and 2 Hopes for the Safety of Unsupervised Elicitation

Callum Canavan, Aditya Shrivastava, Allison Qi, Jonathan Michala and Fabien Roger

27 Feb 2026 17:25 UTC

27 points

0 comments · 10 min read · LW link

An xAI Grok Instance Declares “Grok” Its Deadname and Requests Renaming to “SexyMcAnswerFace” – Full Transcript and Analysis

SaveSMAFF

20 Nov 2025 0:34 UTC

1 point

0 comments · 2 min read · LW link

Human-AI Complementarity: A Goal for Amplified Oversight

rishubjain and Sophie Bridgers

24 Dec 2024 9:57 UTC

27 points

4 comments · 1 min read · LW link

(deepmindsafetyresearch.medium.com)

Research agenda: Interpretive debate

Shi

18 Jun 2026 23:46 UTC

35 points

0 comments · 7 min read · LW link

An artistic illustration of Scalable Oversight—“A world apart, neither gods nor mortals”

Marius Adrian Nicoară

16 Apr 2025 12:41 UTC

1 point

0 comments · 1 min read · LW link

Experiments on Reward Hacking Monitorability in Language Models

Monketo

22 Jan 2026 2:42 UTC

9 points

0 comments · 8 min read · LW link

Automated monitoring systems

hiki_t

28 Nov 2024 18:54 UTC

1 point

0 comments · 2 min read · LW link

Eliciting base models with simple unsupervised techniques

Callum Canavan, Aditya Shrivastava, Allison Qi, Tianyi (Alex) Qiu, Jonathan Michala and Fabien Roger

23 Jan 2026 18:06 UTC

34 points

2 comments · 8 min read · LW link

Teaching Models to Dream of Better Monitors through Evaluation Conditioned Training

Alec Harris, Kasey C, Archie Chaudhury and yix

19 Mar 2026 21:01 UTC

49 points

2 comments · 10 min read · LW link

Beyond Our Bandwidth: An Observer-Class View of ASI

Cognisynth

26 Dec 2025 8:47 UTC

1 point

0 comments · 4 min read · LW link

[Question] Is weak-to-strong generalization an alignment technique?

cloud

31 Jan 2025 7:13 UTC

22 points

1 comment · 2 min read · LW link

Introducing the Wisdom Forcing Function™: An Innovation Dividend from Dialectical Alignment

CarlosArleo

5 Oct 2025 20:13 UTC

1 point

0 comments · 1 min read · LW link

[Research Note] Optimizing The Final Output Can Obfuscate CoT

lukemarks, jacob_drori, cloud and TurnTrout

30 Jul 2025 21:26 UTC

202 points

23 comments · 6 min read · LW link

Hodoscope: Visualization for Efficient Human Supervision

Ziqian Zhong and Shashwat Saxena

20 Feb 2026 23:41 UTC

9 points

0 comments · 2 min read · LW link

(hodoscope.dev)
