Scalable Oversight

Tag · Last edit: 18 Apr 2024 19:57 UTC by Raemon

Scaling Laws for Scalable Oversight

30 Apr 2025 12:13 UTC
37 points
1 comment · 9 min read · LW link

Learning the prior

paulfchristiano · 5 Jul 2020 21:00 UTC
92 points
28 comments · 8 min read · LW link
(ai-alignment.com)

Inference-Only Debate Experiments Using Math Problems

6 Aug 2024 17:44 UTC
31 points
0 comments · 2 min read · LW link

Rational Animations’ video about scalable oversight and sandwiching

Writer · 6 Jul 2025 14:00 UTC
18 points
0 comments · 9 min read · LW link
(youtu.be)

Gradient routing is better than pretraining filtering

Cleo Nardo · 2 Sep 2025 9:05 UTC
46 points
3 comments · 5 min read · LW link

Discriminating Behaviorally Identical Classifiers: a model problem for applying interpretability to scalable oversight

Sam Marks · 18 Apr 2024 16:17 UTC
113 points
10 comments · 12 min read · LW link

AXRP Episode 35 - Peter Hase on LLM Beliefs and Easy-to-Hard Generalization

DanielFilan · 24 Aug 2024 22:30 UTC
21 points
0 comments · 74 min read · LW link

Scalable oversight as a quantitative rather than qualitative problem

Buck · 6 Jul 2024 17:42 UTC
86 points
11 comments · 3 min read · LW link

Is AI Alignment Missing an Interpretive Layer? A Canon-Based Framework for Governing Model Reasoning

Jason Young · 20 Oct 2025 19:33 UTC
1 point
0 comments · 3 min read · LW link

Gradient Routing: Masking Gradients to Localize Computation in Neural Networks

6 Dec 2024 22:19 UTC
169 points
14 comments · 11 min read · LW link
(arxiv.org)

Prover-Estimator Debate: A New Scalable Oversight Protocol

17 Jun 2025 13:53 UTC
88 points
18 comments · 5 min read · LW link

NYU Code Debates Update/Postmortem

David Rein · 24 May 2024 16:08 UTC
27 points
4 comments · 10 min read · LW link

Evaluating Oversight Robustness with Incentivized Reward Hacking

20 Apr 2025 16:53 UTC
7 points
2 comments · 15 min read · LW link

Model Parameters as a Steganographic Private Channel

Lennart Finke · 27 Oct 2025 16:08 UTC
7 points
0 comments · 5 min read · LW link

On scalable oversight with weak LLMs judging strong LLMs

8 Jul 2024 8:59 UTC
49 points
18 comments · 7 min read · LW link
(arxiv.org)

Can We Trust the Judge? A novel method of Modelling Human Bias and Systematic Error in Debate-Based Scalable Oversight

Andreea Zaman · 19 Jul 2025 21:44 UTC
1 point
0 comments · 7 min read · LW link

Reinforcement Learning from Information Bazaar Feedback, and other uses of information markets

Abhimanyu Pallavi Sudhir · 16 Sep 2024 1:04 UTC
5 points
2 comments · 5 min read · LW link

Human-AI Complementarity: A Goal for Amplified Oversight

24 Dec 2024 9:57 UTC
27 points
4 comments · 1 min read · LW link
(deepmindsafetyresearch.medium.com)

An artistic illustration of Scalable Oversight - “A world apart, neither gods nor mortals”

Marius Adrian Nicoară · 16 Apr 2025 12:41 UTC
1 point
0 comments · 1 min read · LW link

Automated monitoring systems

hiki_t · 28 Nov 2024 18:54 UTC
1 point
0 comments · 2 min read · LW link

[Question] Is weak-to-strong generalization an alignment technique?

cloud · 31 Jan 2025 7:13 UTC
22 points
1 comment · 2 min read · LW link

Introducing the Wisdom Forcing Function™: An Innovation Dividend from Dialectical Alignment

CarlosArleo · 5 Oct 2025 20:13 UTC
1 point
0 comments · 1 min read · LW link

[Research Note] Optimizing The Final Output Can Obfuscate CoT

30 Jul 2025 21:26 UTC
197 points
22 comments · 6 min read · LW link