Scalable Oversight
Tag
Last edit: Apr 18, 2024, 7:57 PM by Raemon
Inference-Only Debate Experiments Using Math Problems
Arjun Panickssery, Abhimanyu Pallavi Sudhir and JacksonKaunismaa
Aug 6, 2024, 5:44 PM; 31 points; 0 comments; 2 min read
Scaling Laws for Scalable Oversight
Subhash Kantamneni, Josh Engels, David Baek and Max Tegmark
Apr 30, 2025, 12:13 PM; 27 points; 0 comments; 9 min read
Discriminating Behaviorally Identical Classifiers: a model problem for applying interpretability to scalable oversight
Sam Marks
Apr 18, 2024, 4:17 PM; 113 points; 10 comments; 12 min read
AXRP Episode 35 - Peter Hase on LLM Beliefs and Easy-to-Hard Generalization
DanielFilan
Aug 24, 2024, 10:30 PM; 21 points; 0 comments; 74 min read
Scalable oversight as a quantitative rather than qualitative problem
Buck
Jul 6, 2024, 5:42 PM; 85 points; 11 comments; 3 min read
Gradient Routing: Masking Gradients to Localize Computation in Neural Networks
cloud, Jacob G-W, Evzen, Joseph Miller and TurnTrout
Dec 6, 2024, 10:19 PM; 165 points; 12 comments; 11 min read (arxiv.org)
NYU Code Debates Update/Postmortem
David Rein
May 24, 2024, 4:08 PM; 27 points; 4 comments; 10 min read
Evaluating Oversight Robustness with Incentivized Reward Hacking
Yoav, Juan V, julianjm and McKennaFitzgerald
Apr 20, 2025, 4:53 PM; 7 points; 2 comments; 15 min read
On scalable oversight with weak LLMs judging strong LLMs
zac_kenton, Noah Siegel, janos, Jonah Brown-Cohen, Samuel Albanie, David Lindner and Rohin Shah
Jul 8, 2024, 8:59 AM; 49 points; 18 comments; 7 min read (arxiv.org)
Reinforcement Learning from Information Bazaar Feedback, and other uses of information markets
Abhimanyu Pallavi Sudhir
Sep 16, 2024, 1:04 AM; 5 points; 1 comment; 5 min read
Human-AI Complementarity: A Goal for Amplified Oversight
rishubjain and Sophie Bridgers
Dec 24, 2024, 9:57 AM; 27 points; 4 comments; 1 min read (deepmindsafetyresearch.medium.com)
An artistic illustration of Scalable Oversight—“A world apart, neither gods nor mortals”
Marius Adrian Nicoară
Apr 16, 2025, 12:41 PM; 1 point; 0 comments; 1 min read
Automated monitoring systems
hiki_t
Nov 28, 2024, 6:54 PM; 1 point; 0 comments; 2 min read
[Question] Is weak-to-strong generalization an alignment technique?
cloud
Jan 31, 2025, 7:13 AM; 22 points; 1 comment; 2 min read