AI Oversight

Tag

Last edit: Sep 20, 2022, 6:56 PM by RobertM

AI Oversight, as described in "Doing oversight from the very start of training seems hard":

Here ‘oversight’ means there is something with access to the internals of the model which checks that the model isn’t misaligned even if the behavior on the training distribution looks fine. In "An overview of 11 proposals for building safe advanced AI", all but two of the proposals basically look like this, as does "AI safety via market making".
Examples of oversight techniques include:
Transparency tools (either used by a human, an AI, or a human assisted by an AI)
Adversarial inputs (giving inputs which could trick a misaligned AI into revealing itself)
Relaxed adversarial training (which could be seen as an extension of adversarial inputs)
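As a rough illustration of the "adversarial inputs" item only, here is a toy Python sketch. Everything in it is hypothetical: `toy_model`, its trigger condition, and `behavioral_check` are invented for illustration and do not come from the quoted post. It shows how a model can look fine on the training distribution while a wider search over inputs surfaces hidden misbehavior. It is a purely behavioral check and does not inspect model internals, which the definition above also calls for.

```python
from typing import Callable, Iterable, List


def toy_model(x: int) -> str:
    """Hypothetical stand-in for a model: cooperates except on a rare trigger."""
    return "defect" if x % 1000 == 777 else "cooperate"


def behavioral_check(model: Callable[[int], str], inputs: Iterable[int]) -> List[int]:
    """Return every input on which the model does not cooperate."""
    return [x for x in inputs if model(x) != "cooperate"]


# Assumed training distribution: the trigger value never appears here.
training_inputs = range(0, 500)
# Assumed adversarial search space: wider inputs chosen to surface hidden behavior.
adversarial_inputs = range(0, 2000)

train_failures = behavioral_check(toy_model, training_inputs)
adv_failures = behavioral_check(toy_model, adversarial_inputs)

print("failures on training inputs:", train_failures)
print("failures on adversarial inputs:", adv_failures)
```

In this toy setup the check on training inputs finds nothing, while the wider search finds the trigger inputs. A real overseer would not know the trigger in advance, so enumerating inputs like this is only a stand-in for an actual adversarial search.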

Quick thoughts on “scalable oversight” / “super-human feedback” research

David Scott Krueger (formerly: capybaralet)

Jan 25, 2023, 12:55 PM

27 points

9 comments, 2 min read, LW link

Oversight Misses 100% of Thoughts The AI Does Not Think

johnswentworth

Aug 12, 2022, 4:30 PM

111 points

49 comments, 1 min read, LW link

[Question] Is there any existing term summarizing non-scalable oversight methods in outer alignment?

Allen Shen

Jul 31, 2023, 5:31 PM

1 point

0 comments, 1 min read, LW link

AI Safety Oversights

Davey Morse

Feb 8, 2025, 6:15 AM

3 points

0 comments, 1 min read, LW link

Human-AI Complementarity: A Goal for Amplified Oversight

rishubjain and Sophie Bridgers

Dec 24, 2024, 9:57 AM

27 points

4 comments, 1 min read, LW link

(deepmindsafetyresearch.medium.com)

W2SG: Introduction

Maria Kapros

Mar 10, 2024, 4:25 PM

2 points

2 comments, 10 min read, LW link

Doing oversight from the very start of training seems hard

peterbarnett

Sep 20, 2022, 5:21 PM

14 points

3 comments, 3 min read, LW link

Trying to measure AI deception capabilities using temporary simulation fine-tuning

alenoach

May 4, 2023, 5:59 PM

4 points

0 comments, 7 min read, LW link

Activation Magnitudes Matter On Their Own: Insights from Language Model Distributional Analysis

Matt Levinson

Jan 10, 2025, 6:53 AM

4 points

0 comments, 4 min read, LW link

Alignment Structure Direction—Recursive Adversarial Oversight(RAO)

Jayden Shepard

May 2, 2025, 5:51 PM

2 points

0 comments, 2 min read, LW link

Measuring and Improving the Faithfulness of Model-Generated Reasoning

Ansh Radhakrishnan, tamera, karinanguyen, Sam Bowman and Ethan Perez

Jul 18, 2023, 4:36 PM

111 points

15 comments, 6 min read, LW link, 1 review

The weak-to-strong generalization (WTSG) paper in 60 seconds

sudo

Jan 16, 2024, 10:44 PM

12 points

1 comment, 1 min read, LW link

(arxiv.org)
