RSS

AI Oversight

TagLast edit: 20 Sep 2022 18:56 UTC by RobertM

AI Oversight as described by Doing oversight from the very start of training seems hard:

Here ‘oversight’ means there is something with access to the internals of the model which checks that the model isn’t misaligned even if the behavior on the training distribution looks fine. In An overview of 11 proposals for building safe advanced AI, all but two of the proposals basically look like this, as does AI safety via market making.

Examples of oversight techniques include:

  • Transparency tools (either used by a human, an AI, or a human assisted by an AI)

  • Adversarial inputs (giving inputs which could trick a misaligned AI into revealing itself)

  • Relaxed adversarial training (which could be seen as an extension of adversarial inputs)

Quick thoughts on “scal­able over­sight” /​ “su­per-hu­man feed­back” research

David Scott Krueger (formerly: capybaralet)25 Jan 2023 12:55 UTC
27 points
9 comments2 min readLW link

No, We’re Not Get­ting Mean­ingful Over­sight of AI

Davidmanheim9 Jul 2025 11:10 UTC
48 points
4 comments1 min readLW link
(arxiv.org)

Over­sight Misses 100% of Thoughts The AI Does Not Think

johnswentworth12 Aug 2022 16:30 UTC
122 points
49 comments1 min readLW link

[Question] Is there any ex­ist­ing term sum­ma­riz­ing non-scal­able over­sight meth­ods in outer al­ign­ment?

Allen Shen31 Jul 2023 17:31 UTC
1 point
0 comments1 min readLW link

AI Safety Oversights

Davey Morse8 Feb 2025 6:15 UTC
3 points
0 comments1 min readLW link

Hu­man-AI Com­ple­men­tar­ity: A Goal for Am­plified Oversight

24 Dec 2024 9:57 UTC
27 points
4 comments1 min readLW link
(deepmindsafetyresearch.medium.com)

W2SG: Introduction

Maria Kapros10 Mar 2024 16:25 UTC
2 points
2 comments10 min readLW link

Do­ing over­sight from the very start of train­ing seems hard

peterbarnett20 Sep 2022 17:21 UTC
14 points
3 comments3 min readLW link

Try­ing to mea­sure AI de­cep­tion ca­pa­bil­ities us­ing tem­po­rary simu­la­tion fine-tuning

alenoach4 May 2023 17:59 UTC
4 points
0 comments7 min readLW link

Ac­ti­va­tion Mag­ni­tudes Mat­ter On Their Own: In­sights from Lan­guage Model Distri­bu­tional Analysis

Matt Levinson10 Jan 2025 6:53 UTC
4 points
0 comments4 min readLW link

Lay­ered Re­ward Mod­ifiers for Trans­par­ent and Self-Cor­rect­ing AI

RyanC5 Nov 2025 3:06 UTC
1 point
0 comments8 min readLW link

Align­ment Struc­ture Direc­tion—Re­cur­sive Ad­ver­sar­ial Over­sight(RAO)

Jayden Shepard2 May 2025 17:51 UTC
2 points
0 comments2 min readLW link

Build­ing Black-box Schem­ing Monitors

29 Jul 2025 17:41 UTC
43 points
18 comments11 min readLW link

Mea­sur­ing and Im­prov­ing the Faith­ful­ness of Model-Gen­er­ated Rea­son­ing

18 Jul 2023 16:36 UTC
111 points
15 comments6 min readLW link1 review

Model Pa­ram­e­ters as a Stegano­graphic Pri­vate Channel

Lennart Finke27 Oct 2025 16:08 UTC
9 points
0 comments5 min readLW link

The weak-to-strong gen­er­al­iza­tion (WTSG) pa­per in 60 seconds

sudo16 Jan 2024 22:44 UTC
12 points
1 comment1 min readLW link
(arxiv.org)
No comments.