AI Oversight

TagLast edit: 20 Sep 2022 18:56 UTC by RobertM

AI Oversight as described by Doing oversight from the very start of training seems hard:

Here ‘oversight’ means there is something with access to the internals of the model which checks that the model isn’t misaligned even if the behavior on the training distribution looks fine. In An overview of 11 proposals for building safe advanced AI, all but two of the proposals basically look like this, as does AI safety via market making.

Examples of oversight techniques include:

  • Transparency tools (either used by a human, an AI, or a human assisted by an AI)

  • Adversarial inputs (giving inputs which could trick a misaligned AI into revealing itself)

  • Relaxed adversarial training (which could be seen as an extension of adversarial inputs)

Over­sight Misses 100% of Thoughts The AI Does Not Think

johnswentworth12 Aug 2022 16:30 UTC
97 points
50 comments1 min readLW link

Quick thoughts on “scal­able over­sight” /​ “su­per-hu­man feed­back” research

David Scott Krueger (formerly: capybaralet)25 Jan 2023 12:55 UTC
26 points
9 comments2 min readLW link

Try­ing to mea­sure AI de­cep­tion ca­pa­bil­ities us­ing tem­po­rary simu­la­tion fine-tuning

alenoach4 May 2023 17:59 UTC
4 points
0 comments7 min readLW link

The weak-to-strong gen­er­al­iza­tion (WTSG) pa­per in 60 seconds

sudo16 Jan 2024 22:44 UTC
12 points
1 comment1 min readLW link

Mea­sur­ing and Im­prov­ing the Faith­ful­ness of Model-Gen­er­ated Rea­son­ing

18 Jul 2023 16:36 UTC
109 points
13 comments6 min readLW link

Do­ing over­sight from the very start of train­ing seems hard

peterbarnett20 Sep 2022 17:21 UTC
14 points
3 comments3 min readLW link

[Question] Is there any ex­ist­ing term sum­ma­riz­ing non-scal­able over­sight meth­ods in outer al­ign­ment?

Allen Shen31 Jul 2023 17:31 UTC
1 point
0 comments1 min readLW link

W2SG: Introduction

Maria Kapros10 Mar 2024 16:25 UTC
0 points
2 comments10 min readLW link
No comments.