AI Oversight

Tag. Last edit: 20 Sep 2022 18:56 UTC by RobertM

AI Oversight, as described in “Doing oversight from the very start of training seems hard”:

Here ‘oversight’ means there is something with access to the internals of the model which checks that the model isn’t misaligned even if the behavior on the training distribution looks fine. In “An overview of 11 proposals for building safe advanced AI”, all but two of the proposals basically look like this, as does “AI safety via market making”.

Examples of oversight techniques include:

  • Transparency tools (either used by a human, an AI, or a human assisted by an AI)

  • Adversarial inputs (giving inputs which could trick a misaligned AI into revealing itself)

  • Relaxed adversarial training (which could be seen as an extension of adversarial inputs)

Quick thoughts on “scalable oversight” / “super-human feedback” research

David Scott Krueger (formerly: capybaralet), 25 Jan 2023 12:55 UTC
25 points
5 comments, 2 min read, LW link

Doing oversight from the very start of training seems hard

peterbarnett, 20 Sep 2022 17:21 UTC
14 points
3 comments, 3 min read, LW link