William_S (William Saunders)

Karma: 809

I worked at OpenAI for three years, from 2021 to 2024, on the Alignment team, which eventually became the Superalignment team. I worked on scalable oversight as part of the team developing critiques, a technique for using language models to spot mistakes in other language models. I then worked to refine an idea from Nick Cammarata into a method for using language models to generate explanations for features in language models. I was then promoted to manage a team of four people working to understand language model features in context, leading to the release of an open source “transformer debugger” tool.
I resigned from OpenAI on February 15, 2024.

Proposal for an Implementable Toy Model of Informed Oversight

William_S · 24 May 2017 17:43 UTC
2 points
1 comment · 1 min read · LW link
(william-r-s.github.io)

Informed Oversight through Generalizing Explanations

William_S · 24 May 2017 17:43 UTC
2 points
0 comments · 1 min read · LW link
(william-r-s.github.io)

Improbable Oversight, An Attempt at Informed Oversight

William_S · 24 May 2017 17:43 UTC
3 points
9 comments · 1 min read · LW link
(william-r-s.github.io)

Understanding Iterated Distillation and Amplification: Claims and Oversight

William_S · 17 Apr 2018 22:36 UTC
34 points
30 comments · 9 min read · LW link

Amplification Discussion Notes

William_S · 1 Jun 2018 19:03 UTC
17 points
3 comments · 3 min read · LW link

HCH is not just Mechanical Turk

William_S · 9 Feb 2019 0:46 UTC
42 points
6 comments · 3 min read · LW link

Reinforcement Learning in the Iterated Amplification Framework

William_S · 9 Feb 2019 0:56 UTC
25 points
12 comments · 4 min read · LW link

[Question] Is there an intuitive way to explain how much better superforecasters are than regular forecasters?

William_S · 19 Feb 2020 1:07 UTC
16 points
5 comments · 1 min read · LW link

Thoughts on refusing harmful requests to large language models

William_S · 19 Jan 2023 19:49 UTC
31 points
4 comments · 2 min read · LW link