William_S (William Saunders)

Karma: 809

I worked at OpenAI for three years, from 2021 to 2024, on the Alignment team, which eventually became the Superalignment team. I worked on scalable oversight as part of the team developing critiques, a technique for using language models to spot mistakes in other language models. I then worked to refine an idea from Nick Cammarata into a method for using language models to generate explanations for features in language models. I was then promoted to manage a team of four people working to understand language model features in context, leading to the release of an open source “transformer debugger” tool.
I resigned from OpenAI on February 15, 2024.

Proposal for an Implementable Toy Model of Informed Oversight

William_S · 24 May 2017 17:43 UTC
2 points
1 comment · 1 min read · LW link
(william-r-s.github.io)

Informed Oversight through Generalizing Explanations

William_S · 24 May 2017 17:43 UTC
2 points
0 comments · 1 min read · LW link
(william-r-s.github.io)

Improbable Oversight, An Attempt at Informed Oversight

William_S · 24 May 2017 17:43 UTC
3 points
9 comments · 1 min read · LW link
(william-r-s.github.io)

Understanding Iterated Distillation and Amplification: Claims and Oversight

William_S · 17 Apr 2018 22:36 UTC
34 points
30 comments · 9 min read · LW link

Amplification Discussion Notes

William_S · 1 Jun 2018 19:03 UTC
17 points
3 comments · 3 min read · LW link

HCH is not just Mechanical Turk

William_S · 9 Feb 2019 0:46 UTC
42 points
6 comments · 3 min read · LW link

Reinforcement Learning in the Iterated Amplification Framework

William_S · 9 Feb 2019 0:56 UTC
25 points
12 comments · 4 min read · LW link

[Question] Is there an intuitive way to explain how much better superforecasters are than regular forecasters?

William_S · 19 Feb 2020 1:07 UTC
16 points
5 comments · 1 min read · LW link

Thoughts on refusing harmful requests to large language models

William_S · 19 Jan 2023 19:49 UTC
31 points
4 comments · 2 min read · LW link