William_S

Karma: 1,870

I worked at OpenAI for three years, from 2021 to 2024, on the Alignment team, which eventually became the Superalignment team. I worked on scalable oversight, as part of the team developing critiques as a technique for using language models to spot mistakes in other language models. I then worked to refine an idea from Nick Cammarata into a method for using language models to generate explanations for features in language models. I was then promoted to managing a team of four people that worked on understanding language model features in context, leading to the release of an open-source “transformer debugger” tool.
I resigned from OpenAI on February 15, 2024.

Principles for the AGI Race

William_S

30 Aug 2024 14:29 UTC

248 points

17 comments

18 min read

Transformer Circuit Faithfulness Metrics Are Not Robust

Joseph Miller, bilalchughtai and William_S

12 Jul 2024 3:47 UTC

104 points

5 comments

7 min read

(arxiv.org)

William_S’s Shortform

William_S

22 Mar 2023 18:13 UTC

5 points

50 comments

1 min read

Thoughts on refusing harmful requests to large language models

William_S

19 Jan 2023 19:49 UTC

32 points

4 comments

2 min read

Prize for Alignment Research Tasks

stuhlmueller and William_S

29 Apr 2022 8:57 UTC

64 points

38 comments

10 min read

[Question] Is there an intuitive way to explain how much better superforecasters are than regular forecasters?

William_S

19 Feb 2020 1:07 UTC

16 points

5 comments

1 min read

Machine Learning Projects on IDA

Owain_Evans, William_S and stuhlmueller

24 Jun 2019 18:38 UTC

49 points

3 comments

2 min read

Reinforcement Learning in the Iterated Amplification Framework

William_S

9 Feb 2019 0:56 UTC

25 points

12 comments

4 min read

HCH is not just Mechanical Turk

William_S

9 Feb 2019 0:46 UTC

42 points

6 comments

3 min read

Amplification Discussion Notes

William_S

1 Jun 2018 19:03 UTC

17 points

3 comments

3 min read

Understanding Iterated Distillation and Amplification: Claims and Oversight

William_S

17 Apr 2018 22:36 UTC

36 points

30 comments

9 min read

Improbable Oversight, An Attempt at Informed Oversight

William_S

24 May 2017 17:43 UTC

3 points

9 comments

1 min read

(william-r-s.github.io)

Informed Oversight through Generalizing Explanations

William_S

24 May 2017 17:43 UTC

2 points

0 comments

1 min read

(william-r-s.github.io)

Proposal for an Implementable Toy Model of Informed Oversight

William_S

24 May 2017 17:43 UTC

2 points

1 comment

1 min read

(william-r-s.github.io)