I worked at OpenAI for three years, from 2021 to 2024, on the Alignment team, which eventually became the Superalignment team. I worked on scalable oversight, as part of the team developing critiques as a technique for using language models to spot mistakes in other language models. I then worked to refine an idea from Nick Cammarata into a method for using language models to generate explanations for features in language models. I was then promoted to managing a team of 4 people that worked on trying to understand language model features in context, leading to the release of an open source “transformer debugger” tool.
I resigned from OpenAI on February 15, 2024.
William_S (William Saunders)
Evidence could look like:
1. Someone was in a position of trust where they had to make a judgement about OpenAI.
2. They said something bland and inoffensive about OpenAI.
3. Later, you independently find that they likely knew about something bad which they likely weren’t saying because of the nondisparagement agreement (rather than an ordinary confidentiality agreement).
This requires some model of “this specific statement was influenced by the agreement” instead of just “you never said anything bad about OpenAI because you never gave opinions on OpenAI”.
I think one should require this kind of positive evidence before calling it a “serious breach of trust”, but people can make their own judgement about where that bar should be.
I imagine many of the people going into leadership positions were prepared to ignore the contract, or maybe even forgot about the nondisparagement clause altogether. The clause is also open to more avenues of legal attack if it’s enforced against someone who takes another position that requires disparagement (e.g. if it’s argued to be a restriction on engaging in business). And if any individual involved divested themselves of equity before taking up another position, there would be fewer ways for the company to retaliate against them. I don’t think it’s fair to view this as a serious breach of trust on the part of any individual, without clear evidence that it impacted their decisions or communication.
But it is fair to find it concerning that this situation could arise with nobody noticing, and to try to design defenses to prevent this or similar things from happening in the future, e.g. a clear statement that people going into independent leadership positions must be free of conflicts of interest, including legal obligations, as well as a consistent divestment policy (though that creates its own weird incentives).
There is no Golden Gate Bridge division
Would make a great SCP
No comment.
Re: hidden messages in neuron explanations, yes, it seems like a possible problem. One way to try to avoid this is to train the simulator model to imitate what a human would say given the explanation. A human would ignore a coded message, so the trained simulator model should ignore it too. (This maybe doesn’t account for adversarial attacks on the trained simulator model, so it might need ordinary adversarial-robustness methods.)
Does seem like if you ever catch your interpretability assistant trying to hide messages, you should stop and try to figure out what is going on, and that might be sufficient evidence of deception.
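To make the scheme above concrete, here is a minimal sketch of explanation scoring via a simulator. All names are illustrative and the `simulate` function is a stand-in for a language model trained to imitate human activation predictions; this is not the real pipeline's API. The key property is that the score depends only on how well the simulated activations match the true ones, so a simulator that imitates humans (and therefore ignores coded messages) gives hidden content no reward.

```python
# Minimal sketch of simulator-based explanation scoring (hypothetical
# interfaces; "simulate" stands in for a model trained to imitate
# human activation predictions given only the explanation text).

from typing import Callable, List


def pearson(xs: List[float], ys: List[float]) -> float:
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs) ** 0.5
    vy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (vx * vy) if vx > 0 and vy > 0 else 0.0


def score_explanation(
    explanation: str,
    tokens: List[str],
    true_activations: List[float],
    simulate: Callable[[str, List[str]], List[float]],
) -> float:
    """Score an explanation by how well activations simulated from the
    explanation alone correlate with the neuron's true activations.
    A simulator trained to imitate *human* predictions should be
    insensitive to any hidden message embedded in the explanation."""
    simulated = simulate(explanation, tokens)
    return pearson(simulated, true_activations)
```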
From discussion with Logan Riggs (Eleuther), who worked on the tuned lens: the tuned lens suggests that the residual stream goes through some linear transformations between layers, so the streams at different layers aren’t directly comparable. This would interfere with a couple of methods for trying to understand neurons based on weights: 1) the embedding space view, and 2) calculating virtual weights between neurons in different layers.
However, we could try correcting for this using the transformations learned by the tuned lens to translate between the residual stream at different layers, and maybe this would make these methods more effective. By default, I think the tuned lens learns only the transformation needed to predict the output token, but the method could be adapted to retrodict the input token from each layer as well; we’d need both. Code for the tuned lens is at https://github.com/alignmentresearch/tuned-lens
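A rough sketch of the proposed correction, under the assumption that each layer's residual stream relates to a shared basis by a learned linear map (the `lens_i` / `lens_j` matrices here are illustrative, not the tuned-lens package's API). Vectors written into the stream transform by the map itself, while reading directions (covectors) transform by the inverse transpose, so both are moved into the shared basis before taking the dot product:

```python
import numpy as np

# Sketch under assumptions: each layer's residual stream maps to a
# shared basis by a learned linear transformation, as the tuned lens
# suggests. lens_i / lens_j are hypothetical per-layer maps.

def corrected_virtual_weight(
    w_out_i: np.ndarray,  # (d_model,) write direction of a neuron in layer i
    w_in_j: np.ndarray,   # (d_model,) read direction of a neuron in layer j > i
    lens_i: np.ndarray,   # (d_model, d_model) layer-i -> shared-basis map
    lens_j: np.ndarray,   # (d_model, d_model) layer-j -> shared-basis map
) -> float:
    """The naive virtual weight, w_in_j @ w_out_i, assumes both layers
    use the same residual-stream basis. Here the written vector is
    mapped into the shared basis with lens_i, and the reading covector
    with the inverse transpose of lens_j, so the dot product compares
    directions in a consistent basis."""
    written = lens_i @ w_out_i
    reader = np.linalg.solve(lens_j.T, w_in_j)  # covectors use inverse-transpose
    return float(reader @ written)
```

With identity lenses this reduces to the ordinary virtual weight, and it is invariant to a common rescaling of the basis, which is the consistency property we want.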
William_S’s Shortform
Thoughts on refusing harmful requests to large language models
(I work at OpenAI). Is the main thing you think has the effect of safetywashing here the claim that the misconceptions are common? Like if the post was “some misconceptions I’ve encountered about OpenAI” it would mostly not have that effect? (Point 2 was edited to clarify that it wasn’t a full account of the Anthropic split.)
Jan Leike has written about inner alignment here https://aligned.substack.com/p/inner-alignment. (I’m at OpenAI, imo I’m not sure if this will work in the worst case and I’m hoping we can come up with a more robust plan)
So I do think you can get feedback on the related question of “can you write a critique of this action that makes us think we wouldn’t be happy with the outcomes” as you can give a reward of 1 if you’re unhappy with the outcomes after seeing the critique, 0 otherwise.
And this alone isn’t sufficient, e.g. maybe then the AI system says things about good actions that make us think we wouldn’t be happy with the outcome, which is then where you’d need to get into recursive evaluation or debate or something. But this feels like “hard but potentially tractable problem” and not “100% doomed”. Or at least the failure story needs to involve more steps like “sure critiques will tell us that the fusion power generator will lead to everyone dying, but we ignore that because it can write a critique of any action that makes us believe it’s bad” or “the consequences are so complicated the system can’t explain them to us in the critique and get high reward for it”
ETA: So I’m assuming the story for feedback on reliably doing things in the world you’re referring to is something like “we give the AI feedback by letting it build fusion generators and then giving it a score based on how much power it generates” or something like that, and I agree this is easier than “are we actually happy with the outcome”
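The reward signal described above can be sketched in a few lines; the function names and the human stand-in are illustrative, not a real training pipeline. It also makes the noted failure mode visible: nothing in this signal alone distinguishes a critique of a genuinely bad action from one that makes a good action look bad.

```python
# Toy sketch of the critique reward signal: the critiquer is rewarded
# when its critique of a proposed action changes the human's judgement
# to "unhappy with the outcome". All names are hypothetical.

from typing import Callable


def critique_reward(
    action: str,
    critique: str,
    unhappy_after_critique: Callable[[str, str], bool],
) -> int:
    """Reward 1 if the human, after reading the critique of the
    proposed action, judges they would be unhappy with the outcome;
    0 otherwise. This trains the critiquer to surface real problems,
    but can also reward critiques that make good actions look bad,
    which is where recursive evaluation or debate would come in."""
    return 1 if unhappy_after_critique(action, critique) else 0
```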
If we can’t get the AI to answer something like “If we take the action you just proposed, will we be happy with the outcomes?”, why can we get it to also answer the question of “how do you design a fusion power generator?” to get a fusion power generator that does anything reliably in the world (including having consequences that kill us), rather than just getting out something that looks to us like a plan for a fusion generator but doesn’t actually work?
Image link is broken
Yep, that clarifies.
You define robustness to scaling down as “a solution to alignment keeps working if the AI is not optimal or perfect”, but for interpretability you talk about “our interpretability is merely good or great, but doesn’t capture everything relevant to alignment”, which seems to be about the alignment approach/our understanding being flawed, not the AI. I can imagine techniques being robust to an imperfect AI, but find it harder to imagine how any alignment approach could be robust if the approach, or our implementation of it, is itself flawed. Do you have any example of this?
Summary for 8 “Can we take a deceptively aligned model and train away its deception?” seems a little harder than what we actually need, right? We could prevent a model from being deceptive rather than trying to undo arbitrary deception (e.g. if we could prevent all precursors)
Do you think we could basically go 1->4 and 2->5 if we could train a helper network to behaviourally clone humans using transparency tools and run the helper network over the entire network/training process? Or if we do critique style training (RL reward some helper model with access to the main model weights if it produces evidence of the property we don’t want the main network to have)?
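The "run the helper network over the entire training process" idea can be sketched as a simple audit loop. Everything here is hypothetical: `inspect` stands in for a transparency tool producing a report on a checkpoint, and `helper_flags` stands in for a model behaviourally cloned from human judgements on such reports.

```python
# Illustrative sketch (all names hypothetical): sweep a helper model,
# behaviourally cloned from human judgements made with transparency
# tools, over every checkpoint of the main model's training run.

from typing import Callable, List


def audit_training_run(
    checkpoints: List[str],
    inspect: Callable[[str], str],        # transparency tool: checkpoint -> report
    helper_flags: Callable[[str], bool],  # cloned human judgement on a report
) -> List[str]:
    """Return the checkpoints whose transparency reports the helper
    flags as showing the unwanted property (e.g. a precursor of
    deception), so training can be stopped and investigated there
    rather than trying to train deception away after the fact."""
    return [ckpt for ckpt in checkpoints if helper_flags(inspect(ckpt))]
```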
Language in the emails included:
“If you executed the Agreement, we write to notify you that OpenAI does not intend to enforce the Agreement”
I assume this also communicates that OpenAI doesn’t intend to enforce the self-confidentiality clause in the agreement