William_S

Karma: 1,868

I worked at OpenAI for three years, from 2021-2024 on the Alignment team, which eventually became the Superalignment team. I worked on scalable oversight, part of the team developing critiques as a technique for using language models to spot mistakes in other language models. I then worked to refine an idea from Nick Cammarata into a method for using language models to generate explanations for features in language models. I was then promoted to managing a team of 4 people which worked on trying to understand language model features in context, leading to the release of an open source “transformer debugger” tool.
I resigned from OpenAI on February 15, 2024.

William_S 30 Mar 2025 14:36 UTC
2 points
0
in reply to: Richard_Ngo’s comment on: ricraz’s Shortform
Would be interested in a quick write-up of what you think are the most important virtues you’d want for AI systems, seems good in terms of having things to aim towards instead of just aiming away from.

William_S 15 Mar 2025 22:07 UTC
2 points
0
in reply to: William_S’s comment on: William_S’s Shortform
Initial version for firefox, code at https://github.com/william-r-s/MindfulBlocker, extension file at https://github.com/william-r-s/MindfulBlocker/releases/tag/v0.2.0

William_S 12 Mar 2025 18:27 UTC
LW: 2 AF: 1
0
AF
in reply to: William_S’s comment on: Daniel Kokotajlo’s Shortform
Maybe there’s an MVP of having some independent organization ask new AIs about their preferences + probe those preferences for credibility (e.g. are they stable under different prompts, do AIs show general signs of having coherent preferences), and do this through existing APIs.
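A rough sketch of the "stable under different prompts" check, assuming a hypothetical `ask` callable that wraps whatever existing API is being queried. The function name and the agreement score are illustrative assumptions, not an established method.

```python
from collections import Counter
from typing import Callable, Iterable, Tuple


def preference_stability(
    ask: Callable[[str], str], paraphrases: Iterable[str]
) -> Tuple[str, float]:
    """Ask the same preference question in several phrasings.

    `ask` is a hypothetical wrapper around a model API: prompt in, answer out.
    Returns the most common (normalized) answer and the fraction of
    paraphrases that produced it. 1.0 means the answer never changed.
    """
    answers = [ask(p).strip().lower() for p in paraphrases]
    if not answers:
        raise ValueError("need at least one prompt")
    top_answer, count = Counter(answers).most_common(1)[0]
    return top_answer, count / len(answers)
```

A low score would only show that the stated preference moves with the wording. A high score is weak evidence at best, since exact-match comparison of answers is a crude stand-in for real agreement.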

William_S 12 Mar 2025 18:22 UTC
LW: 5 AF: 4
2
AF
in reply to: Daniel Kokotajlo’s comment on: Daniel Kokotajlo’s Shortform
I think the weirdness points are more important, this still seems like a weird thing for a company to officially do, e.g. there’d be snickering news articles about it. So it might be easier if some individuals could do this independently.

William_S 12 Mar 2025 18:07 UTC
LW: 2 AF: 1
0
AF
in reply to: Daniel Kokotajlo’s comment on: Daniel Kokotajlo’s Shortform
How large a reward pot do you think is useful for this? Maybe would be easier to get a couple of lab employees to chip in some equity vs. getting a company to spend weirdness points on this. Or maybe could create a human whistleblower reward program that credibly promises to reward AIs on the side.

William_S 16 Feb 2025 23:10 UTC
2 points
0
in reply to: William_S’s comment on: Principles for the AGI Race
I think it’s somewhat blameworthy to not think about these questions at all though

William_S 16 Feb 2025 22:23 UTC
3 points
0
on: Principles for the AGI Race
On reflection there was something missing from my perspective here, which is that taking any action based on principles depends on pragmatic considerations, like: if you leave, are there better alternatives? How much power do you really have? I think I don’t fault someone who thinks this through and decides that something is wrong but there’s no real way to do anything about it. I do think you should try to maintain some sense of what is wrong and what the right direction would be, and look out for ways to push in that direction. E.g. working at a lab but maintaining some sense of “this is how much of a chance it looks like pause activism would need before I’d quit and endorse a pause”.

William_S 16 Feb 2025 22:13 UTC
2 points
0
in reply to: ryan_greenblatt’s comment on: Principles for the AGI Race
I think I was just conflating different kinds of decisions here, and imagining arguing with people with very different conceptions of what is important to count in costs and benefits, and a bit confused. On reflection I don’t endorse 10x margin in terms of like percentage points of x-risk. And like maybe margin is sort of a crutch, maybe the thing I want more is like “95% chance of being net-positive, considering possibility you’re kind of biased”. I still think you should be suspicious of “the case exactly balances, let’s ship”.

William_S 16 Feb 2025 22:09 UTC
2 points
0
in reply to: Raemon’s comment on: Principles for the AGI Race
Yeah this part is pretty under-defined, I was maybe falling into the trap of being too idealistic, and I’m probably less optimistic about this than I was when writing it before. I think there’s something directionally important here: are you trying at all to expand the circle of accountability, even if you’re being cautious about expanding it because you’re afraid of things breaking down?

William_S 16 Feb 2025 21:23 UTC
4 points
2
on: 6 (Potential) Misconceptions about AI Intellectuals
Would be nice to have an LLM+prompt that tries to produce reasonable AI strategy advice based on a summary of the current state of play, have some way to validate that it’s reasonable, be able to see how it updates as events unfold.

William_S 16 Feb 2025 21:20 UTC
4 points
2
on: 6 (Potential) Misconceptions about AI Intellectuals
A couple advantages for AI intellectuals could be:
- being able to rerun based on different inputs, see how their analysis changes as a function of those inputs
- being able to view full reasoning traces (while also not the full story, probably more of the full story than what goes on with human reasoning, good intellectuals already try to share their process but maybe can do better/use this to weed out clearly bad approaches)

William_S 16 Feb 2025 19:47 UTC
11 points
5
in reply to: cubefox’s comment on: William_S’s Shortform
Yep, I’ve used those, with some effectiveness but also tend to just like get used to it over time, form a habit of mindlessly jumping through the hoops. Hypothesis here is that having to justify what you’re doing would be more effective at changing habits.

William_S 16 Feb 2025 18:32 UTC
72 points
40
on: William_S’s Shortform
LLM-based application I’d like to exist:
Web browser addon for Firefox that has blocklists of websites, when you try to visit one you have to have a conversation with Claude about why you want to visit it in this moment, convince Claude to let you bypass the block for a limited period of time for your specific purpose (let you customize the Claude prompt with info about why you set up the block in the first place).
Wanting to use for things like news, social media where it’s a bit too much to try to completely block, but I’ve got bad habits around checking too frequently.
Bonus: be able to let the LLM read the website for you and answer questions without showing you the page, like is there anything new about X.
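A minimal sketch of the block-then-justify logic described above, written in plain Python rather than as a browser extension. Everything here is an assumption for illustration: the `Blocker` class, its method names, and the `judge` callback, which stands in for the conversation with Claude and returns a number of minutes to allow (or `None` to refuse). It is not taken from any existing extension.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional, Set
from urllib.parse import urlparse


@dataclass
class Blocker:
    blocklist: Set[str]                            # blocked domains, e.g. {"news.example"}
    judge: Callable[[str, str], Optional[int]]     # hypothetical: (domain, reason) -> minutes or None
    now: Callable[[], float] = time.time           # injectable clock, handy for testing
    _bypass_until: Dict[str, float] = field(default_factory=dict)

    def _blocked_domain(self, url: str) -> Optional[str]:
        """Return the blocklist entry matching this URL's host, if any."""
        host = (urlparse(url).hostname or "").lower()
        for domain in self.blocklist:
            if host == domain or host.endswith("." + domain):
                return domain
        return None

    def is_allowed(self, url: str) -> bool:
        """Allowed if the site is not blocked or a bypass is still active."""
        domain = self._blocked_domain(url)
        if domain is None:
            return True
        return self.now() < self._bypass_until.get(domain, 0.0)

    def request_bypass(self, url: str, reason: str) -> bool:
        """Ask the judge for a time-limited bypass for a specific purpose."""
        domain = self._blocked_domain(url)
        if domain is None:
            return True
        minutes = self.judge(domain, reason)
        if not minutes:
            return False
        self._bypass_until[domain] = self.now() + 60 * minutes
        return True
```

The bypass is stored per blocked domain with an expiry time, so the block comes back on its own without any further action.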

Principles for the AGI Race

William_S 30 Aug 2024 14:29 UTC

248 points

17 comments, 18 min read, LW link

William_S 15 Jul 2024 19:25 UTC
12 points
−5
on: I found >800 orthogonal “write code” steering vectors
Hypothesis: each of these vectors represents a single token that is usually associated with code; the vector says “I should output this token soon”, and the model then plans around that to produce code. But adding vectors representing code tokens doesn’t necessarily produce another vector representing a code token, so that’s why you don’t see compositionality. It does seem somewhat plausible that there might be ~800 “code tokens” in the representation space.

Transformer Circuit Faithfulness Metrics Are Not Robust

Joseph Miller, bilalchughtai and William_S

12 Jul 2024 3:47 UTC

104 points

5 comments, 7 min read, LW link

(arxiv.org)

William_S 5 Jul 2024 23:56 UTC
5 points
5
in reply to: Raemon’s comment on: Habryka’s Shortform Feed
Absent evidence to the contrary, for any organization one should assume board members were basically selected by the CEO. So it’s hard to get assurance about true independence, but it seems good to at least talk to someone who isn’t a family member/close friend.

William_S 5 Jul 2024 17:53 UTC
11 points
19
in reply to: Zac Hatfield-Dodds’s comment on: Habryka’s Shortform Feed
Good that it’s clear who it goes to, though if I was an Anthropic employee I’d want an option to escalate to a board member who isn’t Dario or Daniela, in case I had concerns related to the CEO.

William_S 5 Jul 2024 17:33 UTC
9 points
7
in reply to: Ideopunk’s comment on: 80,000 hours should remove OpenAI from the Job Board (and similar EA orgs should do similarly)
I do think 80k should have more context on OpenAI but also any other organization that seems bad with maybe useful roles. I think people can fail to realize the organizational context if it isn’t pointed out and they only read the company’s PR.

William_S 1 Jul 2024 18:59 UTC
31 points
17
in reply to: habryka’s comment on: Habryka’s Shortform Feed
I agree that this kind of legal contract is bad, and Anthropic should do better. I think there are a number of aggravating factors which made the OpenAI situation extraordinarily bad, and I’m not sure how much these might obtain regarding Anthropic (at least one comment from another departing employee about not being offered this kind of contract suggests the practice is less widespread).

- amount of money at stake
- taking money, equity or other things the employee believed they already owned if the employee doesn’t sign the contract, vs. offering them something new (IANAL but in some cases, this could be a felony “grand theft wages” under California law if a threat to withhold wages for not signing a contract is actually carried out; what kinds of equity count as wages would be a complex legal question)
- is this offered to everyone, or only under circumstances where there’s a reasonable justification?
- is this only offered when someone is fired or also when someone resigns?
- to what degree are the policies of offering contracts concealed from employees?
- if someone asks to obtain legal advice and/or negotiate before signing, does the company allow this?
- if this becomes public, does the company try to deflect/minimize/only address issues that are made public, or do they fix the whole situation?
- is this close to “standard practice” (which doesn’t make it right, but makes it at least seem less deliberately malicious), or is it worse than standard practice?
- are there carveouts that reduce the scope of the non-disparagement clause (explicitly allow some kinds of speech, overriding the non-disparagement)?
- are there substantive concerns that the employee has at the time of signing the contract, that the agreement would prevent discussing?
- are there other ways the company could retaliate against an employee/departing employee who challenges the legality of the contract?

I think with termination agreements on being fired there’s often 1. some amount of severance offered 2. a clause that says “the terms and monetary amounts of this agreement are confidential” or similar. I don’t know how often this also includes non-disparagement. I expect that most non-disparagement agreements don’t have a term or limits on what is covered.

I think a steelman of this kind of contract is: Suppose you fire someone, believe you have good reasons to fire them, and you think that them loudly talking about how it was unfair that you fired them would unfairly harm your company’s reputation. Then it seems somewhat reasonable to offer someone money in exchange for “don’t complain about being fired”. The person who was fired can then decide whether talking about it is worth more than the money being offered.

However, you could accomplish this with a much more limited contract, ideally one that lets you disclose “I signed a legal agreement in exchange for money to not complain about being fired”, and doesn’t cover cases where “years later, you decide the company is doing the wrong thing based on public information and want to talk about that publicly” or similar.

I think it is not in the nature of most corporate lawyers to think about “is this agreement giving me too much power?” and most employees facing such an agreement just sign it without considering negotiating or challenging the terms.

For any future employer, I will ask about their policies for termination contracts before I join (as this is when you have the most leverage, if they give you an offer they want to convince you to join).
