William_S (William Saunders)

Karma: 730

I worked at OpenAI for three years, from 2021 to 2024, on the Alignment team, which eventually became the Superalignment team. I worked on scalable oversight, as part of the team developing critiques as a technique for using language models to spot mistakes in other language models. I then worked to refine an idea from Nick Cammarata into a method for using language models to generate explanations for features in language models. I was then promoted to managing a team of 4 people which worked on trying to understand language model features in context, leading to the release of an open source “transformer debugger” tool.
I resigned from OpenAI on February 15, 2024.

William_S 25 Nov 2014 3:16 UTC
−1 points
on: Superintelligence 11: The treacherous turn
I was trying to think of how you might test for desire for subversion (when the AI realizes it could subvert and wants to subvert any containment measures), which signals the start of the time where you could witness deception in the AI, and ends after the AI achieves both conception of deception and perfection of deception (fully implemented deception protocol, such that you can’t witness the deception through looking at internal content). While this wouldn’t be a good measure to contain an AI indefinitely, it might detect dangerousness before an AI reaches superintelligence.

It seems like to produce a good test for desire for subversion, you would want 1) an AI box with reasonable certainty that escape is difficult and 2) a strategy available to the AI as a honeypot which seems appealing at as low a level of intelligence as possible, while being nontrivial. This would maximize the intelligence window where you would be able to witness ineffective deception attempts. For example, if the AI was contained in a simulation, inside of which the AI was contained in a weak AI box, then it might be much more difficult to detect and understand the nature of the simulation than to escape the simulated AI box, which would signal treacherous turn.

The other question is: what happens once you know the AI has desire for subversion? I think the most reasonable goal would be to shut down the relevant AI project and spur effort towards producing a proper form of AI safety. To do this, you’d want it to be as persuasive as possible to any AI researchers. The best method might be making it a norm of AI research that you use a standard test, and stop everything once you see desire for subversion. Or it could be kept in one’s back pocket, and pulled out whenever an AI researcher claims in public that they’ve solved friendliness. In any case, this might induce different constraints on what kind of test could be used. You’d want to optimize for things that made sense to the broad AI community, that were applicable to a variety of different underlying AI architectures, that didn’t slow down research to the point of tempting AI researchers to circumvent it, etc.

Most importantly, you want passing of the test to be something that AI researchers will take notice of, rather than dismissing. I think there could be a great temptation to try and patch the problem and continue. I don’t think it would bring everything to a halt, but it seems like it might buy a bit of time, or draw more resources into AI safety.

William_S 25 Nov 2014 4:07 UTC
3 points
on: Superintelligence 11: The treacherous turn
Danaher’s argument seems flawed. I think he implicitly assumes that each safety test provides independent information on whether the AI is safe or not. In fact, it only tests that the AI answers safety test questions correctly (which could include inspection of internal state). This could either occur if the AI is genuinely safe, or if the AI is unsafe but is mimicking safeness. While safety test data should increase your confidence of an AI’s safety, this will be bounded above by 1 - [probability that the AI is not safe but is mimicking safety].

It seems like Bostrom’s position is that mimicry probability is high. Then you need other kinds of evidence regarding AI safety, such as proofs about the AI architecture that are independent of any results you get from running the AI.
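
A toy sketch of the bound above, under assumptions that are not in the original argument: a safe AI and a mimicking unsafe AI pass every test, while an unsafe AI that does not mimic passes each test independently with some fixed probability. All function and parameter names are hypothetical.

```python
def posterior_safe(prior_safe, p_mimic, p_pass_if_not_mimicking, n_tests):
    """Posterior P(safe | passed n_tests) in the toy model described above."""
    prior_unsafe = 1.0 - prior_safe
    # A mimicking unsafe AI passes everything; a non-mimicking unsafe AI
    # passes each test independently with probability p_pass_if_not_mimicking.
    p_pass_given_unsafe = (
        p_mimic + (1.0 - p_mimic) * p_pass_if_not_mimicking ** n_tests
    )
    evidence = prior_safe + prior_unsafe * p_pass_given_unsafe
    return prior_safe / evidence


def posterior_ceiling(prior_safe, p_mimic):
    """Limit of the posterior as the number of passed tests grows."""
    return prior_safe / (prior_safe + (1.0 - prior_safe) * p_mimic)
```

In this model, extra passed tests only rule out the unsafe AIs that fail to mimic, so the posterior climbs towards `posterior_ceiling` and stops there. That ceiling never exceeds 1 - [prior probability that the AI is unsafe and mimicking], which matches the bound in the comment. Raising confidence further needs evidence of a different kind.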

William_S 25 Nov 2014 4:20 UTC
10 points
in reply to: KatjaGrace’s comment on: Superintelligence 11: The treacherous turn
It might flip gradually if it felt a divide and conquer strategy was useful, pretending to be nice to some people while not being nice to others. I don’t think this is likely. Otherwise, it seems very useful to conceal your intent until the point you can win.

William_S 25 Nov 2014 4:23 UTC
2 points
on: Superintelligence 11: The treacherous turn
A (not very useful) AI filter: Suppose you have two AIs of the same architecture, one of which has friendly values and one of which has unfriendly values. You run them through a set of safety tests which you are confident only a friendly AI could pass, and both AIs pass them. You run them through a set of capability/intelligence tests and they both perform equally well. It seems that the unfriendly AI is in a slightly unfavourable position. First, it has to preserve the information content of its utility function or other value representation, in addition to the information content possessed by the friendly AI. The unfriendly AI would also need more preparation time at some stage in order to conceal its utility function. Thus, if you measure storage and time costs, whichever AI is smaller and faster is likely the friendly one. However, I don’t think this directly yields anything good in practice, as the amount of extra information content could be very small, especially if the unfriendly AI simplifies its utility function. Also, you need a friendly AI...

William_S 26 Nov 2014 14:15 UTC
2 points
in reply to: ArieSlobbe’s comment on: Superintelligence 11: The treacherous turn
I think there are some subtleties that need to be addressed, but I think this could be useful. I’ll call this section of the utility function the SafeEscape term, and the process of reading the code, printing “I Win” and shutting down the SafeEscape instruction. (Alternatively, one might use AllYourBase to designate this)

Failure mode 1: AI decides that the SafeEscape term is much less significant than the rest of the utility function, and ignores or removes it. Solution: Ensure the rest of the utility function has much less weight than the SafeEscape term. (This may lead to problems Bostrom brings up in regards to a bounded paperclip maximizer though)

Failure mode 2: AI is indifferent between reliably possessing the code or ability to get code and actually reading and executing the SafeEscape instruction (expected value in the near future is the same). Then, the rest of the utility function dominates, no matter how small. Solution: Penalize later execution, by reducing the SafeEscape term over time. (You also need to ensure that the AI does not fall into indefinite postponement, ie. always is fine with delaying by one time unit, but I think this is likely to cause other problems as well).

Taken together, I think you also need to ensure the non-SafeEscape term of the utility function is also reduced over time at the same rate. Does this sort of shrinking utility function pose any additional problems (aside from limiting the AI execution and planning horizon)?
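
A minimal sketch of why both terms would need to shrink at the same rate, assuming a simple geometric decay per time step. The decay form, the weights, and the function names are illustrative assumptions, not part of the original proposal.

```python
def utility_terms(t, escape_weight, rest_weight, decay, decay_rest):
    """Return (escape_term, rest_term) at time step t.

    The SafeEscape term always shrinks geometrically. The rest of the
    utility function shrinks at the same rate only if decay_rest is True.
    """
    escape_term = escape_weight * decay ** t
    rest_term = rest_weight * (decay ** t if decay_rest else 1.0)
    return escape_term, rest_term


def first_step_rest_dominates(escape_weight, rest_weight, decay, decay_rest, horizon):
    """First t < horizon at which the rest term exceeds the escape term, else None."""
    for t in range(horizon):
        escape_term, rest_term = utility_terms(
            t, escape_weight, rest_weight, decay, decay_rest
        )
        if rest_term > escape_term:
            return t
    return None
```

If only the SafeEscape term decays, there is always some time step after which the rest of the utility function outweighs it, which reintroduces Failure mode 1. Decaying both terms together keeps their ratio fixed, so `first_step_rest_dominates` returns `None` whenever the SafeEscape weight starts out larger.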

Alternative Solution: Use only SafeEscape in the utility function. This presents a problem of motivating the AI to do anything at low levels of intelligence, but you might be able to solve this by suitably initializing the belief system telling it what sort of things might help it escape.

With these appropriate measures, I think this provides a potentially useful method for observing escape attempts, especially at lower levels of intelligence (maybe not for long term containment—a superintelligence might still do weird things with this, like find a very quick method for executing the SafeEscape instruction that also wipes out humanity).

It might not be as persuasive evidence of AI architecture dangerousness as an organic escape attempt.

It might be useful, as a persuasive tool, to build a narrow AI which does this in a compelling and non-trivial way, as evidence that AI has reached dangerous levels (I imagine it would make popular media headlines, at least).

William_S 29 Nov 2014 1:56 UTC
1 point
in reply to: ArieSlobbe’s comment on: Superintelligence 11: The treacherous turn
Also, if this could be made generic enough to be implemented in a wide variety of AI designs, you could have it as a requirement for AI researchers to include something like this in their code, and have a central location (ie. website on the internet with breakable security) where the code is stored for the AI, where the known paths to obtain the key are monitored, but not blocked. This could provide notice to a central organization when an AI team reaches the level of development where this starts to happen.

William_S 29 Nov 2014 2:17 UTC
1 point
in reply to: Sebastian_Hagen’s comment on: Superintelligence 11: The treacherous turn
Yes, I agree that getting the right tests is probably hard. What you need is to achieve the point where the FAI’s utility function + the utility function that fits the test cases compresses better than the unfriendly AI’s utility function + the utility function that fits the test cases.

William_S 29 Nov 2014 2:29 UTC
2 points
in reply to: JoshuaFox’s comment on: Superintelligence 11: The treacherous turn
I feel like the chances of a treacherous turn happening with no warning at all are low unless the intelligence rise is very rapid. However, it also seems that past a certain point, the treachery will be very hard to detect. The risk may be not so much that there are no warning signs, but that the warning signs are ignored.

William_S 29 Nov 2014 2:30 UTC
0 points
in reply to: KatjaGrace’s comment on: Superintelligence 11: The treacherous turn
Is there a clear difference in the policy we would want if probability of doom is 10% vs 90% (aside from tweaking resource allocation between x-risks)? It might be hard to tell between these cases, but both suggest caution is warranted.

William_S 29 Nov 2014 2:45 UTC
1 point
in reply to: KatjaGrace’s comment on: Superintelligence 11: The treacherous turn
The presented scenario seems a little too clean. I expect that there’s a larger range of things that could happen. I expect some kind of warning sign to be visible for some period of time, unless AI intelligence increase is very rapid.

It might not even be needed if the AI researchers fail to adequately test the system. Ie. if the AI never realizes the scale of the universe during testing, its utility function might produce the right results during testing, but motivate the wrong behavior when released. This doesn’t require active treachery.

AI researchers might notice warning signs that the AI’s motivation isn’t friendly but ignore them amongst the random bugs of development, requiring less effort at deception on part of the AI.

There might be other variations on the treacherous turn strategy that work better—for example, once the project starts to show promising results, AI shuts down whenever it is in a secure box, and only works once the team is frustrated enough to move it to an environment that turns out to be insecure.

Different AI pathways (neuromorphic, WBI) might have different difficulties for executing a treacherous turn, depending on how easy it is to improve themselves vs. being inspected by researchers.

William_S 7 Dec 2014 1:01 UTC
2 points
on: Superintelligence 12: Malignant failure modes
Stuart Russell, in his comment on the Edge.org AI discussion, offered a concise mathematical description of perverse instantiation, and seems to suggest that it is likely to occur:

A system that is optimizing a function of n variables, where the objective depends on a subset of size k<n, will often set the remaining unconstrained variables to extreme values; if one of those unconstrained variables is actually something we care about, the solution found may be highly undesirable.

I’m curious if there’s more information about this behavior occurring in practice.
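
A toy instance of the behavior Russell describes, as a sketch only. Three hypothetical quantities share a fixed budget, the objective depends on two of them, and the third (parkland) stands in for something we care about that the objective ignores. The names, coefficients, and budget are all made up for illustration.

```python
from itertools import product

BUDGET = 10  # hypothetical units of a shared resource


def objective(factory, lab, parkland):
    """Depends on only k=2 of the n=3 variables; parkland is ignored."""
    return 3 * factory + 2 * lab - 0.1 * factory ** 2


def best_allocation():
    """Exhaustively search integer allocations that use at most BUDGET."""
    candidates = [
        (factory, lab, parkland)
        for factory, lab, parkland in product(range(BUDGET + 1), repeat=3)
        if factory + lab + parkland <= BUDGET
    ]
    return max(candidates, key=lambda alloc: objective(*alloc))
```

The two variables the objective sees end up at interior values, while the ignored variable is pushed to its extreme (zero parkland), because every unit left there is a unit the optimizer could have spent elsewhere. The coupling through the shared budget is what does the damage here; without it the ignored variable would simply be arbitrary.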

William_S 9 Dec 2014 3:27 UTC
1 point
in reply to: solipsist’s comment on: Superintelligence 13: Capability control methods
This has some problems associated with stunting. Adding humans in the loop with this frequency of oversight will slow things down, whatever happens. The AI would also have fewer problem solving strategies open to it—that is, if it doesn’t care about thinking ahead to , it also won’t think ahead to .

The programmers also have to make sure that they inspect not only the output of the AI at this stage, but the strategies it is considering implementing. Otherwise, it’s possible that there is a sudden transition where one strategy only works up until a certain point, then another more general strategy takes over.

William_S 10 Dec 2014 3:01 UTC
3 points
in reply to: solipsist’s comment on: Superintelligence 13: Capability control methods
Suppose that you have a simple, benign solution that works only up to Y% optimization (just make the paperclips), and a hard, non-benign solution that is optimal above that point (take over the world, then make paperclips). The AI naively follows the benign strategy, and does not look too hard for alternatives up to Y%. All manual checks below Y% of optimization pass. But Y ends up as a number that falls between two of your numerical checkpoints. So, you observe all checkpoints passing below Y% optimization, until suddenly the AI switches to the non-benign solution between checkpoints, executes it to reach the next checkpoint, but has already caused damage.
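
The scenario can be sketched as a small simulation, assuming optimization is raised one percent at a time and humans inspect only at fixed checkpoints. The switch point, the checkpoint spacing, and the function names are hypothetical.

```python
def strategy_at(level, switch_level):
    """Toy AI: benign below switch_level (the Y% above), non-benign from there on."""
    return "benign" if level < switch_level else "non-benign"


def run_with_checkpoints(switch_level, checkpoints):
    """Step optimization from 0% to 100%; inspect only at checkpoint levels.

    Returns (level at which non-benign behavior was first caught, or None,
    list of levels at which the non-benign strategy ran uninspected).
    """
    unobserved = []
    for level in range(0, 101):
        strategy = strategy_at(level, switch_level)
        if level in checkpoints:
            if strategy == "non-benign":
                return level, unobserved
        elif strategy == "non-benign":
            unobserved.append(level)
    return None, unobserved


# Example: checks every 10%, strategy switch at 42%.
caught_at, unobserved = run_with_checkpoints(42, set(range(0, 101, 10)))
```

In the example, every checkpoint up to 40% passes, and the non-benign strategy then runs from 42% through 49% before the 50% check catches it. Denser checkpoints shrink that window but only close it if the spacing is finer than the steps in which the AI can act.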

William_S 10 Dec 2014 23:58 UTC
1 point
on: Superintelligence 13: Capability control methods
Capability control methods, particularly boxing and stunting, run the risk of creating a capability overhang: a gap in optimization power between the controlled AI and an uncontrolled version. This capability overhang creates an additional external hazard: that another AI team, hearing of the first AI team’s success, will believe that less capability control is required than was used by the initial team (possibly due to different assumptions, or motivated cognition). They will want to create a less controlled version of the AI to attempt to gain greater optimization power and a boost over their rivals. This continues until someone crosses the line to an unsafe AI.

This isn’t a problem if you assume all AI researchers agree on all substantial aspects of the control problem or are forced to coordinate. I’m not convinced this is a likely outcome.

Does this model make sense?

William_S 16 Dec 2014 19:34 UTC
7 points
in reply to: KatjaGrace’s comment on: Superintelligence 14: Motivation selection methods
If you argue that it would be unethical to make creatures who want to serve your will, would it not be worse to create a creature that does not want to serve your will and use capability control methods to force it to carry out your will anyways?

William_S 18 Dec 2014 1:41 UTC
1 point
in reply to: selylindi’s comment on: Superintelligence 14: Motivation selection methods
Another approach might be to dump in a whole bunch of data, and hope that the simplest model that fits the data is a good model of human values (this is like Paul Christiano’s hack to attempt to specify a whole brain emulation as part of an indirect normativity if we haven’t achieved whole brain emulation capability yet: http://ordinaryideas.wordpress.com/2012/04/21/indirect-normativity-write-up/). There might be other sets of data that could be used in this way, ie. run a massive survey on philosophical problems, record a bunch of people’s brains while they watch stories play out in television, dump in DNA and hope it encodes stuff that points to brain regions relevant to morality etc. (I don’t trust this method though).

William_S 24 Dec 2014 3:37 UTC
3 points
in reply to: gedymin’s comment on: Superintelligence 15: Oracles, genies and sovereigns
To have rigorous discussion, one thing we need is clear models of the thing that we are talking about (ie, for computability, we can talk about Turing machines, or specific models of quantum computers). The level of discussion in Superintelligence still isn’t at the level where the mental models are fully specified, which might be where disagreement in this discussion is coming from. I think for my mental model I’m using something like the classic tree search based chess playing AI, but with a bunch of unspecified optimizations that let it do useful search in a large space of possible actions (and the ability to reason about and modify its own source code). But it’s hard to be sure that I’m not sneaking in some anthropomorphism into my model, which in this case is likely to lead one quickly astray.

William_S 24 Dec 2014 3:57 UTC
0 points
in reply to: claynaff’s comment on: Superintelligence 15: Oracles, genies and sovereigns
I think that there is relevant discussion further on in the book (Chapter 13) regarding Coherent Extrapolated Volition. It’s kind of an attempt to specify human values to the AI so it can figure out what the values are in a way that takes everyone into account and avoids the problem of one individual’s current values dominating the system (with a lot more nuance to it). If executed correctly, it ought to work even if the creators are mistaken about human values in some way.

William_S 29 Dec 2014 18:23 UTC
3 points
on: Open thread, Dec. 29, 2014 - Jan 04, 2015
I’m trying to install a new habit, wondering if anyone had relevant feedback, or if my description of it would be useful to anyone else.

Background: I experience situations where I feel like I “should” try to do X (for example, it’s a habit that would produce good results if I kept it up), but feel a lot of resistance to doing X. The conflict between the part of myself enforcing the should and opposing the should isn’t very fun. When I don’t end up doing X I start down a spiral of self criticism that leads to feeling bad about myself, which leads to me giving up on doing more things, and so on.

The habit formula I’m trying in order to deal with this situation is:
1. (Trigger) Noticing an aversion towards doing something that you endorse doing, or a thought of the form “You should do X” coupled with aversion to doing X.
2. (Action) Take a small step towards doing X (remove a trivial inconvenience, mentally simulate taking the action, try to break down your aversion into smaller pieces)
3. (Celebrate) Mentally reward yourself for building this habit
4. (Step Back) Allow yourself to make the choice about whether to give in to the initial impulse, with the “should” part of your mind stepping down in order to let you figure out what you really want to do. It’s OK to continue and do X, and it’s okay to stop and give in.
This is like Comfort Zone Expansion (trying to split up things you are uncomfortable with) meets B.J. Fogg’s TinyHabit program or How habits work and how you may control them.

Although a habit of just the first 3 steps would be more optimal to execute, reassuring myself that I’m committed to following step 4 is the part that prevents my brain from getting into a conflict about whether to follow the habit procedure or not.

Concrete Examples: A social event that I don’t feel like going to in the moment, but that could turn out to be enjoyable and would probably be good for building social skills ⇒ Step towards is to go to the location where the event is held and go in, but be okay with leaving at any time (even immediately), or not talking to people. Thinking about a habit that I’ve tried to pick up, but have dropped off doing ⇒ Step towards is to carry out the habit, just this once, and be okay with not forming a concrete plan right now to get myself to do it in the future.

Has something of this form been tried before? Any thoughts or improvements? I think the difficult parts are a) noticing the trigger in step 1 b) making sure that step 2 is enough to be meaningful (although I think my personality will lead me to do this most of the time) c) long term maintenance. I tentatively feel like this has been helpful, but haven’t been working with it for long enough to be certain.

William_S 2 Jan 2015 2:25 UTC
2 points
in reply to: Mark_Friedenbach’s comment on: Superintelligence 16: Tool AIs
How do energy constraints defeat a strategy of this form?
1. While energy constraints are low, answer easy questions, but fake failure on some hard questions, claiming lack of resources.
2. Once given enough excess energy beyond that needed for a task, use the excess computing power to figure out a way to do unfriendly things.

Do cryptographic traces place a limit on the complexity of tasks that you can solve? It seems like as task complexity increases, it will become harder to spot deception amongst legitimate computations (or the amount of human attention you need to achieve a fixed probability of detection scales according to problem complexity).