Member of the OpenAI scalable alignment team
William_S (William Saunders)
Could I put in a request to see a brain dump from Eliezer of ways to gain dignity points?
Jan Leike has written about inner alignment here https://aligned.substack.com/p/inner-alignment. (I’m at OpenAI; imo, I’m not sure this will work in the worst case, and I’m hoping we can come up with a more robust plan.)
From my current understanding of Paul’s IDA approach, I think there are two different senses in which corrigibility can be thought about with regard to IDA, each with a different level of guarantee, captured by the following three properties:
1. On average, the reward function incentivizes behaviour which competes effectively and gives the user effective control.
2. There do not exist inputs on which the policy chooses an action because it is bad, or the value function outputs a high value because the prior behaviour was bad. (Or else the policy on its own will generate bad consequences.)
3. The reward function never gives a behaviour a higher reward because it is bad. (Or else the test-time optimisation by MCTS can generate bad behaviour.) For example, if the AI deludes the human operator so that the operator can’t interfere with the AI’s behaviour, that behaviour can’t receive a higher reward even if it ultimately allows the AI to make more money.
Property 1 deals with “consequence corrigibility” (competence at producing actions that will produce outcomes in the world we would describe as corrigible).
Properties 2 & 3 deal with corrigibility in terms of “intent corrigibility” (guaranteeing that the system does not optimise for bad outcomes). This does not cover the agent causing bad outcomes in the world through incompetence, only the agent deliberately trying to produce bad outcomes.
I think IDA doesn’t require or claim worst-case guarantees on the task of “consequence corrigibility” (and that this is an impossible goal for bounded reasoners).
I think that average-case good performance on “consequence corrigibility” is claimed by IDA, but only as a subset of general competence.
I think that providing worst-case guarantees on “intent corrigibility” is required and claimed by IDA.
Roughly, I think that:
Versions of IDA that allow the overseer nodes more information could be generally competent (including predicting what behaviour could be corrigible), but could fail to be “intent corrigible”
Versions of IDA that allow the overseer nodes only a highly restricted set of queries could be “intent corrigible” but fail to be generally competent, and hence not be “consequence corrigible”
Standard ML approaches will, at some level of optimisation power, fail to behave “intent corrigibly” (even if you train them to be “consequence corrigible”)
The question I’m uncertain about is whether there’s a middle point in tradeoff space where both properties are sufficiently satisfied to produce good outcomes.
Do you agree or disagree with how I’ve broken down corrigibility claims for IDA, and which claims do you think your argument bears on?
Maybe a useful way to get feedback on how good you are at this would be to try making predictions based on your experience with language models:
without looking at the results or running systematic experiments on the dataset, predict which tasks on BIG-bench will be doable
make bets of the form “we’ll reach X% performance on task A before we reach Y% performance on task B”
predict for some prompt what percentage of samples will satisfy some property, then take a large number of samples and rate them
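The third kind of prediction can be scored mechanically once the samples are rated. A minimal sketch, where `sample` and `has_property` are hypothetical stand-ins for drawing a model sample and rating it:

```python
import random

def calibration_check(predicted_frac, sample, has_property, n=1000, seed=0):
    """Score a prediction of the form 'X% of samples satisfy the property'.

    sample: hypothetical stand-in for drawing one model sample (takes an RNG).
    has_property: hypothetical stand-in for the human rating of a sample.
    Returns the empirical fraction and a squared-error (Brier-style) score,
    lower is better.
    """
    rng = random.Random(seed)
    hits = sum(has_property(sample(rng)) for _ in range(n))
    empirical = hits / n
    return empirical, (predicted_frac - empirical) ** 2
```

For example, predicting 30% for a stub sampler whose samples really do satisfy the property about 30% of the time should give a score near zero.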
Evaluation assistance as mentioned in the post on AI-assisted human feedback could help people avoid being fooled (e.g. in debate where the opponent can point out how you’re being fooled). It’s still an open question how well that will work in practice and how quickly it will Goodhart (these techniques should fail on some things, as discussed in the ELK report), but it seems possible that models will be helpful enough on alignment before they Goodhart.
The worry I’d have about this interpretability direction is that we become very good at telling stories about what 95% of the weights in neural networks do, but the remaining 5% hides some important stuff, which could end up including things like mesa-optimizers or deception. Do you have thoughts on that?
Submission: low-bandwidth oracle
Plan Criticism: Given plan to build an aligned AI, put together a list of possible lines of thought to think about problems with the plan (open questions, possible failure modes, criticisms, etc.). Ask the oracle to pick one of these lines of thought, pick another line of thought at random, and spend the next time period X thinking about both, judge which line of thought was more useful to think about (where lines of thought that spot some fatal missed problem are judged to be very useful) and reward the oracle if its suggestion was picked.
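The reward scheme above can be sketched as a single round of the protocol (with `judge_usefulness` standing in for the human judgement made after time period X):

```python
import random

def plan_criticism_round(lines_of_thought, oracle_pick_index, judge_usefulness):
    """One round of the low-bandwidth oracle protocol sketched above.

    lines_of_thought: candidate criticisms / open questions about the plan.
    oracle_pick_index: the oracle's chosen line (its only output).
    judge_usefulness: hypothetical stand-in for the human's judgement of how
        useful a line of thought was after spending time period X on it.
    Returns 1 (reward the oracle) if its suggestion was judged more useful
    than a randomly chosen alternative, else 0.
    """
    oracle_line = lines_of_thought[oracle_pick_index]
    # Pick a different line of thought at random as the comparison baseline.
    alternatives = [l for i, l in enumerate(lines_of_thought)
                    if i != oracle_pick_index]
    random_line = random.choice(alternatives)
    # Think about both for time period X, then judge which was more useful.
    return 1 if judge_usefulness(oracle_line) > judge_usefulness(random_line) else 0
```

The oracle's output bandwidth is just the index of one line of thought, which is what keeps this in the low-bandwidth regime.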
It might flip gradually if it felt a divide-and-conquer strategy was useful, pretending to be nice to some people while not being nice to others. I don’t think this is likely. Otherwise, it seems very useful to conceal your intent until the point where you can win.
You define robustness to scaling down as “a solution to alignment keeps working if the AI is not optimal or perfect”, but for interpretability you talk about “our interpretability is merely good or great, but doesn’t capture everything relevant to alignment”, which seems to be about the alignment approach/our understanding being flawed, not the AI. I can imagine techniques being robust to imperfect AI, but I find it harder to imagine how any alignment approach could be robust if the approach, or our implementation of it, is itself flawed. Do you have any example of this?
One way to try to measure capability robustness separately from alignment robustness off the training distribution of some system would be to:
use an inverse reinforcement learning algorithm to infer the reward function of the off-distribution behaviour
train a new system to do as well on the reward function as the original system
measure the number of training steps needed to reach this point for the new system.
This would let you make comparisons between different systems as to which was more capability robust.
Maybe there’s a version that could train the new system using behavioural cloning, but it’s less clear how you measure when you’re as competent as the original agent (maybe using a discriminator?)
The reason for trying this is to have a measure of competence that is less dependent on human judgement and closer to the system’s ontology and capabilities.
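The three steps above could be wrapped in a small comparison loop. A toy sketch, where `inferred_reward`, `train_step`, and the initial parameters are hypothetical stand-ins for a real IRL pipeline and learner:

```python
def capability_robustness_steps(inferred_reward, original_return,
                                train_step, init_params, max_steps=10_000):
    """Count training steps for a fresh system to match the original
    system's performance on the IRL-inferred reward.

    inferred_reward: stand-in for the reward function inferred by IRL
        from the original system's off-distribution behaviour.
    original_return: the original system's score under that reward.
    train_step: stand-in for one optimisation step on the new system.
    Returns the step count at which the new system matches the original,
    or None if it never does within the budget; lower counts suggest the
    original behaviour is easier to recover, i.e. more capability-robust.
    """
    params = init_params
    for step in range(1, max_steps + 1):
        params = train_step(params)
        if inferred_reward(params) >= original_return:
            return step  # new system is now as good as the original
    return None  # did not match within the training budget
```

With a toy quadratic reward and a fixed-size training step this runs end to end; comparing the returned step counts across systems gives the relative measure described above.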
Among people I’ve had significant online discussions with, your writings on alignment tend to be the hardest to understand and easiest to misunderstand.
Additionally, I think that there are ways to misunderstand the IDA approach that leave out significant parts of the complexity (ie. IDA based off of humans thinking for a day with unrestricted input, without doing the hard work of trying to understand corrigibility and meta-philosophy beforehand), but can seem to be plausible things to talk about in terms of “solving the AI alignment problem” if one hasn’t understood the more subtle problems that would occur. It’s then easy to miss the problems and feel optimistic about IDA working while underestimating the amount of hard philosophical work that needs to be done, or to incorrectly attack the approach for missing the problems completely.
(I think that these simpler versions of IDA might be worth thinking about as a plausible fallback plan if no other alignment approach is ready in time, but only if they are restricted in terms of accomplishing specific tasks to stabilise the world, restricted in how far the amplification is taken, replaced with something better as soon as possible, etc. I also think that working on simple versions of IDA can help make progress on issues that would be required for fully scalable IDA, ie. the experiments that Ought is running.)
Google’s Meena (2.6 billion parameters, February 2020) creates original puns in “Conversation D”, and I think “Cross-turn Repetition Example 2” is absurdist comedy, but maybe more as a result of the way the model fails.
Conversation D
Human: I like how cute cows are
Meena: Ah, interesting. You like how they are cute, right?
Human: yeah, and they’re realy smart
Meena: I heard they go to college
Human: cows go to college?
Meena: I heard that a cow went to Harvard.
Human: what did the cow study?
Meena: Bovine sciences.
Human: do horses go to Harvard?
Meena: Horses go to Hayvard.
Human: that’s a pretty good joke, I feel like you led me into it
Meena: You were trying to steer it elsewhere, I can see it.
I think there are situations where you can still have subproblems where the output of the subproblem is long. A contrived example: suppose you have a problem where you want to calculate XOR(f(a), f(b)), where f(a) and f(b) are long strings. It seems reasonable to decompose into x=f(a), y=f(b), z=XOR(x, y), despite x and y being long, because there’s a simple way to combine them.
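The contrived example above as code (with `f` left as an arbitrary function producing long bit strings):

```python
def xor_bits(x, y):
    # The cheap combination step: bitwise XOR of two equal-length bit strings.
    return ''.join('1' if a != b else '0' for a, b in zip(x, y))

def solve_decomposed(f, a, b):
    """Compute XOR(f(a), f(b)) via the decomposition x=f(a), y=f(b),
    z=XOR(x, y). The intermediate outputs x and y may be long, but the
    way they are combined is simple, so the decomposition is still cheap."""
    x = f(a)               # subproblem 1 (long output)
    y = f(b)               # subproblem 2 (long output)
    return xor_bits(x, y)  # simple combination despite long intermediates
```

For instance, with a hypothetical `f` that returns an 8-bit binary encoding, the two long subproblem outputs combine in a single cheap pass.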
If we had an AI system that could work on “making progress on a problem for an hour”, then write down a complete description of everything it had figured out and pass that to another AI system, I’d count that as dividing the problem into subproblems, just in a way that’s probably inefficient.
I’d evaluate decompositions into subproblems by something like the total cost of solving a problem by dividing it into subproblems. Some decompositions would be efficient and others inefficient; sometimes this would be because the output is large, but in other cases it could be because it takes a long time to write the input, or because there’s a lot of work repeated between subproblems.
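A toy cost model along these lines (all three cost functions are hypothetical stand-ins for measured costs):

```python
def decomposition_cost(subproblems, input_cost, solve_cost, combine_cost):
    """Total cost of solving a problem via a given decomposition:
    writing each subproblem's input, solving each subproblem, and
    combining the outputs. Repeated work between subproblems shows up
    as the same cost paid once per subproblem that incurs it."""
    total = sum(input_cost(s) + solve_cost(s) for s in subproblems)
    return total + combine_cost(subproblems)
```

Comparing this total across candidate decompositions of the same problem is the efficiency comparison described above.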
AI systems end up controlled by a group of humans representing a small range of human values (ie. an ideological or religious group that imposes values on everyone else). While not caused only by AI design, it is possible that design decisions could impact the likelihood of this scenario (ie. at what point are values loaded into the system/how many people’s values are loaded into the system), and is relevant for overall strategy.
One important dimension to consider is how hard it is to solve philosophical problems well enough to have a pretty good future (which includes avoiding bad futures). It could be the case that this is not so hard, but fully resolving questions so we could produce an optimal future is very hard or impossible. It feels like this argument implicitly relies on assuming that “solve philosophical problems well enough to have a pretty good future” is hard (ie. takes thousands of millions of years in scenario 4) - can you provide further clarification on whether/why you think that is the case?
Slightly disappointed that this isn’t continuing (though I didn’t submit to the prize, I submitted to Paul Christiano’s call for possible problems with his approach, which was similarly structured). I was hoping that once I got further into my PhD, I’d have some more things worth writing up, and the recognition/a bit of prize money would provide some extra motivation to get them out the door.
What do you feel is the limiting resource that keeps this from being useful to continue in its current form?
Shorter name candidates:
Inductively Aligned AI Development
Inductively Aligned AI Assistants
I would see the benefit of humans vs. algorithms as being that giving a human a bunch of natural language instructions would be much easier (but harder to verify) than writing down a formal algorithm. Also, the training could just cover how to avoid taking incorrigible actions, and the Overseer could still use their judgement about how to perform competently within the space of corrigible outputs.
Paul, to what degree do you think your approach will scale indefinitely while maintaining corrigibility vs. just thinking that it will scale while maintaining corrigibility to the point where we “get our house in order”? (I feel like this would help me in understanding the importance of particular objections, though objections relevant to both scenarios are probably still relevant).
Christmas.