William_S 11 Jun 2022 16:23 UTC
LW: 22 AF: 8
on: AGI Ruin: A List of Lethalities
Could I put in a request to see a brain dump from Eliezer of ways to gain dignity points?

Amplification Discussion Notes

William_S 1 Jun 2018 19:03 UTC

17 points

3 comments, 3 min read, LW link

[Question] Is there an intuitive way to explain how much better superforecasters are than regular forecasters?

William_S 19 Feb 2020 1:07 UTC

16 points

5 comments, 1 min read, LW link

William_S 25 Aug 2022 16:50 UTC
13 points
in reply to: Chris_Leong’s comment on: OpenAI’s Alignment Plans
Jan Leike has written about inner alignment here: https://aligned.substack.com/p/inner-alignment. (I’m at OpenAI. In my opinion, it’s unclear whether this will work in the worst case, and I’m hoping we can come up with a more robust plan.)

William_S 3 Apr 2018 1:27 UTC
LW: 13 AF: 4
on: Can corrigibility be learned safely?
From my current understanding of Paul’s IDA approach, I think there are two different senses in which corrigibility can be thought about with regard to IDA, each with a different level of guarantee.
From “An unaligned benchmark”:
1. On average, the reward function incentivizes behaviour which competes effectively and gives the user effective control.
2. There do not exist inputs on which the policy chooses an action because it is bad, or the value function outputs a high reward because the prior behaviour was bad. (Or else the policy on its own will generate bad consequences.)
3. The reward function never gives a behaviour a higher reward because it is bad. (Or else the test-time optimisation by MCTS can generate bad behaviour.) For example, if the AI deludes the human operator so that the operator can’t interfere with the AI’s behaviour, that behaviour can’t receive a higher reward even if it ultimately allows the AI to make more money.
Property 1 is dealing with “consequence corrigibility” (competence at producing actions that will produce outcomes in the world we would describe as corrigible)
Properties 2&3 are dealing with corrigibility in terms of “intent corrigibility” (guaranteeing that the system does not optimise for bad outcomes). This does not cover the agent incompetently causing bad actions in the world, only the agent deliberately trying to produce bad outcomes.
I think IDA doesn’t require or claim worst-case guarantees on the task of “consequence corrigibility” (and that this is an impossible goal for bounded reasoners).
I think that average-case good performance on “consequence corrigibility” is claimed by IDA, but only as a subset of general competence.
I think that providing worst-case guarantees on “intent corrigibility” is required and claimed by IDA.
Roughly, I think that:
- Versions of IDA that allow the overseer nodes more information could be generally competent (including predicting what behaviour could be corrigible), but could fail to be “intent corrigible”
- Versions of IDA that allow the overseer nodes only a highly restricted set of queries could be “intent corrigible” but fail to be generally competent, and hence not be “consequence corrigible”
- Standard ML approaches will, at some level of optimisation power, fail to behave “intent corrigibly” (even if you train them to be “consequence corrigible”)
The question I’m uncertain about is whether there’s a middle point in tradeoff space where both properties are sufficiently satisfied to produce good outcomes.
Do you agree or disagree with how I’ve broken down corrigibility claims for IDA, and which claims do you think your argument bears on?

William_S 6 May 2022 19:50 UTC
LW: 12 AF: 8
on: The case for becoming a black-box investigator of language models
Maybe a useful way to get feedback on how good you are at doing this would be to try making predictions based on your experience with language models:
- without looking at the results or running systematic experiments on the dataset, predict which tasks on BIG-bench will be doable
- make bets of the form “we’ll reach X% performance on task A before we reach Y% performance on task B”
- predict for some prompt what percentage of samples will satisfy some property, then take a large number of samples and then rate them
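
The last suggestion can be scored mechanically once the samples have been rated. Below is a minimal sketch, assuming each sample has already been rated by hand as satisfying the property or not; the function names and the use of a Wilson score interval are illustrative choices, not something specified in the comment.

```python
import math
from typing import Dict, List, Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if n <= 0:
        raise ValueError("need at least one rated sample")
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


def score_prediction(predicted: float, ratings: List[bool]) -> Dict[str, object]:
    """Compare a predicted fraction against hand-rated samples.

    `predicted` is the fraction you guessed before sampling;
    `ratings` holds one True/False rating per sample (hypothetical input format).
    """
    n = len(ratings)
    k = sum(ratings)
    low, high = wilson_interval(k, n)
    observed = k / n
    return {
        "observed": observed,
        "interval": (low, high),
        "abs_error": abs(predicted - observed),
        "within_interval": low <= predicted <= high,
    }
```

Tracking `abs_error` over many prompts gives a rough sense of whether your intuitions about the model are improving.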

William_S 6 Apr 2022 21:45 UTC
LW: 12 AF: 6
in reply to: johnswentworth’s comment on: [Link] A minimal viable product for alignment
Evaluation assistance as mentioned in the post on AI-assisted human feedback could help people avoid being fooled (e.g. in debate where the opponent can point out how you’re being fooled). It’s still an open question how well that will work in practice and how quickly it will Goodhart (these techniques should fail on some things, as discussed in the ELK report), but it seems possible that models will be helpful enough on alignment before they Goodhart.

William_S 10 Mar 2020 21:12 UTC
LW: 11 AF: 5
on: Zoom In: An Introduction to Circuits
The worry I’d have about this interpretability direction is that we become very good at telling stories about what 95% of the weights in neural networks do, but the remaining 5% hides some important stuff, which could end up including things like mesa-optimizers or deception. Do you have thoughts on that?

William_S 1 Jul 2019 16:46 UTC
LW: 11 AF: 6
on: Contest: $1,000 for good questions to ask to an Oracle AI
Submission: low-bandwidth oracle
Plan Criticism: Given a plan to build an aligned AI, put together a list of possible lines of thought for examining problems with the plan (open questions, possible failure modes, criticisms, etc.). Ask the oracle to pick one of these lines of thought, pick another line of thought at random, and spend the next time period X thinking about both. Then judge which line of thought was more useful to think about (where lines of thought that spot some fatal missed problem are judged to be very useful), and reward the oracle if its suggestion was picked.
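
One round of this protocol could be sketched as follows. This is only an illustration under stated assumptions: `oracle_pick` and `judge` are hypothetical callables standing in for the oracle and the human evaluation, the random comparison line is assumed to differ from the oracle's pick, and the reward is assumed to be binary.

```python
import random
from typing import Callable, Sequence


def run_round(
    lines: Sequence[str],
    oracle_pick: Callable[[Sequence[str]], str],
    judge: Callable[[str, str], str],
    rng: random.Random,
) -> float:
    """Run one round and return the oracle's reward (1.0 or 0.0).

    The oracle may only answer with an item from the fixed list,
    which is what keeps the channel low-bandwidth.
    """
    suggested = oracle_pick(lines)
    if suggested not in lines:
        raise ValueError("oracle must choose from the provided list")
    others = [line for line in lines if line != suggested]
    if not others:
        raise ValueError("need at least two distinct lines of thought")
    # Assumption: the comparison line is drawn uniformly from the remaining options.
    baseline = rng.choice(others)
    # The judge stands in for humans spending time period X on both lines
    # and reporting which one was more useful.
    winner = judge(suggested, baseline)
    return 1.0 if winner == suggested else 0.0
```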
What links here?
- Results of $1,000 Oracle contest! by Stuart_Armstrong (17 Jun 2020 17:44 UTC; 60 points)

William_S 25 Nov 2014 4:20 UTC
10 points
in reply to: KatjaGrace’s comment on: Superintelligence 11: The treacherous turn
It might flip gradually if it felt a divide and conquer strategy was useful, pretending to be nice to some people while not being nice to others. I don’t think this is likely. Otherwise, it seems very useful to conceal your intent until the point you can win.

William_S 23 Jul 2022 18:03 UTC
LW: 9 AF: 9
on: Robustness to Scaling Down: More Important Than I Thought
You define robustness to scaling down as “a solution to alignment keeps working if the AI is not optimal or perfect”, but for interpretability you talk about “our interpretability is merely good or great, but doesn’t capture everything relevant to alignment”, which seems to be about the alignment approach or our understanding being flawed, not the AI. I can imagine techniques being robust to imperfect AI, but I find it harder to imagine how any alignment approach could be robust if the approach, or our implementation of it, is itself flawed. Do you have any example of this?

William_S 26 Sep 2019 18:34 UTC
LW: 9 AF: 7
on: 2-D Robustness
One way to try to measure capability robustness separately from alignment robustness off of the training distribution of some system would be to:
- use an inverse reinforcement learning algorithm to infer the reward function of the off-distribution behaviour
- train a new system to do as well on the reward function as the original system
- measure the number of training steps needed to reach this point for the new system.
This would let you make comparisons between different systems as to which was more capability robust.
Maybe there’s a version that could train the new system using behavioural cloning, but it’s less clear how you measure when you’re as competent as the original agent (maybe using a discriminator?)
The reason for trying this is to have a measure of competence that is less dependent on human judgement and closer to the system’s ontology and capabilities.

William_S 4 Apr 2018 15:45 UTC
LW: 9 AF: 5
in reply to: Wei Dai’s comment on: Can corrigibility be learned safely?
Among people I’ve had significant online discussions with, your writings on alignment tend to be the hardest to understand and easiest to misunderstand.
Additionally, I think that there are ways to misunderstand the IDA approach that leave out significant parts of the complexity (i.e. IDA based off of humans thinking for a day with unrestricted input, without doing the hard work of trying to understand corrigibility and meta-philosophy beforehand), but can seem to be plausible things to talk about in terms of “solving the AI alignment problem” if one hasn’t understood the more subtle problems that would occur. It’s then easy to miss the problems and feel optimistic about IDA working while underestimating the amount of hard philosophical work that needs to be done, or to incorrectly attack the approach for missing the problems completely.
(I think that these simpler versions of IDA might be worth thinking about as a plausible fallback plan if no other alignment approach is ready in time, but only if they are restricted to accomplishing specific tasks to stabilise the world, restricted in how far the amplification is taken, replaced with something better as soon as possible, etc. I also think that working on simple versions of IDA can help make progress on issues that would be required for fully scalable IDA, e.g. the experiments that Ought is running.)

William_S 29 May 2020 21:20 UTC
LW: 8 AF: 5
in reply to: Isnasene’s comment on: OpenAI announces GPT-3
Google’s Meena (2.6 billion parameters, February 2020) creates original puns in “Conversation D”, and I think “Cross-turn Repetition Example 2” is absurdist comedy, but maybe more as a result of the way the model fails.
Conversation D
Human: I like how cute cows are
Meena: Ah, interesting. You like how they are cute, right?
Human: yeah, and they’re realy smart
Meena: I heard they go to college
Human: cows go to college?
Meena: I heard that a cow went to Harvard.
Human: what did the cow study?
Meena: Bovine sciences.
Human: do horses go to Harvard?
Meena: Horses go to Hayvard.
Human: that’s a pretty good joke, I feel like you led me into it
Meena: You were trying to steer it elsewhere, I can see it.

William_S 17 Dec 2020 20:57 UTC
7 points
on: Hiding Complexity
I think there are situations where you can still have subproblems where the output of the subproblem is long. A contrived example: suppose you have a problem where you want to calculate XOR(f(a), f(b)), where f(a) and f(b) are long strings. It seems reasonable to decompose into x=f(a), y=f(b), z=XOR(x, y), despite x and y being long, because there’s a simple way to combine them.
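
The contrived example can be written out directly. In this sketch `f` is a hypothetical stand-in (a hash-based expander) for whatever expensive subproblem produces the long strings; only the shape of the decomposition matters.

```python
import hashlib


def f(seed: bytes, length: int = 1024) -> bytes:
    """Hypothetical subproblem with a long output: expand a seed into `length` bytes."""
    out = b""
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(seed + counter.to_bytes(4, "big")).digest()
        counter += 1
    return out[:length]


def xor_bytes(x: bytes, y: bytes) -> bytes:
    """Combine step: byte-wise XOR of two equal-length strings."""
    if len(x) != len(y):
        raise ValueError("inputs must have equal length")
    return bytes(p ^ q for p, q in zip(x, y))


def solve(a: bytes, b: bytes) -> bytes:
    x = f(a)  # subproblem 1: long output
    y = f(b)  # subproblem 2: long output
    return xor_bytes(x, y)  # simple combination despite long intermediate results
```

Each subproblem returns a kilobyte of data here, yet the step that combines them is a single line.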

If we had an AI system that could work on “making progress on a problem for an hour”, then write down a complete description of everything it had figured out and pass that to another AI system, I’d count that as dividing the problem into subproblems, just in a way that’s probably inefficient.

I’d evaluate decompositions into subproblems by something like the total cost of solving a problem by dividing it into subproblems. Some decompositions would be efficient and others would be inefficient. Sometimes this would be because the output is large, but in other cases it could be because it takes a long time to write the input, or because there’s a lot of work repeated between subproblems.

William_S (William Saunders)

HCH is not just Mechanical Turk

Understanding Iterated Distillation and Amplification: Claims and Oversight

Thoughts on refusing harmful requests to large language models

Reinforcement Learning in the Iterated Amplification Framework

Amplification Discussion Notes

[Question] Is there an intuitive way to explain how much better superforecasters are than regular forecasters?