Member of the OpenAI scalable alignment team
William_S (William Saunders)
Could I put in a request to see a brain dump from Eliezer of ways to gain dignity points?
Jan Leike has written about inner alignment here https://aligned.substack.com/p/inner-alignment. (I’m at OpenAI; imo, I’m not sure this will work in the worst case, and I’m hoping we can come up with a more robust plan.)
From my current understanding of Paul’s IDA approach, I think there are two different senses in which corrigibility can be thought about with regard to IDA, each with a different level of guarantee, captured by the following three properties:
1. On average, the reward function incentivizes behaviour which competes effectively and gives the user effective control.
2. There do not exist inputs on which the policy chooses an action because it is bad, or the value function outputs a high value because the prior behaviour was bad. (Or else the policy on its own will generate bad consequences.)
3. The reward function never gives a behaviour a higher reward because it is bad. (Or else the test-time optimisation by MCTS can generate bad behaviour.) For example, if the AI deludes the human operator so that the operator can’t interfere with the AI’s behaviour, that behaviour can’t receive a higher reward even if it ultimately allows the AI to make more money.
Property 1 deals with “consequence corrigibility” (competence at producing actions that will produce outcomes in the world we would describe as corrigible).
Properties 2 & 3 deal with corrigibility in terms of “intent corrigibility” (guaranteeing that the system does not optimise for bad outcomes). This does not cover the agent causing bad outcomes in the world through incompetence, only the agent deliberately trying to produce bad outcomes.
I think IDA doesn’t require or claim worst-case guarantees on the task of “consequence corrigibility” (and that this is an impossible goal for bounded reasoners).
I think that average-case good performance on “consequence corrigibility” is claimed by IDA, but only as a subset of general competence.
I think that providing worst-case guarantees on “intent corrigibility” is required and claimed by IDA.
Roughly, I think that:
Versions of IDA that allow the overseer nodes more information could be generally competent (including predicting what behaviour could be corrigible), but could fail to be “intent corrigible”
Versions of IDA that allow the overseer nodes only a highly restricted set of queries could be “intent corrigible” but fail to be generally competent, and hence not be “consequence corrigible”
Standard ML approaches will, at some level of optimisation power, fail to behave “intent corrigibly” (even if you train them to be “consequence corrigible”)
The question I’m uncertain about is whether there’s a middle point in tradeoff space where both properties are sufficiently satisfied to produce good outcomes.
Do you agree or disagree with how I’ve broken down corrigibility claims for IDA, and which claims do you think your argument bears on?
Maybe a useful way to get feedback on how good you are at this would be to try making predictions based on your experience with language models:
without looking at the results or running systematic experiments on the dataset, predict which tasks on BIG-bench will be doable
make bets of the form “we’ll reach X% performance on task A before we reach Y% performance on task B”
predict for some prompt what percentage of samples will satisfy some property, then take a large number of samples and rate them
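The third kind of prediction can be scored mechanically once the samples are rated. A minimal sketch, where `sample` and `has_property` are hypothetical stand-ins for drawing a model sample and rating it:

```python
import random

def calibration_check(predicted_frac, sample, has_property, n=1000, seed=0):
    """Score a prediction of the form 'X% of samples satisfy the property'.

    sample: hypothetical stand-in for drawing one model sample (takes an RNG).
    has_property: hypothetical stand-in for the human rating of a sample.
    Returns the empirical fraction and a squared-error (Brier-style) score,
    lower is better.
    """
    rng = random.Random(seed)
    hits = sum(has_property(sample(rng)) for _ in range(n))
    empirical = hits / n
    return empirical, (predicted_frac - empirical) ** 2
```

For example, predicting 30% for a stub sampler whose samples really do satisfy the property about 30% of the time should give a score near zero.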
Evaluation assistance as mentioned in the post on AI-assisted human feedback could help people avoid being fooled (e.g. in debate where the opponent can point out how you’re being fooled). It’s still an open question how well that will work in practice and how quickly it will Goodhart (these techniques should fail on some things, as discussed in the ELK report), but it seems possible that models will be helpful enough on alignment before they Goodhart.
The worry I’d have about this interpretability direction is that we become very good at telling stories about what 95% of the weights in neural networks do, but the remaining 5% hides some important stuff, which could end up including things like mesa-optimizers or deception. Do you have thoughts on that?
Submission: low-bandwidth oracle
Plan Criticism: Given plan to build an aligned AI, put together a list of possible lines of thought to think about problems with the plan (open questions, possible failure modes, criticisms, etc.). Ask the oracle to pick one of these lines of thought, pick another line of thought at random, and spend the next time period X thinking about both, judge which line of thought was more useful to think about (where lines of thought that spot some fatal missed problem are judged to be very useful) and reward the oracle if its suggestion was picked.
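The reward scheme above can be sketched as a single round of the protocol (with `judge_usefulness` standing in for the human judgement made after time period X):

```python
import random

def plan_criticism_round(lines_of_thought, oracle_pick_index, judge_usefulness):
    """One round of the low-bandwidth oracle protocol sketched above.

    lines_of_thought: candidate criticisms / open questions about the plan.
    oracle_pick_index: the oracle's chosen line (its only output).
    judge_usefulness: hypothetical stand-in for the human's judgement of how
        useful a line of thought was after spending time period X on it.
    Returns 1 (reward the oracle) if its suggestion was judged more useful
    than a randomly chosen alternative, else 0.
    """
    oracle_line = lines_of_thought[oracle_pick_index]
    # Pick a different line of thought at random as the comparison baseline.
    alternatives = [l for i, l in enumerate(lines_of_thought)
                    if i != oracle_pick_index]
    random_line = random.choice(alternatives)
    # Think about both for time period X, then judge which was more useful.
    return 1 if judge_usefulness(oracle_line) > judge_usefulness(random_line) else 0
```

The oracle's output bandwidth is just the index of one line of thought, which is what keeps this in the low-bandwidth regime.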
It might flip gradually if it felt a divide-and-conquer strategy was useful, pretending to be nice to some people while not being nice to others. I don’t think this is likely. Otherwise, it seems very useful to conceal your intent until the point where you can win.
You define robustness to scaling down as “a solution to alignment keeps working if the AI is not optimal or perfect”, but for interpretability you talk about “our interpretability is merely good or great, but doesn’t capture everything relevant to alignment”, which seems to be about the alignment approach/our understanding being flawed, not the AI. I can imagine techniques being robust to imperfect AI, but I find it harder to imagine how any alignment approach could be robust if the approach, or our implementation of it, is itself flawed. Do you have any example of this?
One way to try to measure capability robustness separately from alignment robustness off the training distribution of some system would be to:
use an inverse reinforcement learning algorithm to infer the reward function of the off-distribution behaviour
train a new system to do as well on the reward function as the original system
measure the number of training steps needed to reach this point for the new system.
This would let you make comparisons between different systems as to which was more capability robust.
Maybe there’s a version that could train the new system using behavioural cloning, but it’s less clear how you measure when you’re as competent as the original agent (maybe using a discriminator?)
The reason for trying this is to have a measure of competence that is less dependent on human judgement and closer to the system’s ontology and capabilities.
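The three steps above could be wrapped in a small comparison loop. A toy sketch, where `inferred_reward`, `train_step`, and the initial parameters are hypothetical stand-ins for a real IRL pipeline and learner:

```python
def capability_robustness_steps(inferred_reward, original_return,
                                train_step, init_params, max_steps=10_000):
    """Count training steps for a fresh system to match the original
    system's performance on the IRL-inferred reward.

    inferred_reward: stand-in for the reward function inferred by IRL
        from the original system's off-distribution behaviour.
    original_return: the original system's score under that reward.
    train_step: stand-in for one optimisation step on the new system.
    Returns the step count at which the new system matches the original,
    or None if it never does within the budget; lower counts suggest the
    original behaviour is easier to recover, i.e. more capability-robust.
    """
    params = init_params
    for step in range(1, max_steps + 1):
        params = train_step(params)
        if inferred_reward(params) >= original_return:
            return step  # new system is now as good as the original
    return None  # did not match within the training budget
```

With a toy quadratic reward and a fixed-size training step this runs end to end; comparing the returned step counts across systems gives the relative measure described above.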
Among people I’ve had significant online discussions with, your writings on alignment tend to be the hardest to understand and easiest to misunderstand.
Additionally, I think that there are ways to misunderstand the IDA approach that leave out significant parts of the complexity (ie. IDA based off of humans thinking for a day with unrestricted input, without doing the hard work of trying to understand corrigibility and meta-philosophy beforehand), but can seem to be plausible things to talk about in terms of “solving the AI alignment problem” if one hasn’t understood the more subtle problems that would occur. It’s then easy to miss the problems and feel optimistic about IDA working while underestimating the amount of hard philosophical work that needs to be done, or to incorrectly attack the approach for missing the problems completely.
(I think that these simpler versions of IDA might be worth thinking about as a plausible fallback plan if no other alignment approach is ready in time, but only if they are restricted in terms of accomplishing specific tasks to stabilise the world, restricted in how far the amplification is taken, replaced with something better as soon as possible, etc. I also think that working on simple versions of IDA can help make progress on issues that would be required for fully scalable IDA, ie. the experiments that Ought is running.)
Google’s Meena (2.6 billion parameters, February 2020) creates original puns in “Conversation D”, and I think “Cross-turn Repetition Example 2” is absurdist comedy, but maybe more as a result of the way the model fails.
Conversation D
Human: I like how cute cows are
Meena: Ah, interesting. You like how they are cute, right?
Human: yeah, and they’re realy smart
Meena: I heard they go to college
Human: cows go to college?
Meena: I heard that a cow went to Harvard.
Human: what did the cow study?
Meena: Bovine sciences.
Human: do horses go to Harvard?
Meena: Horses go to Hayvard.
Human: that’s a pretty good joke, I feel like you led me into it
Meena: You were trying to steer it elsewhere, I can see it.
I think there are situations where you can still have subproblems where the output of the subproblem is long. A contrived example: suppose you have a problem where you want to calculate XOR(f(a), f(b)), where f(a) and f(b) are long strings. It seems reasonable to decompose into x=f(a), y=f(b), z=XOR(x, y), despite x and y being long, because there’s a simple way to combine them.
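The contrived example above as code (with `f` left as an arbitrary function producing long bit strings):

```python
def xor_bits(x, y):
    # The cheap combination step: bitwise XOR of two equal-length bit strings.
    return ''.join('1' if a != b else '0' for a, b in zip(x, y))

def solve_decomposed(f, a, b):
    """Compute XOR(f(a), f(b)) via the decomposition x=f(a), y=f(b),
    z=XOR(x, y). The intermediate outputs x and y may be long, but the
    way they are combined is simple, so the decomposition is still cheap."""
    x = f(a)               # subproblem 1 (long output)
    y = f(b)               # subproblem 2 (long output)
    return xor_bits(x, y)  # simple combination despite long intermediates
```

For instance, with a hypothetical `f` that returns an 8-bit binary encoding, the two long subproblem outputs combine in a single cheap pass.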
If we had an AI system that could work on “making progress on a problem for an hour”, then write down a complete description of everything it had figured out and pass that to another AI system, I’d count that as dividing the problem into subproblems, just in a way that’s probably inefficient.
I’d evaluate decompositions into subproblems by something like the total cost of solving a problem by dividing it into subproblems. Some decompositions would be efficient and others inefficient; sometimes this would be because the output is large, but in other cases it could be because it takes a long time to write the input, or because there’s a lot of work repeated between subproblems.
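A toy cost model along these lines (all three cost functions are hypothetical stand-ins for measured costs):

```python
def decomposition_cost(subproblems, input_cost, solve_cost, combine_cost):
    """Total cost of solving a problem via a given decomposition:
    writing each subproblem's input, solving each subproblem, and
    combining the outputs. Repeated work between subproblems shows up
    as the same cost paid once per subproblem that incurs it."""
    total = sum(input_cost(s) + solve_cost(s) for s in subproblems)
    return total + combine_cost(subproblems)
```

Comparing this total across candidate decompositions of the same problem is the efficiency comparison described above.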
AI systems end up controlled by a group of humans representing a small range of human values (ie. an ideological or religious group that imposes values on everyone else). While not caused only by AI design, it is possible that design decisions could impact the likelihood of this scenario (ie. at what point are values loaded into the system/how many people’s values are loaded into the system), and is relevant for overall strategy.
One important dimension to consider is how hard it is to solve philosophical problems well enough to have a pretty good future (which includes avoiding bad futures). It could be the case that this is not so hard, but fully resolving questions so we could produce an optimal future is very hard or impossible. It feels like this argument implicitly relies on assuming that “solve philosophical problems well enough to have a pretty good future” is hard (ie. takes thousands of millions of years in scenario 4) - can you provide further clarification on whether/why you think that is the case?
Slightly disappointed that this isn’t continuing (though I didn’t submit to the prize, I submitted to Paul Christiano’s call for possible problems with his approach, which was similarly structured). I was hoping that once I got further into my PhD, I’d have some more things worth writing up, and the recognition/a bit of prize money would provide some extra motivation to get them out the door.
What do you feel is the limiting resource that keeps this from being useful to continue in its current form?
Shorter name candidates:
Inductively Aligned AI Development
Inductively Aligned AI Assistants
I would see the benefit of humans vs. algorithms as being that giving a human a bunch of natural language instructions would be much easier (but harder to verify) than writing down a formal algorithm. Also, the training could just cover how to avoid taking incorrigible actions, and the Overseer could still use their judgement about how to perform competently within the space of corrigible outputs.
Paul, to what degree do you think your approach will scale indefinitely while maintaining corrigibility vs. just thinking that it will scale while maintaining corrigibility to the point where we “get our house in order”? (I feel like this would help me in understanding the importance of particular objections, though objections relevant to both scenarios are probably still relevant).
Christmas.