RL is typically about sequential decision-making, and I wasn’t sure where the “sequential” part came in).
I guess I’ve used the term “reinforcement learning” to refer to a broader class of problems including both one-shot bandit problems and sequential decision making problems. In this view The feature that makes RL different from supervised learning is not that we’re trying to figure out what how to act in an MDP/POMDP, but instead that we’re trying to optimize a function that we can’t take the derivative of (in the MDP case, it’s because the environment is non-differentiable, and in the approval learning case, it’s because the overseer is non-differentiable).
Re: scenario 3, see The Evitable Conflict, the last story in Isaac Asimov’s “I, Robot”:
“Stephen, how do we know what the ultimate good of Humanity will entail? We haven’t at our disposal the infinite factors that the Machine has at its! Perhaps, to give you a not unfamiliar example, our entire technical civilization has created more unhappiness and misery than it has removed. Perhaps an agrarian or pastoral civilization, with less culture and less people would be better. If so, the Machines must move in that direction, preferably without telling us, since in our ignorant prejudices we only know that what we are used to, is good – and we would then fight change. Or perhaps a complete urbanization, or a completely caste-ridden society, or complete anarchy, is the answer. We don’t know. Only the Machines know, and they are going there and taking us with them.”
Yeah, to some extent. In the Lookup Table case, you need to have a (potentially quite expensive) way of resolving all mistakes. In the Overseer’s Manual case, you can also leverage humans to do some kind of more robust reasoning (for example, they can notice a typo in a question and still respond correctly, even if the Lookup Table would fail in this case). Though in low-bandwidth oversight, the space of things that participants could notice and correct is fairly limited.
Though I think this still differs from HRAD in that it seems like the output of HRAD would be a much smaller thing in terms of description length than the Lookup Table, and you can buy extra robustness by adding many more human-reasoned things into the Lookup Table (ie. automatically add versions of all questions with typos that don’t change the meaning of a question into the Lookup Table, add 1000 different sanity check questions to flag that things can go wrong).
So I think there are additional ways the system could correct mistaken reasoning relative to what I would think the output of HRAD would look like, but you do need to have processes that you think can correct any way that reasoning goes wrong. So the problem could be less difficult than HRAD, but still tricky to get right.
Thanks, this position makes more sense in light of Beyond Astronomical Waste (I guess I have some concept of “a pretty good future” that is fine with something like a bunch of human-descended beings living a happy lives that misses out on the sort of things mentioned in Beyond Astronomical Waste, and “optimal future” which includes those considerations). I buy this as an argument that “we should put more effort into making philosophy work to make the outcome of AI better, because we risk losing large amounts of value” rather than “our efforts to get a pretty good future are doomed unless we make tons of progress on this” or something like that.
“Thousands of millions” was a typo.
What is the motivation for using RL here?
I see the motivation as given practical compute limits, it may be much easier to have the system find an action the overseer approves of instead of imitating the overseer directly. Using RL also allows you to use any advances that are made in RL by the machine learning community to try to remain competitive.
Would this still be a problem if we were training the agent with SL instead of RL?
Maybe this could happen with SL if SL does some kind of large search and finds a solution that looks good but is actually bad. The distilled agent would then learn to identify this action and reproduce it, which implies the agent learning some facts about the action to efficiently locate it with much less compute than the large search process. Knowing what the agent knows would allow the overseer to learn those facts, which might help in identifying this action as bad.
I don’t understand why we want to find this X* in the imitation learning case.
Ah, with this example the intent was more like “we can frame what the RL case is doing as finding X* , let’s show how we could accomplish the same thing in the imitation learning case (in the limit of unlimited compute)”.
The reverse mapping (imitation to RL) just consists of applying reward 1 to M2′s demonstrated behaviour (which could be “execute some safe search and return the results), and reward 0 to everything else.
What is pM(X∗)?
pM(X∗) is the probability of outputting X∗ (where pM is a stochastic policy)
M2(“How good is answer X to Y?“)∗∇log(pM(X))
This is the REINFORCE gradient estimator (which tries to increase the log probability of actions that were rated highly)
I guess the question was more from the perspective of: if the cost was zero then it seems like it would worth running, so what part of the cost makes it not worth running (where I would think of cost as probably time to judge or availability of money to fund the contest).
One important dimension to consider is how hard it is to solve philosophical problems well enough to have a pretty good future (which includes avoiding bad futures). It could be the case that this is not so hard, but fully resolving questions so we could produce an optimal future is very hard or impossible. It feels like this argument implicitly relies on assuming that “solve philosophical problems well enough to have a pretty good future” is hard (ie. takes thousands of millions of years in scenario 4) - can you provide further clarification on whether/why you think that is the case?
Slightly disappointed that this isn’t continuing (though I didn’t submit to the prize, I submitted to Paul Christiano’s call for possible problems with his approach which was similarly structured). Was hoping that once I got further into my PhD, I’d have some more things worth writing up, and the recognition/a bit of prize money would provide some extra motivation to get them out the door.
What do you feel like is the limiting resource that keeps continuing this from being useful to continue in it’s current form?
Yeah, this is a problem that needs to be addressed. It feels like in the Overseers Manual case you can counteract this by giving definitions/examples of how you want questions to be interpreted, and in the Lookup Table case this can be addr by coordination within the team creating the lookup table
Do you think you’d agree with a claim of this form applied to corrigibility of plans/policies/actions?
That is: If some plan/policy/action is uncorrigible, then A can provide some description of how the action is incorrigible.
The better we can solve the key questions (“what are these ‘wiser’ versions?“, “how is the whole setup designed?“, “what questions exactly is it trying to answer?“), the better the wiser ourselves will be at their tasks.
I feel like this statement suggests that we might not be doomed if we make a bunch of progress, but not full progress on these statements. I agree with that assessment, but it felt on reading the post like the post was making the claim “Unless we fully specify a correct theory of human values, we are doomed”.
I think that I’d view something like Paul’s indirect normativity approach as requiring that we do enough thinking in advance to get some critical set of considerations known by the participating humans, but once that’s in place we should be able to go from this core set to get the rest of the considerations. And it seems possible that we can do this without a fully-solved theory of human value (but any theoretical progress in advance we can make on defining human value is quite useful).
My interpretation of what you’re saying here is that the overseer in step #1 can do a lot of things to bake in having the AI interpret “help the user get what they really want” in ways that get the AI to try to eliminate human safety problems for the step #2 user (possibly entirely), but problems might still occur in the short term before the AI is able to think/act to remove those safety problems.
It seems to me that this implies that IDA essentially solves the AI alignment portion of points 1 and 2 in the original post (modulo things happening before the AI is in control).
Correcting all problems in the subsequent amplification stage would be a nice property to have, but I think IDA can still work even if it corrects errors with multiple A/D steps in between (as long as all catastrophic errors are caught before deployment). For example, I could think of the agent initially using some rules for how to solve math problems where distillation introduces some mistake, but later in the IDA process the agent learns how to rederive those rules and realizes the mistake.
Shorter name candidates:
Inductively Aligned AI Development
Inductively Aligned AI Assistants
It’s a nice property of this model that it prompts consideration of the interaction between humans and AIs at every step (to highlight things like risks of the humans having access to some set of AI systems for manipulation or moral hazard reasons).
In the higher dimensional belief/reward space, do you think that it would be possible to significantly narrow down the space of possibilities (so this argument is saying “be bayesian with respect to reward/beliefs, picking policies that work over a distribution) or are you more pessimistic than that, thinking that the uncertainty would be so great in higher dimensional spaces that it would not be possible to pick a good policy?
Open Question: Working with concepts that the human can’t understand
Question: when we need to assemble complex concepts by learning/interacting with the environment, rather than using H’s concepts directly, and when those concepts influence reasoning in subtle/abstract ways, how do we retain corrigibility/alignment?
Paul: I don’t have any general answer to this, seems like we should probably choose some example cases. I’m probably going to be advocating something like “Search over a bunch of possible concepts and find one that does what you want / has the desired properties.”
E.g. for elegant proofs, you want a heuristic that gives successful lines of inquiry higher scores. You can explore a bunch of concepts that do that, evaluate each one according to how well it discriminates good from bad lines of inquiry, and also evaluate other stuff like “What would I infer from learning that a proof is `elegant` other than that it will work” and make sure that you are OK with that.
Andreas: Suppose you don’t have the concepts of “proof” and “inquiry”, but learned them (or some more sophisticated analogs) using the sort of procedure you outlined below. I guess I’m trying to see in more detail that you can do a good job at “making sure you’re OK with reasoning in ways X” in cases where X is far removed from H’s concepts. (Unfortunately, it seems to be difficult to make progress on this by discussing particular examples, since examples are necessarily about concepts we know pretty well.)
This may be related to the more general question of what sorts of instructions you’d give H to ensure that if they follow the instructions, the overall process remains corrigible/aligned.
Open Question: Severity of “Honest Mistakes”
In the discussion about creative problem solving,Paul said that he was concerned about problems arising when the solution generator was deliberately searching for a solution with harmful side effects. Other failures could occur where the solution generator finds a solution with harmful side effects without “deliberately searching” for it. The question is how bad these “honest mistakes” would end up being.
Paul: I also want to make the further claim that such failures are much less concerning than what-I’m-calling-alignment failures, which is a possible disagreement we could dig into (I think Wei Dai disagrees or is very unsure).