AI systems end up controlled by a group of humans representing a small range of human values (ie. an ideological or religious group that imposes its values on everyone else). While this is not caused only by AI design, design decisions could affect the likelihood of this scenario (ie. at what point values are loaded into the system, and how many people’s values are loaded into it), so it is relevant for overall strategy.
Failure to learn how to deal with alignment in the many-humans, many-AIs case even if single-human, single-AI alignment is solved (which I think Andrew Critch has talked about). For example, AIs negotiating on behalf of humans take the stance described in https://arxiv.org/abs/1711.00363 of agreeing to split control of the future according to which human’s priors are most accurate (on potentially irrelevant issues) if this isn’t what humans actually want.
Maybe one AI philosophy service could look like this: it would ask you a bunch of other questions that are simpler than the problem of qualia, then show you what those answers imply about the problem of qualia if you use some method of reconciling those answers.
Re: Philosophy as interminable debate, another way to put the relationship between math and philosophy:
Philosophy as weakly verifiable argumentation
Math is solving problems by looking at the consequences of a small number of axiomatic reasoning steps. For something to be math, we have to be able to ultimately cash out any proof as a series of these reasoning steps. Once something is cashed out in this way, it takes a small constant amount of time to verify any single reasoning step, so we can verify the whole proof in time polynomial in its length.
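To make the "cash out a proof as a series of checkable steps" picture concrete, here is a minimal sketch of a proof checker for a toy system whose only inference rule is modus ponens. The formula encoding (strings for atoms, `("->", A, B)` tuples for implications) and the name `check_proof` are assumptions made for illustration, not anything from a real proof assistant.

```python
def check_proof(axioms, proof):
    """Check a proof one line at a time.

    axioms: set of formulas. A formula is a string atom or a tuple
            ("->", antecedent, consequent). (Hypothetical encoding.)
    proof:  list of (formula, justification) pairs, where justification is
            ("axiom",) or ("mp", i, j): line i is A, line j is A -> formula.
    """
    derived = []
    for formula, just in proof:
        if just[0] == "axiom":
            ok = formula in axioms
        elif just[0] == "mp":
            _, i, j = just
            # Both cited lines must already have been derived.
            ok = (0 <= i < len(derived) and 0 <= j < len(derived)
                  and derived[j] == ("->", derived[i], formula))
        else:
            ok = False  # unknown rule: reject
        if not ok:
            return False
        derived.append(formula)
    return True


axioms = {"p", ("->", "p", "q")}
proof = [
    ("p", ("axiom",)),
    (("->", "p", "q"), ("axiom",)),
    ("q", ("mp", 0, 1)),
]
print(check_proof(axioms, proof))
```

Each line is checked with a fixed number of lookups and comparisons, so the whole check is a single pass over the proof. The philosophy case is the one where we have no agreed-upon analogue of the `ok` test.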
Philosophy is solving problems where we haven’t figured out a set of axiomatic reasoning steps. Any non-axiomatic reasoning step we propose could end up having arguments that we hadn’t thought of that would lead us to reject that step. And those arguments themselves might be undermined by other arguments, and so on. Each round of debate lets us add another level of counter-arguments. Philosophers can make progress when they have some good predictor of whether arguments are good or not, but they don’t have access to certain knowledge of arguments being good.
Another difference between mathematics and philosophy is that in mathematics we have a well defined set of objects and a well-defined problem we are asking about. Whereas in philosophy we are trying to ask questions about things that exist in the real world and/or we are asking questions that we haven’t crisply defined yet.
When we come up with a set of axioms and a description of a problem, we can move that problem from the realm of philosophy to the realm of mathematics. When we come up with some method we trust of verifying arguments (ie. replicating scientific experiments), we can move problems out of philosophy to other sciences.
It could be the case that philosophy grounds out in some reasonable set of axioms which we don’t have access to now for computational reasons—in which case it could all end up in the realm of mathematics. It could be the case that, for all practical purposes, we will never reach this state, so it will remain in the “potentially unbounded DEBATE round case”. I’m not sure what it would look like if it could never ground out—one model could be that we have a black box function that performs a probabilistic evaluation of argument strength given counter-arguments, and we go through some process to get the consequences of that, but it never looks like “here is a set of axioms”.
I guess it feels like I don’t know how we could know that we’re in the position that we’ve “solved” meta-philosophy. It feels like the thing we could do is build a set of better and better models of philosophy and check their results against held-out human reasoning and against each other.
I also don’t think we know how to specify a ground truth reasoning process that we could try to protect and run forever which we could be completely confident would come up with the right outcome (where something like HCH is a good candidate but potentially with bugs/subtleties that need to be worked out).
I feel like I have some (not well justified and possibly motivated) optimism that this process yields something good fairly early on. We could gain confidence that we are in this world if we build a bunch of better and better models of meta-philosophy and observe at some point the models continue agreeing with each other as we improve them, and that they agree with various instantiations of protected human reasoning that we run. If we are in this world, the thing we need to do is just spend some time building a variety of these kinds of models and produce an action that looks good to most of them. (Where agreement is not “comes up with the same answer” but more like “comes up with an answer that other models think is okay and not disastrous to accept”).
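A minimal sketch of the "produce an action that looks good to most of them" step, under strong simplifying assumptions: each model is reduced to a hypothetical yes/no function answering "is this action okay and not disastrous to accept?", and `pick_action` and `threshold` are names invented for illustration.

```python
def pick_action(candidates, models, threshold=0.5):
    """Return the candidate accepted by the largest fraction of models,
    or None if no candidate is accepted by more than `threshold` of them.

    models: list of functions action -> bool (hypothetical stand-ins for
            "this model thinks the action is okay, not disastrous").
    """
    best, best_frac = None, 0.0
    for action in candidates:
        frac = sum(1 for m in models if m(action)) / len(models)
        if frac > best_frac:
            best, best_frac = action, frac
    return best if best_frac > threshold else None


# Toy usage: three models with different tolerances over numeric "actions".
models = [lambda a: a < 10, lambda a: a % 2 == 0, lambda a: a > 3]
print(pick_action([1, 4, 7, 12], models))
```

Note that agreement here is acceptance, not "comes up with the same answer": a model never has to propose the chosen action, only not object to it.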
Do you think this would lead to “good outcomes”? Do you think some version of this approach could be satisfactory for solving the problems in Two Neglected Problems in Human-AI Safety?
Do you think there’s a different kind of thing that we would need to do to “solve metaphilosophy”? Or do you think that working on “solving metaphilosophy” roughly cashes out as “work on coming up with better and better models of philosophy in the model I’ve described here”?
A couple ways to implement a hybrid approach with existing AI safety tools:
Logical Induction: Specify some computationally expensive simulation of idealized humans. Run a logical inductor with the deductive process running the simulation and outputting what the humans say after time x in simulation, as well as statements about what non-idealized humans are saying in the real world. The inductor should be able to provide beliefs about what the idealized humans will say in the future informed by information from the non-idealized humans.
HCH/IDA: The HCH-humans demonstrate a reasoning process which aims to predict the output of a set of idealized humans using all available information (which can include running simulations of idealized humans or information from real humans). The way that the HCH tree uses information about real humans involves looking carefully at their circumstances and asking things like “how do the real human’s circumstances differ from the idealized human’s?” and “is the information from the real human compromised in some way?”
It seems like for Filtered-HCH, the application in the post you linked to, you might be able to do a weaker version where you label any computation that you can’t understand in kN steps as problematic, only accepting things you think you can efficiently understand. (But I don’t think Paul is arguing for this weaker version).
RL is typically about sequential decision-making, and I wasn’t sure where the “sequential” part came in.
I guess I’ve used the term “reinforcement learning” to refer to a broader class of problems including both one-shot bandit problems and sequential decision making problems. In this view, the feature that makes RL different from supervised learning is not that we’re trying to figure out how to act in an MDP/POMDP, but instead that we’re trying to optimize a function that we can’t take the derivative of (in the MDP case, it’s because the environment is non-differentiable, and in the approval learning case, it’s because the overseer is non-differentiable).
Re: scenario 3, see The Evitable Conflict, the last story in Isaac Asimov’s “I, Robot”:
“Stephen, how do we know what the ultimate good of Humanity will entail? We haven’t at our disposal the infinite factors that the Machine has at its! Perhaps, to give you a not unfamiliar example, our entire technical civilization has created more unhappiness and misery than it has removed. Perhaps an agrarian or pastoral civilization, with less culture and less people would be better. If so, the Machines must move in that direction, preferably without telling us, since in our ignorant prejudices we only know that what we are used to, is good – and we would then fight change. Or perhaps a complete urbanization, or a completely caste-ridden society, or complete anarchy, is the answer. We don’t know. Only the Machines know, and they are going there and taking us with them.”
Yeah, to some extent. In the Lookup Table case, you need to have a (potentially quite expensive) way of resolving all mistakes. In the Overseer’s Manual case, you can also leverage humans to do some kind of more robust reasoning (for example, they can notice a typo in a question and still respond correctly, even if the Lookup Table would fail in this case). Though in low-bandwidth oversight, the space of things that participants could notice and correct is fairly limited.
Though I think this still differs from HRAD in that it seems like the output of HRAD would be a much smaller thing in terms of description length than the Lookup Table, and you can buy extra robustness by adding many more human-reasoned things into the Lookup Table (ie. automatically add versions of all questions with typos that don’t change the meaning of a question into the Lookup Table, add 1000 different sanity check questions to flag that things can go wrong).
So I think there are additional ways the system could correct mistaken reasoning relative to what I would think the output of HRAD would look like, but you do need to have processes that you think can correct any way that reasoning goes wrong. So the problem could be less difficult than HRAD, but still tricky to get right.
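As a rough illustration of the "automatically add versions of all questions with typos" idea, here is a minimal sketch. Everything in it is an assumption for illustration: the table is a plain dict from question string to answer, typos are limited to single-character deletions and adjacent swaps, and nothing checks that a variant preserves the question's meaning (that part would still need human-reasoned rules).

```python
def typo_variants(question):
    """Strings one deletion or one adjacent-character swap away."""
    variants = set()
    for i in range(len(question)):
        variants.add(question[:i] + question[i + 1:])  # delete char i
        if i + 1 < len(question):
            # swap chars i and i+1
            variants.add(question[:i] + question[i + 1] + question[i]
                         + question[i + 2:])
    variants.discard(question)
    return variants


def expand_table(table):
    """Add typo variants of every question, mapped to the same answer.

    Original entries are never overwritten. A variant reachable from two
    questions with different answers is dropped as ambiguous.
    """
    expanded = dict(table)
    ambiguous = set()
    for question, answer in table.items():
        for v in typo_variants(question):
            if v in table or v in ambiguous:
                continue
            if v in expanded and expanded[v] != answer:
                del expanded[v]
                ambiguous.add(v)
            else:
                expanded[v] = answer
    return expanded


table = expand_table({"What is 2+2?": "4"})
print(table["Waht is 2+2?"])
```

This buys robustness by making the table bigger rather than smarter, which is the contrast with HRAD above: the description length grows, but each added entry is cheap to produce.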
Thanks, this position makes more sense in light of Beyond Astronomical Waste (I guess I have some concept of “a pretty good future” that is fine with something like a bunch of human-descended beings living happy lives that misses out on the sort of things mentioned in Beyond Astronomical Waste, and “optimal future” which includes those considerations). I buy this as an argument that “we should put more effort into making philosophy work to make the outcome of AI better, because we risk losing large amounts of value” rather than “our efforts to get a pretty good future are doomed unless we make tons of progress on this” or something like that.
“Thousands of millions” was a typo.
What is the motivation for using RL here?
I see the motivation as follows: given practical compute limits, it may be much easier to have the system find an action the overseer approves of than to imitate the overseer directly. Using RL also allows you to use any advances that are made in RL by the machine learning community to try to remain competitive.
Would this still be a problem if we were training the agent with SL instead of RL?
Maybe this could happen with SL if SL does some kind of large search and finds a solution that looks good but is actually bad. The distilled agent would then learn to identify this action and reproduce it, which implies the agent learning some facts about the action to efficiently locate it with much less compute than the large search process. Knowing what the agent knows would allow the overseer to learn those facts, which might help in identifying this action as bad.
I don’t understand why we want to find this X* in the imitation learning case.
Ah, with this example the intent was more like “we can frame what the RL case is doing as finding X*, let’s show how we could accomplish the same thing in the imitation learning case (in the limit of unlimited compute)”.
The reverse mapping (imitation to RL) just consists of applying reward 1 to M2’s demonstrated behaviour (which could be “execute some safe search and return the results”), and reward 0 to everything else.
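A minimal sketch of that reverse mapping, assuming actions can be compared for exact equality (the function name `imitation_reward` is hypothetical):

```python
def imitation_reward(demonstrated_action):
    """Turn a demonstration into a reward function:
    1 for reproducing the demonstrated behaviour, 0 for anything else."""
    def reward(action):
        return 1.0 if action == demonstrated_action else 0.0
    return reward


reward = imitation_reward("execute some safe search and return the results")
print(reward("execute some safe search and return the results"))
print(reward("do something else"))
```

An RL agent maximizing this reward is pushed toward exactly the demonstrated behaviour, though the reward is very sparse, so in practice it gives the learner much less signal than the demonstration itself.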
What is pM(X∗)?
pM(X∗) is the probability of outputting X∗ (where pM is a stochastic policy)
M2(“How good is answer X to Y?”) ∗ ∇log(pM(X))
This is the REINFORCE gradient estimator (which tries to increase the log probability of actions that were rated highly)
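Here is a minimal sketch of that estimator for a one-shot (bandit) setting, assuming a softmax policy over a small fixed set of candidate answers. The names `reinforce_step`, `train`, and `score_fn` are hypothetical; `score_fn` stands in for M2’s rating of an answer and is treated as a black box that we never differentiate through.

```python
import math
import random


def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_step(logits, score_fn, rng, lr=0.5):
    """One REINFORCE update: sample X ~ pM, then move the logits along
    score(X) * grad log pM(X)."""
    probs = softmax(logits)
    x = rng.choices(range(len(logits)), weights=probs)[0]
    score = score_fn(x)  # black-box rating, no gradient needed
    # For a softmax policy, d log p(x) / d logit_i = 1[i == x] - p_i.
    grad_log_p = [(1.0 if i == x else 0.0) - p for i, p in enumerate(probs)]
    return [l + lr * score * g for l, g in zip(logits, grad_log_p)]


def train(score_fn, n_answers, steps=2000, seed=0):
    rng = random.Random(seed)
    logits = [0.0] * n_answers
    for _ in range(steps):
        logits = reinforce_step(logits, score_fn, rng)
    return softmax(logits)


# Toy overseer: only answer 2 is rated as good.
probs = train(lambda x: 1.0 if x == 2 else 0.0, n_answers=4)
print(probs)
```

The only thing the update needs from the overseer is a number for the sampled answer, which is why the same estimator works whether the non-differentiable part is an environment or an overseer.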
I guess the question was more from the perspective of: if the cost was zero then it seems like it would be worth running, so what part of the cost makes it not worth running (where I would think of cost as probably time to judge or availability of money to fund the contest).
One important dimension to consider is how hard it is to solve philosophical problems well enough to have a pretty good future (which includes avoiding bad futures). It could be the case that this is not so hard, but fully resolving questions so we could produce an optimal future is very hard or impossible. It feels like this argument implicitly relies on assuming that “solve philosophical problems well enough to have a pretty good future” is hard (ie. takes thousands of millions of years in scenario 4) - can you provide further clarification on whether/why you think that is the case?
Slightly disappointed that this isn’t continuing (though I didn’t submit to the prize, I submitted to Paul Christiano’s call for possible problems with his approach which was similarly structured). Was hoping that once I got further into my PhD, I’d have some more things worth writing up, and the recognition/a bit of prize money would provide some extra motivation to get them out the door.
What do you feel is the limiting resource that keeps this from being useful to continue in its current form?
Yeah, this is a problem that needs to be addressed. It feels like in the Overseer’s Manual case you can counteract this by giving definitions/examples of how you want questions to be interpreted, and in the Lookup Table case this can be addressed by coordination within the team creating the lookup table.
Do you think you’d agree with a claim of this form applied to corrigibility of plans/policies/actions?
That is: If some plan/policy/action is incorrigible, then A can provide some description of how the action is incorrigible.
The better we can solve the key questions (“what are these ‘wiser’ versions?”, “how is the whole setup designed?”, “what questions exactly is it trying to answer?”), the better the wiser ourselves will be at their tasks.
I feel like this statement suggests that we might not be doomed if we make a bunch of progress, but not full progress on these statements. I agree with that assessment, but it felt on reading the post like the post was making the claim “Unless we fully specify a correct theory of human values, we are doomed”.
I think that I’d view something like Paul’s indirect normativity approach as requiring that we do enough thinking in advance to get some critical set of considerations known by the participating humans, but once that’s in place we should be able to go from this core set to get the rest of the considerations. And it seems possible that we can do this without a fully-solved theory of human value (but any theoretical progress in advance we can make on defining human value is quite useful).