I’d say that the claim is not sufficient: it might provide some alignment value, but it needs a larger story about how the whole computation is going to be safe. I do think that the HCH framework could make specifying an aligned GOFAI-like computation easier (but it’s hard to come up with a rigorous argument for this without pointing to some kind of specification that we can make claims about, which is something I’d want to produce along the way while proceeding with HCH-like approaches).
I think a cleaner way of stating condition 3 might be “there exists some policy with a non-zero chance of deception at some point in the future, with expected value for the mesa-optimizer higher than the policy of always cooperating”.
This could be met if the threat of modification never goes away, but maybe decreases over time. (For example: at training time the system is monitored for defection every episode, while at deployment time it is monitored in only 1 in a million episodes, so it is not worthwhile to defect during training but worthwhile to defect at deployment time.)
This could fail to be met if the mesa-optimizer believes that 1) there is a 1% chance that it will be detected by the base optimizer, and 2) if this happens, the base optimizer will be able to determine $O_{\text{mesa}}$ and give it a large negative reward, 100x larger in magnitude than the possible reward from the best defecting policy. (I’m not sure there’s any version of this that would work, since it seems hard to figure out $O_{\text{mesa}}$ and provide negative reward.)
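As a rough sanity check on the two cases above, here is a minimal arithmetic sketch. Everything in it is an illustrative assumption except the figures already quoted (monitoring every episode vs. 1 in a million, a 1% detection chance, a 100x penalty): the function name `defection_value` is made up, cooperating is normalised to a value of 0, a successful defection to a gain of 1, and the cost of being caught in the monitoring case is arbitrarily set to 100.

```python
def defection_value(p_detect: float, gain: float = 1.0, penalty: float = 100.0) -> float:
    """Expected value of defecting once, relative to cooperating (value 0).

    With probability p_detect the defection is caught and costs `penalty`;
    otherwise it succeeds and yields `gain`. All numbers are illustrative.
    """
    return (1.0 - p_detect) * gain - p_detect * penalty


# Monitoring case: checked every episode in training, 1 in a million at deployment.
training = defection_value(p_detect=1.0)     # always caught, so defecting loses
deployment = defection_value(p_detect=1e-6)  # almost never caught, so defecting wins

# Deterrence case: 1% chance of detection, penalty 100x the best defecting reward.
deterred = defection_value(p_detect=0.01, gain=1.0, penalty=100.0)

print(training, deployment, deterred)
```

Under these assumptions defection has negative expected value during training and in the deterrence case (0.99 of the gain against 0.01 of a 100x penalty), but positive expected value at deployment. The deterrence case is close to break-even, so the conclusion is sensitive to the exact detection probability and penalty.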
To me, it seems like the point of this story is that we could build an AI that ends up doing very dangerous things without ever asking it “Will you do things I don’t like if given more capability?” or some other similar question that requires it to execute the treacherous turn. In contrast, if the developers did something like build a testing world with toy humans in it who could be manipulated in a way detectable to the developers, and placed the AI in the toy testing world, then it seems like this AI would be forced into a position where it either acts according to its true incentives (manipulates the humans and is detected), or executes the treacherous turn (abstains from manipulating the humans so developers will trust it more). So it seems like this wouldn’t happen if the developers are trying to test for treacherous turn behaviour during development.
Are you interested in protocols involving multiple episodic questions (where you ask one question, wait for it to resolve, then ask another question)?
Submission: low-bandwidth oracle
Plan Criticism: Given a plan to build an aligned AI, put together a list of possible lines of thought to think about problems with the plan (open questions, possible failure modes, criticisms, etc.). Ask the oracle to pick one of these lines of thought, pick another line of thought at random, and spend the next time period X thinking about both. Then judge which line of thought was more useful to think about (where lines of thought that spot some fatal missed problem are judged to be very useful) and reward the oracle if its suggestion was picked.
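A minimal sketch of one episode of this protocol, to make the reward structure explicit. The names `plan_criticism_episode`, `oracle_pick` and `judge_more_useful` are hypothetical, and the human step of thinking for time period X and then judging is stubbed out as a callback.

```python
import random
from typing import Callable, Sequence


def plan_criticism_episode(
    lines_of_thought: Sequence[str],
    oracle_pick: Callable[[Sequence[str]], int],
    judge_more_useful: Callable[[str, str], str],
    rng: random.Random,
) -> float:
    """Run one episode and return the oracle's reward (1.0 or 0.0)."""
    # Low-bandwidth: the oracle only outputs an index into a fixed list.
    oracle_idx = oracle_pick(lines_of_thought)
    # Baseline: a different line of thought, chosen uniformly at random.
    others = [i for i in range(len(lines_of_thought)) if i != oracle_idx]
    baseline_idx = rng.choice(others)
    suggestion = lines_of_thought[oracle_idx]
    baseline = lines_of_thought[baseline_idx]
    # Stand-in for humans spending time period X on both, then judging.
    winner = judge_more_useful(suggestion, baseline)
    return 1.0 if winner == suggestion else 0.0


def judge(a: str, b: str) -> str:
    """Toy judge: a line of thought that finds a fatal problem always wins."""
    return a if "fatal" in a else b


thoughts = ["open question A", "failure mode B", "fatal flaw C"]
reward = plan_criticism_episode(thoughts, lambda ts: 2, judge, random.Random(0))
print(reward)
```

The oracle is only ever compared against a random baseline, so its reward depends on a single binary human judgment per episode.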
AI systems end up controlled by a group of humans representing a small range of human values (ie. an ideological or religious group that imposes values on everyone else). While not caused only by AI design, it is possible that design decisions could impact the likelihood of this scenario (ie. at what point are values loaded into the system/how many people’s values are loaded into the system), and this is relevant for overall strategy.
Failure to learn how to deal with alignment in the many-humans, many-AIs case even if single-human, single-AI alignment is solved (which I think Andrew Critch has talked about). For example, AIs negotiating on behalf of humans take the stance described in https://arxiv.org/abs/1711.00363 of agreeing to split control of the future according to which human’s priors are most accurate (on potentially irrelevant issues) if this isn’t what humans actually want.
Maybe one AI philosophy service could look like this: it would ask you a bunch of other questions that are simpler than the problem of qualia, then show you what those answers imply about the problem of qualia if you use some method of reconciling those answers.
Re: Philosophy as interminable debate, another way to put the relationship between math and philosophy:
Philosophy as weakly verifiable argumentation
Math is solving problems by looking at the consequences of a small number of axiomatic reasoning steps. For something to be math, we have to be able to ultimately cash out any proof as a series of these reasoning steps. Once something is cashed out in this way, it takes a small constant amount of time to verify any reasoning step, so we can verify the whole proof in time polynomial in its length.
Philosophy is solving problems where we haven’t figured out a set of axiomatic reasoning steps. Any non-axiomatic reasoning step we propose could end up having arguments that we hadn’t thought of that would lead us to reject that step. And those arguments themselves might be undermined by other arguments, and so on. Each round of debate lets us add another level of counter-arguments. Philosophers can make progress when they have some good predictor of whether arguments are good or not, but they don’t have access to certain knowledge of arguments being good.
Another difference between mathematics and philosophy is that in mathematics we have a well-defined set of objects and a well-defined problem we are asking about, whereas in philosophy we are trying to ask questions about things that exist in the real world and/or we are asking questions that we haven’t crisply defined yet.
When we come up with a set of axioms and a description of a problem, we can move that problem from the realm of philosophy to the realm of mathematics. When we come up with some method we trust of verifying arguments (ie. replicating scientific experiments), we can move problems out of philosophy to other sciences.
It could be the case that philosophy grounds out in some reasonable set of axioms which we don’t have access to now for computational reasons—in which case it could all end up in the realm of mathematics. It could be the case that, for all practical purposes, we will never reach this state, so it will remain in the “potentially unbounded DEBATE round case”. I’m not sure what it would look like if it could never ground out—one model could be that we have a black box function that performs a probabilistic evaluation of argument strength given counter-arguments, and we go through some process to get the consequences of that, but it never looks like “here is a set of axioms”.
I guess it feels like I don’t know how we could know that we’re in the position that we’ve “solved” meta-philosophy. It feels like the thing we could do is build a set of better and better models of philosophy and check their results against held-out human reasoning and against each other.
I also don’t think we know how to specify a ground truth reasoning process that we could try to protect and run forever which we could be completely confident would come up with the right outcome (where something like HCH is a good candidate but potentially with bugs/subtleties that need to be worked out).
I feel like I have some (not well justified and possibly motivated) optimism that this process yields something good fairly early on. We could gain confidence that we are in this world if we build a bunch of better and better models of meta-philosophy and observe at some point the models continue agreeing with each other as we improve them, and that they agree with various instantiations of protected human reasoning that we run. If we are in this world, the thing we need to do is just spend some time building a variety of these kinds of models and produce an action that looks good to most of them. (Where agreement is not “comes up with the same answer” but more like “comes up with an answer that other models think is okay and not disastrous to accept”).
Do you think this would lead to “good outcomes”? Do you think some version of this approach could be satisfactory for solving the problems in Two Neglected Problems in Human-AI Safety?
Do you think there’s a different kind of thing that we would need to do to “solve metaphilosophy”? Or do you think that working on “solving metaphilosophy” roughly cashes out as “work on coming up with better and better models of philosophy in the model I’ve described here”?
A couple ways to implement a hybrid approach with existing AI safety tools:
Logical Induction: Specify some computationally expensive simulation of idealized humans. Run a logical inductor with the deductive process running the simulation and outputting what the humans say after time x in simulation, as well as statements about what non-idealized humans are saying in the real world. The inductor should be able to provide beliefs about what the idealized humans will say in the future informed by information from the non-idealized humans.
HCH/IDA: The HCH-humans demonstrate a reasoning process which aims to predict the output of a set of idealized humans using all available information (which can include running simulations of idealized humans or information from real humans). The way that the HCH tree uses information about real humans involves looking carefully at their circumstances and asking things like “how do the real human’s circumstances differ from the idealized human’s?” and “is the information from the real human compromised in some way?”
It seems like for Filtered-HCH, the application in the post you linked to, you might be able to do a weaker version where you label any computation that you can’t understand in kN steps as problematic, only accepting things you think you can efficiently understand. (But I don’t think Paul is arguing for this weaker version).
RL is typically about sequential decision-making, and I wasn’t sure where the “sequential” part came in.
I guess I’ve used the term “reinforcement learning” to refer to a broader class of problems including both one-shot bandit problems and sequential decision-making problems. In this view, the feature that makes RL different from supervised learning is not that we’re trying to figure out how to act in an MDP/POMDP, but instead that we’re trying to optimize a function that we can’t take the derivative of (in the MDP case, it’s because the environment is non-differentiable, and in the approval learning case, it’s because the overseer is non-differentiable).
Re: scenario 3, see The Evitable Conflict, the last story in Isaac Asimov’s “I, Robot”:
“Stephen, how do we know what the ultimate good of Humanity will entail? We haven’t at our disposal the infinite factors that the Machine has at its! Perhaps, to give you a not unfamiliar example, our entire technical civilization has created more unhappiness and misery than it has removed. Perhaps an agrarian or pastoral civilization, with less culture and less people would be better. If so, the Machines must move in that direction, preferably without telling us, since in our ignorant prejudices we only know that what we are used to, is good – and we would then fight change. Or perhaps a complete urbanization, or a completely caste-ridden society, or complete anarchy, is the answer. We don’t know. Only the Machines know, and they are going there and taking us with them.”
Yeah, to some extent. In the Lookup Table case, you need to have a (potentially quite expensive) way of resolving all mistakes. In the Overseer’s Manual case, you can also leverage humans to do some kind of more robust reasoning (for example, they can notice a typo in a question and still respond correctly, even if the Lookup Table would fail in this case). Though in low-bandwidth oversight, the space of things that participants could notice and correct is fairly limited.
Though I think this still differs from HRAD in that it seems like the output of HRAD would be a much smaller thing in terms of description length than the Lookup Table, and you can buy extra robustness by adding many more human-reasoned things into the Lookup Table (ie. automatically add versions of all questions with typos that don’t change the meaning of a question into the Lookup Table, add 1000 different sanity check questions to flag that things can go wrong).
So I think there are additional ways the system could correct mistaken reasoning relative to what I would think the output of HRAD would look like, but you do need to have processes that you think can correct any way that reasoning goes wrong. So the problem could be less difficult than HRAD, but still tricky to get right.
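To illustrate the “automatically add versions of all questions with typos” idea, here is a minimal sketch. It assumes the Lookup Table is a plain question-to-answer dictionary and that “typo” means one swap of adjacent characters; both are simplifications of mine, and the function names are hypothetical. It also only guards against the most obvious way a typo could change the meaning, namely colliding with another question.

```python
from typing import Dict, Set


def transposition_typos(question: str) -> Set[str]:
    """All strings obtained by swapping one pair of adjacent, distinct characters."""
    variants = set()
    for i in range(len(question) - 1):
        if question[i] != question[i + 1]:
            swapped = question[:i] + question[i + 1] + question[i] + question[i + 2:]
            variants.add(swapped)
    return variants


def expand_lookup_table(table: Dict[str, str]) -> Dict[str, str]:
    """Return a copy of `table` with unambiguous typo variants added."""
    additions: Dict[str, str] = {}
    ambiguous: Set[str] = set()
    for question, answer in table.items():
        for variant in transposition_typos(question):
            if variant in table:
                continue  # the "typo" is itself a real question; leave it alone
            if variant in additions and additions[variant] != answer:
                ambiguous.add(variant)  # two questions disagree on this variant
            additions[variant] = answer
    safe = {v: a for v, a in additions.items() if v not in ambiguous}
    return {**table, **safe}


table = {"What is 2+2?": "4"}
expanded = expand_lookup_table(table)
print(expanded["Waht is 2+2?"])
```

This buys robustness only against the specific error class you enumerated, which is the point made above: you still need a process that covers every way the reasoning can go wrong.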
Thanks, this position makes more sense in light of Beyond Astronomical Waste (I guess I have some concept of “a pretty good future” that is fine with something like a bunch of human-descended beings living happy lives that misses out on the sort of things mentioned in Beyond Astronomical Waste, and “optimal future” which includes those considerations). I buy this as an argument that “we should put more effort into making philosophy work to make the outcome of AI better, because we risk losing large amounts of value” rather than “our efforts to get a pretty good future are doomed unless we make tons of progress on this” or something like that.
“Thousands of millions” was a typo.
What is the motivation for using RL here?
I see the motivation as this: given practical compute limits, it may be much easier to have the system find an action the overseer approves of than to imitate the overseer directly. Using RL also allows you to use any advances that are made in RL by the machine learning community to try to remain competitive.
Would this still be a problem if we were training the agent with SL instead of RL?
Maybe this could happen with SL if SL does some kind of large search and finds a solution that looks good but is actually bad. The distilled agent would then learn to identify this action and reproduce it, which implies the agent learning some facts about the action to efficiently locate it with much less compute than the large search process. Knowing what the agent knows would allow the overseer to learn those facts, which might help in identifying this action as bad.
I don’t understand why we want to find this $X^*$ in the imitation learning case.
Ah, with this example the intent was more like “we can frame what the RL case is doing as finding $X^*$; let’s show how we could accomplish the same thing in the imitation learning case (in the limit of unlimited compute)”.
The reverse mapping (imitation to RL) just consists of applying reward 1 to $M_2$’s demonstrated behaviour (which could be “execute some safe search and return the results”), and reward 0 to everything else.
What is $p_M(X^*)$?
$p_M(X^*)$ is the probability of outputting $X^*$ (where $p_M$ is a stochastic policy).
$M_2(\text{“How good is answer X to Y?”}) \cdot \nabla \log p_M(X)$
This is the REINFORCE gradient estimator (which tries to increase the log probability of actions that were rated highly)
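For concreteness, here is a minimal sketch of that estimator for a softmax policy over three candidate answers. The ratings, the function names, and the choice of a softmax parameterisation are all illustrative assumptions; the overseer is treated as a black box that is called but never differentiated.

```python
import math
import random


def softmax(logits):
    """Convert logits to probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_gradient(logits, x, score):
    """Single-sample estimate: score * grad_logits log p_M(x).

    For a softmax policy, d log p_M(x) / d logit_i = 1[i == x] - p_M(i).
    """
    probs = softmax(logits)
    return [score * ((1.0 if i == x else 0.0) - p) for i, p in enumerate(probs)]


def expected_gradient(logits, overseer_score):
    """Exact average of the estimator over X ~ p_M (tiny answer sets only)."""
    probs = softmax(logits)
    grad = [0.0] * len(logits)
    for x, p_x in enumerate(probs):
        sample = reinforce_gradient(logits, x, overseer_score(x))
        grad = [g + p_x * s for g, s in zip(grad, sample)]
    return grad


# Hypothetical overseer ratings for three answers; answer 1 is rated best.
ratings = [0.1, 0.9, 0.3]


def overseer_score(x):
    """Black-box stand-in for M2("How good is answer X to Y?")."""
    return ratings[x]


logits = [0.0, 0.0, 0.0]
rng = random.Random(0)
x = rng.choices(range(3), weights=softmax(logits))[0]  # sample X ~ p_M
one_sample = reinforce_gradient(logits, x, overseer_score(x))

# Deterministic ascent on the expected estimator, purely for illustration.
for _ in range(500):
    step = expected_gradient(logits, overseer_score)
    logits = [l + 1.0 * g for l, g in zip(logits, step)]
final_probs = softmax(logits)
print(one_sample, final_probs)
```

A single sample only pushes up the log probability of whichever answer happened to be drawn, in proportion to its rating. It is the average over samples that moves probability mass towards the highest-rated answer, which is why the loop uses the exact expectation rather than noisy samples.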
I guess the question was more from the perspective of: if the cost was zero then it seems like it would be worth running, so what part of the cost makes it not worth running (where I would think of cost as probably time to judge or availability of money to fund the contest).