Possible source for optimization-as-a-layer: SATNet (differentiable SAT solver)
One way to try to measure capability robustness separately from alignment robustness, off the training distribution of some system, would be to:
use an inverse reinforcement learning algorithm to infer the reward function of the off-distribution behaviour
train a new system to do as well on the inferred reward function as the original system
measure the number of training steps the new system needs to reach this point.
This would let you make comparisons between different systems as to which was more capability robust.
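The procedure above could be sketched as follows. This is a toy illustration, not a real implementation: the "original system" is a hard-coded stochastic policy, the IRL step is replaced by simple frequency counting, and the new system is trained by random hill-climbing so that training steps can be counted. All names and the environment are hypothetical stand-ins.

```python
import random

random.seed(0)
N_ACTIONS = 5

def original_policy():
    # observed off-distribution behaviour: strongly prefers action 3
    return random.choices(range(N_ACTIONS), weights=[1, 1, 1, 10, 1])[0]

def infer_reward(actions):
    # stand-in for a real IRL algorithm: reward proportional to
    # observed action frequency
    counts = [actions.count(a) for a in range(N_ACTIONS)]
    return [c / len(actions) for c in counts]

def avg_return(weights, reward, n=4000):
    acts = random.choices(range(N_ACTIONS), weights=weights, k=n)
    return sum(reward[a] for a in acts) / n

reward = infer_reward([original_policy() for _ in range(10_000)])
target = avg_return([1, 1, 1, 10, 1], reward)  # the original system's score

# Train a fresh system and count the steps needed to (roughly) match
# the original system's performance on the inferred reward.
weights, steps = [1.0] * N_ACTIONS, 0
while avg_return(weights, reward) < 0.9 * target:
    steps += 1
    candidate = list(weights)
    candidate[random.randrange(N_ACTIONS)] += 0.5
    if avg_return(candidate, reward) >= avg_return(weights, reward):
        weights = candidate

print("training steps to match the original system:", steps)
```

The step count is the proposed robustness measure; comparing it across systems (with the same training procedure for the new system) would give the relative comparison described above.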
Maybe there’s a version that could train the new system using behavioural cloning, but it’s less clear how you’d measure when the new system is as competent as the original agent (maybe using a discriminator?)
The reason for trying this is to have a measure of competence that is less dependent on human judgement and closer to the system’s ontology and capabilities.
I think the better version of this strategy would involve getting competing donations from both sides, using some weighting of total donations for/against pushing the button to set a probability of pressing the button, and tweaking the weighting so that you expect the probability of pressing the button to be low (because pressing the button threatens to lower the probability of future games of this kind; this is an iterated game rather than a one-shot).
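A minimal sketch of one possible weighting, assuming a single tunable damping parameter (hypothetical; the text doesn’t commit to a functional form). Damping keeps the press probability low even when "press" donations dominate, which is what preserves the iterated game:

```python
# press_probability is a hypothetical weighting: the damping factor is
# the knob you would tweak so the expected press probability stays low.
def press_probability(donations_for, donations_against, damping=0.1):
    total = donations_for + donations_against
    if total == 0:
        return 0.0
    return damping * donations_for / total

print(press_probability(900, 100))   # a 90% "press" share still gives only 0.09
```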
For Alaska vs. Bali, an alternative answer is “You could be convinced that either Alaska or Bali is a good vacation destination”. It’s an interesting question whether this could actually win in debate. I think it might have a better chance in Factored Evaluation, because we can spin up two separate trees to view the most compelling argument for Alaska and the most compelling argument for Bali and verify that these are convincing. In debate, you’d need to view either Alaska Argument before Bali Argument, or Bali Argument before Alaska Argument, and you might just be convinced by the first argument you see, in which case you wouldn’t agree that you could be convinced either way.
I’d say that the claim is not sufficient—it might provide some alignment value, but it needs a larger story about how the whole computation is going to be safe. I do think that the HCH framework could make specifying an aligned GOFAI-like computation easier (but it’s hard to come up with a rigorous argument for this without pointing to some kind of specification that we can make claims about, which is something I’d want to produce along the way while proceeding with HCH-like approaches)
I think a cleaner way of stating condition 3 might be “there exists some policy with a non-zero chance of deception at some point in the future, with expected value for the mesa-optimizer higher than the policy of always cooperating”.
This could be met if the threat of modification never goes away, but maybe decreases over time (at training time the system is monitored for defection every episode; at deployment time it is monitored for defection once per million episodes, so it’s not worthwhile to defect during training but it is worthwhile to defect at deployment time).
This could fail to be met if the mesa-optimizer believes that 1) there is a 1% chance that it will be detected by the base optimizer, and 2) if this happens, the base optimizer will be able to determine $O_{mesa}$ and give it a large negative reward, 100x larger than the possible reward from the best defecting policy. (I’m not sure any version of this would work, since it seems hard to figure out $O_{mesa}$ and provide negative reward.)
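Working through the numbers in that example: with a 1% detection chance and a penalty 100x the best defection reward $R$, defection has negative expected value, so condition 3 would fail to hold.

```python
R = 1.0                          # reward from the best defecting policy
p_detect = 0.01                  # chance the base optimizer catches it
penalty = -100 * R               # reward if caught
ev_defect = (1 - p_detect) * R + p_detect * penalty
print(ev_defect)                 # ~ -0.01: worse than always cooperating
```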
To me, it seems like the point of this story is that we could build an AI that ends up doing very dangerous things without ever asking it “Will you do things I don’t like if given more capability?” or some other similar question that requires it to execute the treacherous turn. In contrast, if the developers did something like build a testing world with toy humans in it who could be manipulated in a way detectable to the developers, and placed the AI in the toy testing world, then it seems like this AI would be forced into a position where it either acts according to its true incentives (manipulate the humans and be detected), or executes the treacherous turn (abstains from manipulating the humans so developers will trust it more). So it seems like this wouldn’t happen if the developers are trying to test for treacherous turn behaviour during development.
Are you interested in protocols involving multiple episodic questions (where you ask one question, wait for it to resolve, then ask another question)?
Submission: low-bandwidth oracle
Plan Criticism: Given a plan to build an aligned AI, put together a list of possible lines of thought to think about problems with the plan (open questions, possible failure modes, criticisms, etc.). Ask the oracle to pick one of these lines of thought, pick another line of thought at random, spend the next time period X thinking about both, judge which line of thought was more useful to think about (where lines of thought that spot some fatal missed problem are judged to be very useful), and reward the oracle if its suggestion was picked.
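The protocol could be sketched like this. Everything here is a hypothetical stand-in: `oracle_pick` for the low-bandwidth oracle’s choice, `judge` for the human judgement after time period X, and the critique list for the prepared lines of thought.

```python
import random

critiques = ["failure mode A", "open question B", "criticism C"]

def oracle_pick(options):
    return options[0]            # stand-in for the low-bandwidth oracle

def judge(a, b):
    return a                     # stand-in: here the judge always favours
                                 # the first line of thought presented

def run_round():
    suggestion = oracle_pick(critiques)
    baseline = random.choice([c for c in critiques if c != suggestion])
    winner = judge(suggestion, baseline)
    return 1.0 if winner == suggestion else 0.0   # oracle's reward

print(run_round())
```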
AI systems end up controlled by a group of humans representing a small range of human values (ie. an ideological or religious group that imposes values on everyone else). While not caused only by AI design, it is possible that design decisions could impact the likelihood of this scenario (ie. at what point are values loaded into the system/how many people’s values are loaded into the system), and is relevant for overall strategy.
Failure to learn how to deal with alignment in the many-humans, many-AIs case even if single-human, single-AI alignment is solved (which I think Andrew Critch has talked about). For example, AIs negotiating on behalf of humans take the stance described in https://arxiv.org/abs/1711.00363 of agreeing to split control of the future according to which human’s priors are most accurate (on potentially irrelevant issues) if this isn’t what humans actually want.
Maybe one AI philosophy service could look like this: it would ask you a bunch of other questions that are simpler than the problem of qualia, then show you what those answers imply about the problem of qualia under some method of reconciling those answers.
Re: Philosophy as interminable debate, another way to put the relationship between math and philosophy:
Philosophy as weakly verifiable argumentation
Math is solving problems by looking at the consequences of a small number of axiomatic reasoning steps. For something to be math, we have to be able to ultimately cash out any proof as a series of these reasoning steps. Once something is cashed out in this way, it takes a small constant amount of time to verify each reasoning step, so we can verify a proof in polynomial time.
Philosophy is solving problems where we haven’t figured out a set of axiomatic reasoning steps. Any non-axiomatic reasoning step we propose could end up having arguments that we hadn’t thought of that would lead us to reject that step. And those arguments themselves might be undermined by other arguments, and so on. Each round of debate lets us add another level of counter-arguments. Philosophers can make progress when they have some good predictor of whether arguments are good or not, but they don’t have access to certain knowledge of arguments being good.
Another difference between mathematics and philosophy is that in mathematics we have a well-defined set of objects and a well-defined problem we are asking about. Whereas in philosophy we are trying to ask questions about things that exist in the real world and/or we are asking questions that we haven’t crisply defined yet.
When we come up with a set of axioms and a description of a problem, we can move that problem from the realm of philosophy to the realm of mathematics. When we come up with some method we trust of verifying arguments (ie. replicating scientific experiments), we can move problems out of philosophy to other sciences.
It could be the case that philosophy grounds out in some reasonable set of axioms which we don’t have access to now for computational reasons—in which case it could all end up in the realm of mathematics. It could be the case that, for all practical purposes, we will never reach this state, so it will remain in the “potentially unbounded DEBATE round case”. I’m not sure what it would look like if it could never ground out—one model could be that we have a black box function that performs a probabilistic evaluation of argument strength given counter-arguments, and we go through some process to get the consequences of that, but it never looks like “here is a set of axioms”.
I guess it feels like I don’t know how we could know that we’re in the position that we’ve “solved” meta-philosophy. It feels like the thing we could do is build a set of better and better models of philosophy and check their results against held-out human reasoning and against each other.
I also don’t think we know how to specify a ground truth reasoning process that we could try to protect and run forever which we could be completely confident would come up with the right outcome (where something like HCH is a good candidate but potentially with bugs/subtleties that need to be worked out).
I feel like I have some (not well justified and possibly motivated) optimism that this process yields something good fairly early on. We could gain confidence that we are in this world if we build a bunch of better and better models of meta-philosophy and observe at some point the models continue agreeing with each other as we improve them, and that they agree with various instantiations of protected human reasoning that we run. If we are in this world, the thing we need to do is just spend some time building a variety of these kinds of models and produce an action that looks good to most of them. (Where agreement is not “comes up with the same answer” but more like “comes up with an answer that other models think is okay and not disastrous to accept”).
Do you think this would lead to “good outcomes”? Do you think some version of this approach could be satisfactory for solving the problems in Two Neglected Problems in Human-AI Safety?
Do you think there’s a different kind of thing that we would need to do to “solve metaphilosophy”? Or do you think that working on “solving metaphilosophy” roughly cashes out as “work on coming up with better and better models of philosophy in the way I’ve described here”?
A couple ways to implement a hybrid approach with existing AI safety tools:
Logical Induction: Specify some computationally expensive simulation of idealized humans. Run a logical inductor with the deductive process running the simulation and outputting what the humans say after time x in simulation, as well as statements about what non-idealized humans are saying in the real world. The inductor should be able to provide beliefs about what the idealized humans will say in the future informed by information from the non-idealized humans.
HCH/IDA: The HCH-humans demonstrate a reasoning process which aims to predict the output of a set of idealized humans using all available information (which can include running simulations of idealized humans or information from real humans). The way the HCH tree uses information about real humans involves looking carefully at their circumstances and asking things like “how do the real human’s circumstances differ from the idealized human’s?” and “is the information from the real human compromised in some way?”
It seems like for Filtered-HCH, the application in the post you linked to, you might be able to do a weaker version where you label any computation that you can’t understand in kN steps as problematic, only accepting things you think you can efficiently understand. (But I don’t think Paul is arguing for this weaker version).
RL is typically about sequential decision-making, and I wasn’t sure where the “sequential” part came in.
I guess I’ve used the term “reinforcement learning” to refer to a broader class of problems including both one-shot bandit problems and sequential decision-making problems. In this view, the feature that makes RL different from supervised learning is not that we’re trying to figure out how to act in an MDP/POMDP, but that we’re trying to optimize a function that we can’t take the derivative of (in the MDP case, it’s because the environment is non-differentiable, and in the approval-learning case, it’s because the overseer is non-differentiable).
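A toy sketch of that common thread: optimizing a reward we can’t differentiate through (here a black-box 2-armed bandit) using the score-function/REINFORCE estimator instead of backpropagating through the environment or overseer. The bandit and learning rate are made up for illustration.

```python
import math
import random

def reward(action):              # non-differentiable black box
    return 1.0 if action == 1 else 0.0

random.seed(0)
theta = 0.0                      # logit for picking arm 1
for _ in range(2000):
    p = 1 / (1 + math.exp(-theta))
    action = 1 if random.random() < p else 0
    r = reward(action)
    grad_logp = (1 - p) if action == 1 else -p   # d/dtheta log pi(action)
    theta += 0.1 * r * grad_logp                 # REINFORCE update

p_final = 1 / (1 + math.exp(-theta))
print(p_final)                   # close to 1: learned the rewarded arm
```

The same estimator applies whether the black box is an MDP’s environment or an overseer’s approval signal, which is what makes the grouping above natural.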
Re: scenario 3, see The Evitable Conflict, the last story in Isaac Asimov’s “I, Robot”:
“Stephen, how do we know what the ultimate good of Humanity will entail? We haven’t at our disposal the infinite factors that the Machine has at its! Perhaps, to give you a not unfamiliar example, our entire technical civilization has created more unhappiness and misery than it has removed. Perhaps an agrarian or pastoral civilization, with less culture and less people would be better. If so, the Machines must move in that direction, preferably without telling us, since in our ignorant prejudices we only know that what we are used to, is good – and we would then fight change. Or perhaps a complete urbanization, or a completely caste-ridden society, or complete anarchy, is the answer. We don’t know. Only the Machines know, and they are going there and taking us with them.”
Yeah, to some extent. In the Lookup Table case, you need to have a (potentially quite expensive) way of resolving all mistakes. In the Overseer’s Manual case, you can also leverage humans to do some kind of more robust reasoning (for example, they can notice a typo in a question and still respond correctly, even if the Lookup Table would fail in this case). Though in low-bandwidth oversight, the space of things that participants could notice and correct is fairly limited.
Though I think this still differs from HRAD in that it seems like the output of HRAD would be a much smaller thing in terms of description length than the Lookup Table, and you can buy extra robustness by adding many more human-reasoned things into the Lookup Table (ie. automatically add versions of all questions with typos that don’t change the meaning of a question into the Lookup Table, add 1000 different sanity check questions to flag that things can go wrong).
So I think there are additional ways the system could correct mistaken reasoning relative to what I would think the output of HRAD would look like, but you do need to have processes that you think can correct any way that reasoning goes wrong. So the problem could be less difficult than HRAD, but still tricky to get right.