Substituting parts of the computation (replace a slow, correct algorithm for part of the computation with a fast, approximate one)
Formally verifying properties of the computation
Informally checking properties of the computation (is this algorithm for making loan decisions fair?)
Debugging the computation, or more generally “modifying the computation to do what you actually want”
One situation is: maybe an HBO tree of size 10^20 runs into a security failure with high probability, but an HBO tree of size 10^15 doesn’t and is sufficient to output a good LBO overseer.
Yeah, that’s my current picture of it.
I understand that, but I don’t see why #2 is likely to be achievable. Corrigibility seems very similar to Wei Dai’s translation example, so it seems like there could be many deceptive actions that humans would intuitively recognize as not corrigible, but which would fool an early-stage LBO tree into assigning a high reward. This seems like it would be a clear example of “giving a behaviour a high reward because it is bad”. Unfortunately I can’t think of any good examples, so my intuition may simply be mistaken.
For factored cognition: I think the reason #2 might be achievable relies on assuming that there is some reason that the bad features is selected in the first place and is assigned high reward. For example, this might have happened because the agent ran a simulation forward, and then realized that if they punch the human they can take the reward button from them. The hope is that we can figure out that the simulation process happened and why it lead to the wrong thing (or outlaw simluations of this form in the first place).
For factored evaluation, I think the story is a bit different (relying on the other expert being able to understand the reasons for the reward assignment and point it out to the judge, but I don’t think the judge needs to be able to find it on there own).
Incidentally, it seems like Ought could feasibly test whether meta-execution is sufficient to ensure corrigibility; for example, a malicious expert could recommend deceptive/influence-seizing actions to an agent in a simulated environment, and the meta-execution tree would have to detect every deceptive action without any contextual knowledge . Are there any plans to do this?
No plans currently, but it would be interesting.
Suppose an action is evaluated as a linear combination of a set of human interpretable features.
The action “punch the human” could be selected because 1) many of the reward weights of these features could be wrong, or it could be selected because 2) there is one feature “this action prevents the human from turning me off” that is assigned high reward. I think the thing we’d want to prevent in this case is 2) but not 1), and I think that’s more likely to be achievable.
I think it’s a general method that is most applicable in LBO, but might still be used in HBO (eg. an HBO overseer can read one chapter of a math textbook, but this doesn’t let it construct an ontology that let’s it solve complicated math problems, so instead it needs to use meta-execution to try to manipulate objects that it can’t reason about directly.
I’d interpreted it as “using the HBO system to construct a “core for reasoning” reduces the chances of failure by exposing it to less inputs/using it for less total time”, plus maybe other properties (eg. maybe we could look at and verify an LBO overseer, even if we couldn’t construct it ourselves)
Possible source for optimization-as-a-layer: SATNet (differentiable SAT solver)
One way to try to measure capability robustness seperate from alignment robustness off of the training distribution of some system would be to:
use an inverse reinforcement learning algorithm infer the reward function of the off-distribution behaviour
train a new system to do as well on the reward function as the original system
measure the number of training steps needed to reach this point for the new system.
This would let you make comparisons between different systems as to which was more capability robust.
Maybe there’s a version that could train the new system using behavioural cloning, but it’s less clear how you measure when you’re as competent as the original agent (maybe using a discriminator?)
The reason for trying this is having a measure of competence that is less dependent on human judgement/closer to the systems’s ontology and capabilities.
I think the better version of this strategy would involve getting competing donations from both sides, using some weighting of total donations for/against pushing the button to set a probability of pressing the button, and tweaking the weighting of the donations such that you expect the probability of pressing the button will be low (because pressing the button threatens to lower the probability of future games of this kind, this is an iterated game rather than a one-shot).
For Alaska vs. Bali, alternative answer is “You could be convinced that either Alaska or Bali is a good vacation destination”. It’s an interesting question whether this could actually win in debate. I think it might have a better chance in Factored Evaluation, because we can spin up two seperate trees to view the most compelling argument for Alaska and the most compelling argument for Bali and verify that these are convincing. In debate, you’d need view either Alaska Argument before Bali Argument, or Bali Argument before Alaska Argument, and you might just be convinced by the first argument you see in which case you wouldn’t agree that you could be convinced either way.
I’d say that the claim is not sufficient—it might provide some alignment value, but it needs a larger story about how the whole computation is going to be safe. I do think that the HCH framework could make specifying an aligned GOFAI-like computation easier (but it’s hard to come up with a rigorous argument for this without pointing to some kind of specification that we can make claims about, which is something I’d want to produce along the way while proceeding with HCH-like approaches)
I think a cleaner way of stating condition 3 might be “there exists some policy with a non-zero chance of deception at some point in the future, with expected value for the mesa-optimizer higher than the policy of always cooperating”.
This could be met if the threat of modification never goes away, but maybe decreases over time. (at training time, system is monitored for defection every episode, at a deployment time it is monitored for defection 1/million episodes, not worthwhile to defect during training but worthwhile to defect at deployment time)
This could fail to be met if the mesa-optimizer believes that 1) there is a 1% chance that it will be detected by the base optimizer 2) if this happens, the base optimizer will be able to determine $O_mesa$ and give it a large negative reward, 100x more than the possible reward from the best defecting policy. (not sure if there’s any version of this that would work, since it seems hard to figure out $O_mesa$ and provide negative reward)
To me, It seems like the point of this story is that we could build an AI that ends up doing very dangerous things without ever asking it “Will you do things I don’t like if given more capability?” or some other similar question that requires it to execute the treacherous turn. In contrast, if the developers did something like build a testing world with toy humans in it who could be manipulated in a way detectable to the developers, and placed the AI in the toy testing world, then it seems like this AI would be forced into a position where it either acts in a way according to it’s true incentives (manipulate the humans and be detected), or execute the treacherous turn (abstain from manipulating the humans so developers will trust it more). So it seems like this wouldn’t happen if the developers are trying to test for treacherous turn behaviour during development.
Are you interested in protocols involving multiple episodic questions (where you ask one question, wait for it to resolve, then ask another question?)
Submission: low-bandwidth oracle
Plan Criticism: Given plan to build an aligned AI, put together a list of possible lines of thought to think about problems with the plan (open questions, possible failure modes, criticisms, etc.). Ask the oracle to pick one of these lines of thought, pick another line of thought at random, and spend the next time period X thinking about both, judge which line of thought was more useful to think about (where lines of thought that spot some fatal missed problem are judged to be very useful) and reward the oracle if its suggestion was picked.
AI systems end up controlled by a group of humans representing a small range of human values (ie. an ideological or religious group that imposes values on everyone else). While not caused only by AI design, it is possible that design decisions could impact the likelihood of this scenario (ie. at what point are values loaded into the system/how many people’s values are loaded into the system), and is relevant for overall strategy.
Failure to learn how to deal with alignment in the many-humans, many-AIs case even if single-human, single-AI alignment is solved (which I think Andrew Critch has talked about). For example, AIs negotiating on behalf of humans take the stance described in https://arxiv.org/abs/1711.00363 of agreeing to split control of the future according to which human’s priors are most accurate (on potentially irrelevant issues) if this isn’t what humans actually want.