I think that while my intelligence is not greater than the alien’s, I would probably do the thing that you suggested, “don’t do anything the user would find terrible; acquire resources; make sure the user remains safe and retains effective control over those resources”, but if the aliens were to start to trust me enough to upgrade my cognitive abilities to be above theirs, I could very well end up causing disaster (from their perspective), either by subtly misunderstanding some fine point of their values/philosophical views (*), or by subverting the system through some design or implementation flaw. The point is that my behavior while my abilities are less than super-alien is not a very good indication of how safe I will eventually be.
(*) To expand on this, suppose that as my cognitive abilities increase, I develop increasingly precise models of the alien, and at some point I decide that I can satisfy the alien’s values better by using resources directly instead of letting the alien retain control (i.e., I could act more efficiently this way and I think that my model is as good as the actual alien), but it turns out that I’m wrong about how good my model is, and end up acting on a subtly-but-disastrously wrong version of the alien’s values / philosophical views.
or by subverting the system through some design or implementation flaw
I discuss the most concerning-to-me instance of this in problem (1) here; it seems like that discussion applies equally well to anything that might work fine at first but then break when you become a sufficiently smart reasoner.
I think the basic question is whether you can identify and exploit such flaws at exactly the same time that you recognize their possibility, or whether you can notice them slightly before. By “before” I mean with a version of you that is less clever, has less time to think, has a weaker channel to influence the world, or is treated with more skepticism and caution.
If any of these versions of you can identify the looming problem in advance, and then explain it to the aliens, then they can correct the problem. I don’t know if I’ve ever encountered a possible flaw that wasn’t noticeable “before” it was exploitable in one of these senses. But I may just be overlooking them, and of course even if we can’t think of any it’s not such great reassurance.
Of course even if you can’t identify such flaws, you can preemptively improve the setup for the aliens, in advance of improving your own cognition. So it seems like we never really care about the case where you are radically smarter than the designer of the system, we care about the case where you are very slightly smarter. (Unless this system-improvement is a significant fraction of the difficulty of actually improving your cognition, which seems far-fetched.)
The point is that my behavior while my abilities are less than super-alien is not a very good indication of how safe I will eventually be.
Other than the issue from the first part of this comment, I don’t really see why the behavior changes (in a way that invalidates early testing) when you become super-alien in some respects. It seems like you are focusing on errors you may make that would cause you to receive a low payoff in the RL game. As you become smarter, I expect you to make fewer such errors. I certainly don’t expect you to predictably make more of them.
(I understand that this is a bit subtle, because as you get smarter the problem also may get harder, since your plans will e.g. be subject to more intense scrutiny and to more clever counterproposals. But that doesn’t seem prone to lead to the kinds of errors you discuss.)
Other than the issue from the first part of this comment, I don’t really see why the behavior changes (in a way that invalidates early testing) when you become super-alien in some respects. It seems like you are focusing on errors you may make that would cause you to receive a low payoff in the RL game. As you become smarter, I expect you to make fewer such errors.
Paraphrasing, I think you’re saying that, if the reinforcement game setup continues to work, you expect to make fewer errors as you get smarter. And the only way getting smarter hurts you is if it breaks the game (by enabling you to fall into traps faster than you can notice and avoid them). Is that right?
I don’t see how this could predictably happen without something else going wrong first. I agree, and have acknowledged, that the question-answering solution is hard to test, so let’s focus on the RL approach. (Though I also don’t expect this to happen for the question-answering solution.) In this comment I’ll focus on the misunderstanding case.
So in the future, you expect to predictably make a decision which the aliens would consider catastrophically bad. It seems to me like:
If the solution would really be considered catastrophically bad, and it is chosen for evaluation, then it will receive a very low payoff—unless the scheme fails in some other way that we have not yet discussed.
So you would only make such mistakes if you thought that you would receive enough expected benefit from more aggressive decisions that it offsets this predictable possibility of a low payoff from catastrophic error.
But if you took more conservative actions, you could justify those actions (when they were evaluated) by explaining the predicted possibility of a catastrophic outcome. Unless something else has gone wrong, the aliens care more about averting this prospect of a bad outcome than saving time by you being more aggressive, so they shouldn’t penalize you for this.
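To make that expected-payoff comparison concrete, here is a minimal Python sketch with invented numbers; the probabilities, rewards, and the assumption that a catastrophic plan is reliably penalized when evaluated are all illustrative, not part of the scheme itself:

```python
# Toy comparison of a conservative vs. an aggressive policy in the RL game.
# All numbers are invented; only the structure of the comparison matters.

def expected_payoff(base_reward, extra_benefit, p_catastrophe, catastrophe_payoff):
    # If a catastrophic plan is caught at evaluation it gets a very low payoff;
    # otherwise the agent collects the base reward plus any extra benefit.
    return (1 - p_catastrophe) * (base_reward + extra_benefit) + p_catastrophe * catastrophe_payoff

conservative = expected_payoff(base_reward=1.0, extra_benefit=0.00,
                               p_catastrophe=0.001, catastrophe_payoff=0.0)
aggressive = expected_payoff(base_reward=1.0, extra_benefit=0.01,
                             p_catastrophe=0.05, catastrophe_payoff=0.0)

print(conservative, aggressive)  # 0.999 vs. 0.9595 under these made-up numbers
```

Under these invented numbers the aggressive policy loses, which is the point: it only predictably wins if something like the failures listed next has occurred.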
So if you behave aggressively even at the risk of a catastrophic error, it seems like one of the following must have gone wrong:
1. In fact the aliens wouldn’t be able to detect a catastrophic error during evaluation.
2. The conservative policy is actually worse than the aggressive policy in expectation, based on the considered judgment of the aliens.
3. The aliens wouldn’t accept the justification for conservatism, based on a correct argument that its costs are outweighed by the possibility for error.
4. This argument is wrong, or else it’s right but you wouldn’t recognize this argument or something like it.
Any of these could happen. 1 and 3 seem like they lead to more straightforward problems with the scheme, so they would be worthwhile to explore on other grounds. 2 doesn’t seem likely to me, unless we are dealing with a very minor catastrophe. But I am open to arguing about it. The basic question seems to be how tough it is to ask the aliens enough questions to avoid doing anything terrible.
The examples you give in the parallel thread don’t seem like they could present a big problem; you can ask the alien a modest number of questions like “how do you feel about the tradeoff between the world being destroyed and you controlling less of it?” And you can help to the maximum possible extent in answering them. Of course the alien won’t have perfect answers, but their situation seems better than the situation prior to building such an AI, when they were also making such tradeoffs imperfectly (presumably even more imperfectly, unless you are completely unhelpful to the aliens for answering such questions). And there don’t seem to be many plans where the cost of implementing the plan is greater than the cost of consulting the alien about how it feels about possible consequences of that plan.
Of course you can also get this information in other ways (e.g. look at writings and past behavior of the aliens) or ask more open-ended questions like “what are the most likely ways things could go wrong, given what I expect to do over the next week,” or pursue compromise solutions that the aliens are unlikely to consider too objectionable.
ETA: Actually, it’s fine if the catastrophic plan is not evaluated badly; all of the work can be done in the step where the aliens prefer conservative plans to aggressive ones in general, after you explain the possibility of a catastrophic error.
The conservative policy is actually worse than the aggressive policy in expectation, based on the considered judgment of the aliens.
What if this is true, because other aliens (people) have similar AIs, so the aggressive policy is considered better, in a PD-like game theoretic sense, but it would have been better for everyone if nobody had built such AIs?
With any of the black-box designs I’ve seen, I would be very reluctant to push the button that would potentially give it superhuman capabilities, even if we have theoretical reasons to think that it would be safe, and we’ve fixed all the problems we’ve detected while testing at lower levels of computing power. There are too many things that could go wrong with such theoretical reasoning, and easily many more flaws that won’t become apparent until the system becomes smarter. Basically the only reason to do it would be time pressure, due to the AI race or something else. (With other kinds of FAI designs, i.e., normative and white-box metaphilosophical, it seems that we can eventually be more confident about their safety but they are harder to design and implement in the first place, so we should wait for them if we have the option to.) Do you agree with this?
Basically the only reason to do it would be time pressure… Do you agree with this?
In some sense I agree. If there were no time pressure, then we would want to proceed in only the very safest way possible, which would not involve AI at all. My best guess would be to do a lot of philosophical and strategic thinking as unmodified and essentially unaided humans, perhaps for a very very long time. After that you might decide on a single, maximally inoffensive computational aid, and then repeat. But this seems like quite an alien scenario!
I am not sold that in milder cases you would be much better off with e.g. a normative AI than black box designs. Why is it less error prone? It seems like normative AI must perform well across a wide range of unanticipated environments, to a much greater extent than with black box designs, and with clearer catastrophic consequences for failure. It seems like you would want to do something that remains under the control of something as close to a human as possible, for as long as possible.
In some sense the black box approach is clearly more dangerous (ignoring time limits), since it doesn’t really get you closer to your goal. We will probably have to solve these other problems eventually. The black box metaphilosophical AI is really more like a form of cognitive enhancement. But it seems like enhancement is basically the right thing to do for now, even if we make the time crunch quite a bit milder.
There are too many things that could go wrong with such theoretical reasoning
I agree that you need to rely extensively on empirical evidence to gain confidence in your system (though I think this is probably true for other plausible approaches as well).
I don’t fully understand the argument against using such empirical evidence, or rather I don’t see how to make the argument go through without the prospect of a treacherous turn, which we haven’t addressed at all.
I agree that there are kinds of problems which do not arise when testing with weak systems, and these problems cannot be detected empirically until you are working with stronger systems. The main question is whether you can debug these problems while working with stronger systems.
For concreteness, imagine a bug in your system for rendering answers; a malicious answer can compromise the computer and assign the answerer a high reward. Let’s be pessimistic, and assume that the easiest way to compromise the computer is to create a devourer that will be released into the world, that is as smart as the answerer itself, and whose preferences are maximally destructive.
(This is going to repeat ideas from the great-uncle...)
It seems like this possibility is not catastrophic.
Before:
The attack can probably be noticed by the only slightly weaker agents which helped you design the current system. The only problem is if the attack cannot be noticed by any agent who isn’t clever enough to realize that such an attack is possible.
You can sanitize the agent’s answers using a similarly intelligent agent with more restricted outputs, e.g. one who just says “possible exploit” or “not a possible exploit.” (A sketch of this gating step appears after this list.)
The execution environment itself can make extensive use of similarly smart agents, and is therefore less likely to be vulnerable to an attack even if inexpertly designed.
After:
After the attack, the devourer is not much smarter than the other AIs in the world (for whom, by hypothesis, the control system has been working OK so far). So it doesn’t seem like it should do much damage before being contained.
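As a concrete picture of the “sanitize the agent’s answers” item above, here is a minimal Python sketch; the checker, renderer, and quarantine step are hypothetical stand-ins, not anyone’s actual design:

```python
# Before an answer reaches the (possibly buggy) rendering system, a similarly
# intelligent but output-restricted checker inspects it. Because the checker can
# only emit one of two fixed strings, a malicious answer cannot smuggle an
# exploit out through the checker's own output channel.

ALLOWED_VERDICTS = {"possible exploit", "not a possible exploit"}

def checked_render(answer, checker, renderer, quarantine):
    verdict = checker(answer)
    if verdict not in ALLOWED_VERDICTS:
        raise RuntimeError("checker violated its output restriction")
    if verdict == "possible exploit":
        quarantine(answer)   # hold the answer for review instead of rendering it
        return
    renderer(answer)
```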
The basic issue is that agent Z is preceded by nearly-as-smart agent Y, and so the prospects for danger are not that great if agent Z (and only agent Z) fails. The argument for catastrophic risks seems to come from the big disparity in power between the AI being controlled and the human overseer.
The concern seems to rest on the prospect of a treacherous turn: if agent Z is clever enough, then agent Y will also defect rather than helping you contain agent Z, even if in earlier tests agent Y seemed to be benign.
But I don’t see why this would happen for the RL system, and it seems like that’s the crux of the argument. I agree that it would happen for the question-answering system (I was the first to admit that the question-answering system was hard to test).
You may have other kinds of difficulties in mind, but all of the ones that I can think of seem to rest on a treacherous turn or something similar. Is there some other reason to expect failure to be catastrophic?
Is there some other reason to expect failure to be catastrophic?
I’m not pointing out any specific reasons, but just expect that in general, failures when dealing with large amounts of computing power can easily be catastrophic. You have theoretical arguments for why they won’t be, given a specific design, but again I am skeptical of such arguments in general.
I agree there is some risk that cannot be removed with either theoretical arguments or empirical evidence. But why is it greater for this kind of AI than any other, and in particular than white-box metaphilosophical or normative AI?
Normative AI seems like by far the worst, since:
it generally demonstrates a treacherous turn if you make an error, and
it must work correctly across a range of unanticipated environments.
So in that case we have particular concrete reasons to think that empirical testing won’t be adequate, in addition to the general concern that empirical testing and theoretical argument are never sufficient. To me, white box metaphilosophical AI seems somewhere in between.
(One complaint is that I just haven’t given an especially strong theoretical argument. I agree with that, and I hope that whatever systems people actually use, they are backed by something more convincing. But the current state of the argument seems like it can’t point in any direction other than in favor of black box designs, since we don’t yet have any arguments at all that any other kind of system could work.)
What if this is true, because other aliens (people) have similar AIs, so the aggressive policy is considered better, in a PD-like game theoretic sense, but it would have been better for everyone if nobody had built such AIs?
It seems like the question is: “How much more productive is the aggressive policy?”
It looks to me like the answer is “Maybe it’s 1% cheaper or something, though probably less.” In this case, it doesn’t seem like the AI itself is introducing (much of) a PD situation, and the coordination problem can probably be solved.
I don’t know whether you are disagreeing about the likely cost of the aggressive policy, or the consequences of slight productivity advantages for the aggressive policy. I discuss this issue a bit here, in a post I wrote a few days ago but just got around to posting.
Of course there may be orthogonal reasons that the AI faces PD-like problems, e.g. it is possible to expand in an undesirably destructive way by building an unrelated and dangerous technology. Then either:
1. The alien user would want to coordinate in the prisoner’s dilemma. In this case, the AI will coordinate as well (unless it makes an error leading to a lower reward).
2. The alien user doesn’t want to coordinate in the prisoner’s dilemma. But in this case, the problem isn’t with the AI at all. If the users hadn’t built AI they would have faced the same problem.
I don’t know which of these you have in mind. My guess is you are thinking of (2) if anything, but this doesn’t really seem like an issue to do with AI control. Yes, the AI may have a differential effect on e.g. the availability of destructive tech and our ability to coordinate, and yes, we should try to encourage differential progress in AI capabilities just like we want to encourage differential progress in society’s capabilities more broadly. But I don’t see how any solution to the AI control problem is going to address that issue, nor does it seem especially concerning when compared to the AI control problem.
It looks to me like the answer is “Maybe it’s 1% cheaper or something, though probably less.”
Maybe we have different things in mind for “aggressive policy”. I was thinking of something like “give the AI enough computing power to achieve superhuman intelligence so it can hopefully build a full-fledged FAI for the user” vs the “conservative policy” of “keep the AI at its current level where it seems safe, and find another way to build an FAI”.
A separate but related issue is that it appears such an AI can either be a relatively safe or unsafe AI, depending on the disposition of the overseer (since an overseer less concerned with safety would be more likely to approve of potentially unsafe modifications to the AI). In a sidenote of the linked article, you wrote about why unsafe but more efficient AI projects won’t overtake the safer AI projects in AI research:
I think the main issue is that the safe projects can enjoy greater economies of scale, and can make life somewhat more annoying for the unsafe projects by trading with them on less favorable terms. I think we can see plenty of models for this sort of thing in the world today.
But how will the safe projects exclude the unsafe projects from economies of scale and favorable terms of trade, if the unsafe projects are using the same basic design but just have overseers who care more about capability than safety?
But how will the safe projects exclude the unsafe projects from economies of scale and favorable terms of trade, if the unsafe projects are using the same basic design but just have overseers who care more about capability than safety?
Controlling the distribution of AI technology is one way to make someone’s life harder, but it’s not the only way. If we imagine a productivity gap as small as 1%, it seems like it doesn’t take much to close it.
(Disclaimer: this is unusually wild speculation; nothing I say is likely to be true, but hopefully it gives the flavor.)
If unsafe projects perfectly pretend to be safe projects, then they aren’t being more efficient. So it seems like we can assume that they are observably different from safe projects. (For example, there can’t just be complexity-loving humans who oversee projects exactly as if they had normal values; they need to skimp on oversight in order to actually be more efficient. Or else they need to differ in some other way...) If they are observably different, then possible measures include:
Even very small tax rates coupled with redistribution that is even marginally better-directed at safe projects (e.g. that goes to humans)
Regulatory measures to force everyone to incur the overhead, or most of the overhead, of being safe, e.g. lower bounds on human involvement.
Today many trades involve trust and understanding between the parties (e.g. if I go work for you). Probably some trades will retain this character. Honest people may be less happy to trade with those they expect to be malicious. I doubt this would be a huge factor, but 1% seems tiny.
Even in this scenario it may be easy to make technology which is architecturally harder to use by unsafe projects. E.g., it’s not clear whether the end user is the only overseer, or whether some oversight can be retained by law enforcement or the designers or someone else.
Of course unsafe projects can go to greater lengths in order to avoid these issues, for example by moving to friendlier jurisdictions or operating a black market in unsafe technology. But as these measures become more extreme they become increasingly easy to identify. If unsafe jurisdictions and black markets have only a few percent of the population of the world, then it’s easy to see how they could be less efficient.
(I’d also expect e.g. unsafe jurisdictions to quickly cave under international pressure, if the rents they could extract were a fraction of a percent of total productivity. They could easily be paid off, and if they didn’t want to be paid off, they would not be militarily competitive.)
All of these measures become increasingly implausible at large productivity differentials. And I doubt that any of these particular foreseeable measures will be important. But overall, given that there are economies of scale, I find it very likely that the majority can win. The main question is whether they care enough to.
Normally I am on the other side of a discussion similar to this one, but involving much larger posited productivity gaps and a more confident claim (things are so likely to be OK that it’s not worth worrying about safety). Sorry if you were imagining a very much larger gap, so that this discussion isn’t helpful. And I do agree that there is a real possibility that things won’t be OK, even for small productivity gaps, but I feel like it’s more likely than not to be OK.
Also note that at a 1% gap, we can basically wait it out. If 10% of the world starts out malicious, then by the time the economy has grown 1000x, then 11% of the world is malicious, and it seems implausible that the AI situation won’t change during that time—certainly contemporary thinking about AI will be obsoleted, in an economic period as long as 0-2015AD. (The discussion of social coordination is more important in the case where there are larger efficiency gaps, and hence probably larger differences in how the projects look and what technology they need.)
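For concreteness, here is the arithmetic behind that “10% becomes roughly 11%” figure, under one reading of a 1% gap (the unsafe sector’s growth exponent is 1% higher than the safe sector’s); the exact ending share depends on how the gap is interpreted:

```python
# Safe sector grows 1000x; the unsafe sector grows at a rate 1% higher,
# i.e. by a factor of 1000**1.01 over the same period.
growth_safe = 1000.0
growth_unsafe = growth_safe ** 1.01

start_share = 0.10  # fraction of the world that starts out malicious
malicious = start_share * growth_unsafe
safe = (1 - start_share) * growth_safe

print(malicious / (malicious + safe))  # ~0.106: still only a slightly larger minority
```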
ETA: Really the situation is not so straightforward, since 1% more productivity leads to more than 1% more profit; overall this issue really seems too complicated for this kind of vague theoretical speculation to be meaningfully accurate, but I hope I’ve given the basic flavor of my thinking.
And finally, I intended 1% as a relatively conservative estimate. I don’t see any particular reason you need to have so much waste, and I wouldn’t be surprised if it ends up much lower, if future people end up pursuing some strategy along these lines.
1% seems really low to me. Suppose for example that the AI invents a modification to itself, which is meant to improve its performance. A cautious overseer might demand an explanation of the improvement and why it’s safe, in terms that he can understand, while an incautious overseer might be willing to just approve the modification right away and start using it. It seems to me that the cost of developing an understandable and convincing explanation of the improvement and its safety and then waiting for the overseer to process that, could easily be greater than 1% (or even 100%) of the cost of inventing the improvement itself.
Also, caution/safety is a matter of degree, and it seems hard to define what “unsafe” means, for the purpose of imposing a penalty on all unsafe projects. (As you said, it would be even safer to not use any machine aid at all until much later. Where/how do you draw the line between “safe enough” and “unsafe”?) It also seems hard to tell which projects are skimping on safety from the outside (you can observe that they are advancing faster than expected given their resource base, but how do you rule out other explanations for that?), while from the inside, the workers “in the know” can probably find all kinds of justifications for doing what they are doing.
Really the situation is not so straightforward, since 1% more productivity leads to more than 1% more profit
Yes, I was about to point this out. In many areas the difference between the best and the second best is already the difference between making a profit and making a loss. This will be even more true for AI, since if AIs can be copied, there won’t be many fields where it makes economic sense to hire a 99% efficient AI instead of a copy of a 100% efficient AI.
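A toy illustration of that best-vs-second-best point, with invented revenue and cost figures; the only claim is structural, that a thin margin amplifies a small productivity gap into a large profit gap:

```python
cost = 98.0              # identical costs for both projects
revenue_best = 100.0     # output of the 100%-efficient AI
revenue_second = 99.0    # output of the 99%-efficient AI, 1% less productive

profit_best = revenue_best - cost      # 2.0
profit_second = revenue_second - cost  # 1.0

print(profit_best / profit_second)  # 2.0: a 1% productivity gap becomes a 2x profit gap
```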
In many areas the difference between the best and the second best is already the difference between making a profit and making a loss.
Without this issue the “wait it out” solution would seem almost completely satisfactory to me for a gap as small as 1% (or even 5%). Most of my comment was about making unsafe projects less efficient, which (I think) addresses this point.
Also note that a 1% productivity benefit only leads to a much-greater-than-1% profit advantage if you are trading extensively with the rest of the world, so that most of your revenue is from sales (and the rest of the world can tell you are the most efficient, or at least that you are selling a lot of stuff cheaply). It’s much easier to regulate or make life harder for someone who is trading extensively with the rest of the world than someone who is operating in secret, so this seems to strengthen the arguments given in the body of my comment.
Also, caution/safety is a matter of degree, and it seems hard to define what “unsafe” means, for the purpose of imposing a penalty on all unsafe projects
The idea is just to make the relative cost of safety as low as possible. In the linked post I gave a quantification of safety. So let’s consider some 99.9% safe project, and the level of oversight it requires.
If this oversight is expensive because it involves using some resource (like involving human overseers, or periodically pausing as you wait on an overseer, or whatever) then it would be sufficient to require each project to use that resource, or to provide that resource for free to any project using tax dollars, or so on.
Alternatively, if there is some kind of oversight (e.g. periodic high-intensity audits, or AI oversight of AI projects, or mandated close involvement of human auditors) then the goal would just be to ensure that the price of evading detection eats up the efficiency benefits of unsafety. This looks pretty plausible to me, but you may be more skeptical about the feasibility of oversight.
A cautious overseer might demand an explanation of the improvement and why it’s safe, in terms that he can understand
In this proposal:
A cautious overseer demands such an argument with very small probability. I’ll write soon about just how small I think this probability can safely be; I think in the original post I suggested 1%, but I think it could be much lower. If the check is only done with probability 1/10000, then it’s fine if the cost is 10,000% of the cost of implementing the project. (The expected-cost arithmetic is sketched after this list.)
The human overseer has very little role in the early stages of the evaluation, and certainly they don’t have to understand the whole proposal. In my proposal as written she has to relay questions between different AIs, but of course you could automate that completely, so that the first stages of evaluation are just done by other AIs (whose evaluations are accountable to other AIs, whose evaluations are accountable to other AIs… whose evaluations are accountable to the overseer). At worst, the overseer’s role is similar to the arbitrator in this scheme, though she has many additional techniques at her disposal.
If the world is moving fast, the main problem is probably the latency of the human evaluation. But there are a bunch of plausible-seeming techniques for getting to low latency. I hope to write about this soon as well.
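The expected-cost arithmetic behind the first point in the list above, with the 1/10,000 figure used purely as an illustration:

```python
p_check = 1 / 10_000         # probability a given decision is selected for evaluation
relative_audit_cost = 100.0  # a full evaluation costs 10,000% of the underlying project

expected_overhead = p_check * relative_audit_cost
print(expected_overhead)  # 0.01, i.e. about a 1% expected overhead
```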
ETA: What I should have said is that an overseer requests explanations very rarely during normal, intended operation. But this relies on the AI’s ability to find a policy which the overseer will predictably approve of.
Perhaps your point is that a more cautious overseer should request explanations more often, owing to the AI’s limited ability to predict the overseer’s reaction. But if so, we are going around in circles. What is the nature of these improvements, such that whether they are a good idea or not depends in such a detailed way on the values or philosophical views of the overseer? Why can’t one AI build an improved AI which also follows an innocuous policy like “don’t do anything terrible; acquire resources; let the user control those resources”?
Even if the actual biological overseer doesn’t evaluate a proposal, the AI still has to predict what the overseer’s judgement would be, if he or she were to, right? That involves modeling the overseer and the other sub-AIs (each of which is also modeling the overseer and other sub-AIs) and coming up with arguments that can convince the modeled overseer/AIs, etc. I don’t see how that could be only 1% less efficient, compared to an incautious overseer who the AI knows probably won’t demand such arguments or consult lots of sub-AIs and would be willing to try out any improvements that aren’t obviously dangerous.
Separately, the “innocuous” AI you talked about suffers from two efficiency-sapping problems: the AI isn’t sure what the overseer’s values are, and the overseer isn’t sure what his own values are. This forces the AI to avoid certain actions that may be highly efficient but potentially ethically disastrous. For example suppose the overseer isn’t sure whether negative feedback received by reinforcement-based agents has negative moral value, so the AI has to avoid building lots of reinforcement-based subagents. Another example might be that the AI faces a lot of everyday ethical questions in the course of acquiring resources, and has to take the latency hit of asking the overseer about them every time. These efficiency-sapping problems are worse for overseers with more complex ethical views, and more uncertain ethical views.
If there is competition, everyone has strong incentives to quickly build “full-fledged FAIs” which can solve these ethical problems and know exactly what they should and shouldn’t do. People who are less cautious will again have an efficiency advantage while doing this. E.g., they might be fine with building a standard utility-maximizing AI based on a crude model of their current understanding of ethics. I do not see how mandatory oversights or other social techniques can prevent this outcome, if you’re imagining a world where your AI design is being used widely. Someone could make a copy of an existing AI based on your design, change the code or configuration files to make themselves the overseer and remove the mandatory oversights, and then ask the AI to make a “full-fledged FAI” for them, and if they happen to be of the incautious type, this will probably result in the kind of crude normative AI mentioned above (or worse, if they approve a bunch of “improvements” that end up subverting their intentions altogether).
Re paragraph 3: it seems like these are mostly considerations that might strengthen your conclusions if we granted that there was a big productivity difference between my design and a “a standard utility-maximizing AI based on a crude model of their current understanding of ethics.” But I would already be happy to classify a large productivity loss as a failure, so let’s just concentrate on the claimed productivity loss.
If there is competition, everyone has strong incentives to quickly build “full-fledged FAIs” which can solve these ethical problems and know exactly what they should and shouldn’t do
These incentives only operate if there is a big productivity difference.
Beyond that, if the kinds of issues people run into are “the AI faces a lot of everyday ethical questions in the course of acquiring resources,” then it really seems like what you need is a not-catastrophically-wrong model of human morality, which would probably just be built in mundane ways. I don’t see a strong argument that this would require lots of impressive conceptual progress, rather than being similar-in-kind to building a predictive model of anything else. But I suspect this is just a reflection of the disagreement about paragraph 2, which we should flesh out in the sibling.
Someone could make a copy of an existing AI based on your design, change the code or configuration files to make themselves the overseer and remove the mandatory oversights, and then ask the AI to make a “full-fledged FAI” for them, and if they happen to be of the incautious type, this will probably result in the kind of crude normative AI mentioned above (or worse, if they approve a bunch of “improvements” that end up subverting their intentions altogether).
This doesn’t seem like a very general argument against the possibility of mandatory oversight or technological handicapping, and the measures you describe seem like strawmen. I agree that whatever kind of oversight you employ, it will be possible to subvert it, whatever tax you charge it will be possible to evade it, and so on. But doing so will often come with a cost (as it does today), and it just doesn’t seem that hard to get it up to a 1% loss (say). We could talk more about the particular measures that could be taken for oversight; I’m sure we can both imagine many regulatory and technological approaches that would be more annoying to sidestep than an entry in a configuration file, but I suspect our disagreement comes from us imagining different productivity gaps.
The prospect of someone designing their own AI, which is very architecturally different from the rest of the world, just doesn’t seem especially troubling, unless you imagine that the rest of the world is using a significantly handicapped design. (See the first sentence of this reply.)
that involves modeling the overseer and the other sub-AIs (each of which is also modeling the overseer and other sub-AIs) and coming up with arguments that can convince the modeled overseer/AIs, etc
I don’t think this is necessary. Let’s go back to the alien analogy. Let’s assume that you have an action that effectively implements U = “acquire resources; don’t be evil; leave the user in control.” I claimed that pursuing U is not much harder than acquiring resources on its own; for the moment can we grant that, since it seems orthogonal to the issues in this paragraph?
So you have this good action, and you are deciding what to do. I claim that you should just do it, with no extra computation. Then, if it gets evaluated, you will have to think about rhetorical strategies to use when convincing the overseer it’s a good idea. Also, a copy of you will have to think about how to convince the overseer that it’s a bad idea. And the overseer will need to actually think about the issue. And so on.
There are some issues that you’ll need to think about before you commit to the action. For example, if there is a knockdown argument that this action is a bad idea, then you shouldn’t do it. But that thinking seems to be a necessary part of maximizing U, not an extra step (unless the arguing system is broken).
You might be able to do even better by thinking through all of the arguments in advance, and taking an action which is not merely good but which will fare especially well during an evaluation. But if the extra computational costs of doing so are significant, this has to be weighed against the benefits of actually finding a better plan. It seems like this basically comes down to the question “If you take the very most efficient plan, and we perform some elaborate procedure for arguing, can you actually ‘convince’ the overseer that it’s a good plan?”
If the answer is “no” then you are screwed anyway, before we even start thinking about computational overhead—you simply can’t take the best plan, because you wouldn’t be able to receive a good rating if you do. And if the answer is “yes” then the problem seems to go away. The justifiability of good plans does seem to be a key question about the feasibility of this scheme, which I am happy to discuss if that’s the underlying objection you have in mind.
This forces the AI to avoid certain actions that may be highly efficient but potentially ethically disastrous.
If the most efficient actions are ethically disastrous, then we have a fundamental problem which seems to me to be unrelated to AI safety, and which no AI control proposal will address. For example, if the most efficient strategy is to build a virus that kills everyone but you, and an AI is considering this strategy but has to reject it because it is unethical, then we are just out of luck. We could call this a problem with “AI,” but it’s really a problem with biotechnology.
If a certain kind of reinforcement learning is especially efficient but morally unacceptable, then that seems to be the same situation. What are we supposed to do, other than either accept the moral cost or adopt a good enough social solution to overcome the efficiency gap? What kind of solution might you hope to find that would make this kind of problem go away?
If the efficient actions merely might be ethically disastrous, then I guess the cost is supposed to be the time required to clarify the overseer’s values. Which brings us to:
Another example might be that the AI faces a lot of everyday ethical questions in the course of acquiring resources, and has to take the latency hit of asking the overseer about them every time.
The question is just how many distinct questions of this form there are, and how important they are to the AI’s plans. If there were merely a billion such questions it doesn’t seem like a big deal at all (though then a significant occupation of humans would be answering moral questions).
Even that strikes me as completely implausible given our experience so far (combined with my inability to see many future examples). If I were the user, and people were trying to optimize values using the range of policies available today, then it seems like they would have had to ask me no more than a dozen or so questions to get things basically right (i.e. realizing much more than 99% of the potential value from my perspective). So this seems to require moral problems to proliferate at a much faster rate than technological problems.
Do you disagree about the importance of hard ethical questions in the situation today (e.g. I am implicitly overlooking many important issues because I’m not used to dealing with an AI), or do you just expect more proliferation in the future?
Also, the problem of predicting human moral judgments doesn’t seem to be radically harder than the problem of e.g. negotiating with humans. I guess this is just another angle on “how many distinct moral questions do you have to answer?” since the real question is how much you can generalize from each answer. I don’t feel like there are that many hard-to-predict parameters before everything reduces to easy-to-predict consequences.
Maybe we have different things in mind for “aggressive policy”. I was thinking of something like “give the AI enough computing power to achieve superhuman intelligence so it can hopefully build a full-fledged FAI for the user” vs the “conservative policy” of “keep the AI at its current level where it seems safe, and find another way to build an FAI”.
Your examples of possible mistakes seemed to involve not knowing how the alien would feel about particular tradeoffs. This doesn’t seem related to how much computational power you have, except insofar as having more power might lead you to believe that it is safe to try and figure out what the alien thinks from first principles. But that’s not a necessary consequence of having more computing power, and I gave an argument that more computing power shouldn’t predictably lead to trouble.
Why do you think that more computing power requires a strategy which is “aggressive” in the sense of having a higher probability of catastrophic failure?
You might expect that building “full-fledged FAI” requires knowing a lot about the alien, and you won’t be able to figure all of that out in advance of building it. But again, I don’t understand why you can’t build an AI that implements a conservative strategy, in the sense of being quick to consult the user and unlikely to make a catastrophic error. So it seems like this just begs the question about the relative efficacy of conservative vs. aggressive strategies.
I think that while my intelligence is not greater than the alien’s, I would probably do the thing that you suggested, “don’t do anything the user would find terrible; acquire resources; make sure the user remains safe and retains effective control over those resource”, but if the aliens were to start to trust me enough to upgrade my cognitive abilities to be above theirs, I could very well end up causing disaster (from their perspective) either by subtly misunderstanding some fine point of their values/philosophical views (*), or by subverting the system through some design or implementation flaw. The point is that my behavior while my abilities are less than super-alien are not a very good indication of how safe I will eventually be.
(*) To expand on this, suppose that as my cognitive abilities increase, I develop increasingly precise models of the alien, and at some point I decide that I can satisfy the alien’s values better by using resources directly instead of letting the alien retain control (i.e., I could act more efficiently this way and I think that my model is as good as the actual alien), but it turns out that I’m wrong about how good my model is, and end up acting on a subtly-but-disastrously wrong version of the alien’s values / philosophical views.
I discuss the most concerning-to-me instance of this in problem (1) here; it seems like that discussion applies equally well to anything that might work fine at first but then break when you become a sufficiently smart reasoner.
I think the basic question is whether you can identify and exploit such flaws at exactly the same time that you recognize their possibility, or whether you can notice them slightly before. By “before” I mean with a version of you that is less clever, has less time to think, has a weaker channel to influence the world, or is treated with more skepticism and caution.
If any of these versions of you can identify the looming problem in advance, and then explain it to the aliens, then they can correct the problem. I don’t know if I’ve ever encountered a possible flaw that wasn’t noticeable “before” it was exploitable in one of these senses. But I may just be overlooking them, and of course even if we can’t think of any it’s not such great reassurance.
Of course even if you can’t identify such flaws, you can preemptively improve the setup for the aliens, in advance of improving your own cognition. So it seems like we never really care about the case where you are radically smarter than the designer of the system, we care about the case where you are very slightly smarter. (Unless this system-improvement is a significant fraction of the difficulty of actually improving your cognition, which seems far-fetched.)
Other than the issue from the first part of this comment, I don’t really see why the behavior changes (in a way that invalidates early testing) when you become super-alien in some respects. It seems like you are focusing on errors you may make that would cause you to receive a low payoff in the RL game. As you become smarter, I expect you to make fewer such errors. I certainly don’t expect you to predictably make more of them.
(I understand that this is a bit subtle, because as you get smarter the problem also may get harder, since your plans will e.g. be subject to more intense scrutiny and to more clever counterproposals. But that doesn’t seem prone to lead to the kinds of errors you discuss.)
Paraphrasing, I think you’re saying that, if the reinforcement game setup continues to work, you expect to make fewer errors as you get smarter. And the only way getting smarter hurts you is if it breaks the game (by enabling you to fall into traps faster than you can notice and avoid them).
Is that right?
I don’t see how this could predictably happen without something else going wrong first. I agree and acknowledged that the question-learning solution is hard-to-test, so let’s focus on the RL approach. (Though I also don’t expect this to happen for the question-answering solution.) In this comment I’ll focus on the misunderstanding case.
So in the future, you expect to predictably make a decision which the aliens would consider catastrophically bad. It seems to me like:
If the solution would really be considered catastrophically bad, and it chosen for evaluation, then it will receive a very low payoff—unless the scheme fails in some other way that we have not yet discussed.
So you would only make such mistakes if you thought that you would receive enough expected benefit from more aggressive decisions that it offsets this predictable possibility of a low payoff from catastrophic error.
But if you took more conservative actions, you could justify those actions (when they were evaluated) by explaining the predicted possibility of a catastrophic outcome. Unless something else has gone wrong, the aliens care more about averting this prospect of a bad outcome than saving time by you being more aggressive, so they shouldn’t penalize you for this.
So if you behave aggressively even at the risk of a catastrophic error, it seems like one of the following must have gone wrong:
In fact the aliens wouldn’t be able to detect a catastrophic error during evaluation.
The conservative policy is actually worse than the aggressive policy in expectation, based on the considered judgment of the aliens.
The aliens wouldn’t accept the justification for conservatism, based on a correct argument that its costs are outweighed by the possibility for error.
This argument is wrong, or else it’s right but you wouldn’t recognize this argument or something like it.
Any of these could happen. 1 and 3 seem like they lead to more straightforward problems with the scheme, so would be worthwhile to explore on other grounds. 2 doesn’t seem likely to me, unless we are dealing with a very minor catastrophe. But I am open to arguing about it. The basic question seems to be how tough it is to ask the aliens enough questions to avoid doing anything terrible.
The examples you give in the parallel thread don’t seem they could present a big problem; you can ask the alien a modest number of questions like “how do you feel about the tradeoff between the world being destroyed and you controlling less of it?” And you can help to the maximum possible extent in answering them. Of course the alien won’t have perfect answers, but their situation seems better than the situation prior to building such an AI, when they were also making such tradeoffs imperfectly (presumably even more imperfectly, unless you are completely unhelpful to the aliens for answering such questions). And there don’t seem to be many plans where the cost of implementing the plan is greater than the cost of consulting the alien about how it feels about possible consequences of that plan.
Of course you can also get this information in other ways (e.g. look at writings and past behavior of the aliens) or ask more open-ended questions like “what are the most likely way things could go wrong, given what I expect to do over the next week,” or pursue compromise solutions that the aliens are unlikely to consider too objectionable.
ETA: actually it’s fine if the catastrophic plan is not evaluated badly, all of the work can be done in the step where the aliens prefer conservative plans to aggressive ones in general, after you explain the possibility of a catastrophic error.
What if this is true, because other aliens (people) have similar AIs, so the aggressive policy is considered better, in a PD-like game theoretic sense, but it would have been better for everyone if nobody had built such AIs?
With any of the black-box designs I’ve seen, I would be very reluctant to push the button that would potentially give it superhuman capabilities, even if we have theoretical reasons to think that it would be safe, and we’ve fixed all the problems we’ve detected while testing at lower levels of computing power. There are too many things that could go wrong with such theoretical reasoning, and easily many more flaws that won’t become apparent until the system becomes smarter. Basically the only reason to do it would be time pressure, due to the AI race or something else. (With other kinds of FAI designs, i.e., normative and white-box metaphilosophical, it seems that we can eventually be more confident about their safety but they are harder to design and implement in the first place, so we should wait for them if we have the option to.) Do you agree with this?
In some sense I agree. If there were no time pressure, then we would want to proceed in only the very safest way possible, which would not involve AI at all. My best guess would be to do a lot of philosphical and strategic thinking as unmodified and essentially unaided humans, perhaps for a very very long time. After that you might decide on a single, maximally inoffensive computional aid, and then repeat. But this seems like quite an alien scenario!
I am not sold that in milder cases you would be much better off with e.g. a normative AI than black box designs. Why is it less error prone? It seems like normative AI must perform well across a wide range of unanticipated environments, to a much greater extent than with black box designs, and with clearer catastrophic consequences for failure. It seems like you would want to do something that remains under the control of something as close to a human as possible, for as long as possible.
In some sense the black box approach is clearly more dangerous (ignoring time limits), since it doesn’t really get you closer to your goal. We will probably have to solve these other problems eventually. The black box metaphilosophical AI is really more like a form of cognitive enhancement. But it seems like enhancement is basically the right thing to do for now, even if we make the time crunch quite a bit milder.
I agree that you need to rely extensively on empirical evidence to gain confidence in your system (though I think this is probably true for other plausible approaches as well).
I don’t fully understand the argument against using such empirical evidence, or rather I don’t see how to make the argument go through without the prospect of a treacherous turn, which we haven’t addressed at all.
I agree that there are kinds of problems which do not arise when testing with weak systems, and these problems cannot be detected empirically until you are working with stronger systems. The main question is whether you can debug these problems while working with stronger systems.
For concreteness, imagine a bug in your system for rendering answers; a malicious answer can compromise the computer and assign the answerer a high reward. Let’s be pessimistic, and assume that the easiest way to compromise the computer is to create a devourer that will be released into the world, that is as smart as the answerer itself, and whose preferences are maximally destructive.
(This is going to repeat ideas from the great-uncle...)
It seems like this possibility is not catastrophic.
--before
The attack can probably be noticed by the only slightly weaker agents which helped you design the current system. The only problem is if the attack cannot be noticed by any agent who isn’t clever enough to realize that such an attack is possible.
You can sanitize the agent’s answers using a similarly intelligent agent with more restricted outputs, e.g. one who just says “possible exploit” or “not a possible exploit.”
The execution environment itself can make extensive use of similarly smart agents, and is therefore less likely to be vulnerable to an attack even if inexpertly designed.
---after
After the attack, the devourer is not much smarter than the other AI’s in the world (for whom, by hypothesis, the control system has been working OK so far). So it doesn’t seem like it should do much damage before being contained.
The basic issue is that agent Z is preceded by nearly-as-smart agent Y, and so the prospects for danger are not that great if agent Z (and only agent Z) fails. The argument for catastrophic risks seems to come from the big disparity in power between the AI being controlled and the human overseer.
The concern seems to rest on the prospect of a treacherous turn: if agent Z is clever enough, then agent Y will also defect rather than helping you contain agent Z, even if in earlier tests agent Y seemed to be benign.
But I don’t see why this would happen for the RL system, and it seems like that’s the crux of the argument. I agree that it would happen for the question-answering system (I was the first to admit that the question-answering system was hard to test).
You may have other kinds of difficulties in mind, but all of the ones that I can think of seem to rest on a treacherous turn or something similar. Is there some other reason to expect failure to be catastrophic?
I’m not pointing out any specific reasons, but just expect that in general, failures when dealing with large amounts of computing power can easily be catastrophic. You have theoretical arguments for why they won’t be, given a specific design, but again I am skeptical of such arguments in general.
I agree there is some risk that cannot be removed with either theoretical arguments or empirical evidence. But why is it greater for this kind of AI than any other, and in particular than white-box metaphilosophical or normative AI?
Normative AI seems like by far the worst, since:
it generally demonstrates a treacherous turn if you make an error,
it must work correctly across a range of unanticipated environments
So in that case we have particular concrete reasons to think that emprical testing won’t be adequate, in addition to the general concern that empirical testing and theoretical argument is never sufficient. To me, white box metaphilosophical AI seems somewhere in between.
(One complaint is that I just haven’t given an especially strong theoretical argument. I agree with that, and I hope that whatever systems people actually use, they are backed by something more convicing. But the current state of the argument seems like it can’t point in any direction other than in favor of black box designs, since we don’t yet have any arguments at all that any other kind of system could work.)
It seems like the question is: “How much more productive is the aggressive policy?”
It looks to me like the answer is “Maybe it’s 1% cheaper or something, though probably less.” In this case, it doesn’t seem like the AI itself is introducing (much of) a PD situation, and the coordination problem can probably be solved.
I don’t know whether you are disagreeing about the likley cost of the aggressive policy, or the consequences of slight productivity advantages for the aggressive policy. I discuss this issue a bit here, a post I wrote a few days ago but just got around to posting.
Of course there may be orthogonal reasons that the AI faces PD-like problems, e.g. it is possible to expand in an undesirably destructive way by building an unrelated and dangerous technology. Then either:
The alien user would want to coordinate in the prisoner’s dilemma. In this case, the AI will coordinate as well (unless it makes an error leading to a lower reward).
The alien user doesn’t want to coordinate in the prisoner’s dilemma. But in this case, the problem isn’t with the AI at all. If the users hadn’t built AI they would have faced the same problem.
I don’t know which of these you have in mind. My guess is you are thinking of (2) if anything, but this doesn’t really seem like an issue to do with AI control. Yes, the AI may have a differential effect on e.g. the availability of destructive tech and our ability to coordinate, and yes, we should try encourage differential progress in AI capabilities just like we want to encourage differential progress in society’s capabilities more broadly. But I don’t see how any solution to the AI control problem is going to address that issue, nor does it seem especially concerning when compared to the AI control problem.
Maybe we have different things in mind for “aggressive policy”. I was think something like “give the AI enough computing power to achieve superhuman intelligence so it can hopefully build a full-fledged FAI for the user” vs the “conservative policy” of “keep the AI at its current level where it seems safe, and find another way to build an FAI”.
A separate but related issue is that it appears such an AI can either be a relatively safe or unsafe AI, depending on the disposition of the overseer (since an overseer less concerned with safety would be more likely to approve of potentially unsafe modifications to the AI). In a sidenote of the linked article, you wrote about why unsafe but more efficient AI projects won’t overtake the safer AI projects in AI research:
But how will the safe projects exclude the unsafe projects from economies of scale and favorable terms of trade, if the unsafe projects are using the same basic design but just have overseers who care more about capability than safety?
Controlling the distribution of AI technology is one way to make someone’s life harder, but it’s not the only way. If we imagine a productivity gap as small as 1%, it seems like it doesn’t take much to close it.
(Disclaimer: this is unusually wild speculation; nothing I say is likely to be true, but hopefully it gives the flavor.)
If unsafe projects perfectly pretend to be safe projects, then they aren’t being more efficient. So it seems like we can assume that they are observably different from safe projects. (For example, there can’t just be complexity-loving humans who oversee projects exactly as if they had normal values; they need to skimp on oversight in order to actually be more efficient. Or else they need to differ in some other way...) If they are observably different, then possible measures include:
Even very small tax rates coupled with redistribution that is even marginally better-directed at safe projects (e.g. that goes to humans)
Regulatory measures to force everyone to incur the overhead, or most of the overhead, of being safe, e.g. lower bounds on human involvement.
Today many trades involve trust and understanding between the parties (e.g. if I go work for you). Probably some trades will retain this character. Honest people may be less happy to trade with those they expect to be malicious. I doubt this would be a huge factor, but 1% seems tiny.
Even in this scenario it may be easy to make technology which is architecturally harder to use by unsafe projects. E.g., it’s not clear whether the end user is the only overseer, or whether some oversight can be retained by law enforcement or the designers or someone else.
Of course unsafe projects can go to greater lengths in order to avoid these issues, for example by moving to friendlier jurisdictions or operating a black market in unsafe technology. But as these measures become more extreme they become increasingly easy to identify. If unsafe jurisdictions and black markets have only a few percent of the population of the world, then it’s easy to see how they could be less efficient.
(I’d also expect e.g. unsafe jurisdictions to quickly cave under international pressure, if the rents they could extract were a fraction of a percent of total productivity. They could easily be paid off, and if they didn’t want to be paid off, they would not be militarily competitive.)
All of these measures become increasingly implausible at large productivity differentials. And I doubt that any of these particular foreseeable measures will be important. But overall, given that there are economies of scale, I find it very likely that the majority can win. The main question is whether they care enough to.
Normally I am on the other side of a discussion similar to this one, but involving much larger posited productivity gaps and a more confident claim (things are so likely to be OK that it’s not worth worrying about safety). Sorry if you were imagining a very much larger gap, so that this discussion isn’t helpful. And I do agree that there is a real possibility that things won’t be OK, even for small productivity gaps, but I feel like it’s more likely than not to be OK.
Also note that at a 1% gap, we can basically wait it out. If 10% of the world starts out malicious, then by the time the economy has grown 1000x, then 11% of the world is malicious, and it seems implausible that the AI situation won’t change during that time—certainly contemporary thinking about AI will be obsoleted, in an economic period as long as 0-2015AD. (The discussion of social coordination is more important in the case where there are larger efficiency gaps, and hence probably larger differences in how the projects look and what technology they need.)
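As a rough sanity check on that arithmetic, here is a minimal sketch (in Python, with assumed numbers) of a toy model in which the unsafe sector simply compounds a flat 1% higher growth rate, ignoring trade, redistribution, and the profit-amplification issue noted in the ETA below:

```python
# Toy model of the "wait it out" arithmetic: the unsafe sector compounds a
# flat 1% higher growth rate than the safe sector, with no trade effects,
# defections, or redistribution between the two sectors (assumed numbers).

def unsafe_share_after_growth(initial_share, growth_factor, advantage=0.01):
    """Fraction of the economy held by the unsafe sector after the safe
    sector has grown by `growth_factor`."""
    safe = (1 - initial_share) * growth_factor
    unsafe = initial_share * growth_factor ** (1 + advantage)
    return unsafe / (safe + unsafe)

# Starting from a 10% unsafe share, with 1000x growth of the safe sector:
print(unsafe_share_after_growth(0.10, 1000))  # ~0.106, i.e. roughly 11%
```

Under these toy assumptions a 10% unsafe share grows to only about 10.6%, in line with the rough 11% figure above.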
ETA: Really the situation is not so straightforward, since 1% more productivity leads to more than 1% more profit; overall this issue really seems too complicated for this kind of vague theoretical speculation to be meaningfully accurate, but I hope I’ve given the basic flavor of my thinking.
And finally, I intended 1% as a relatively conservative estimate. I don’t see any particular reason you need to have so much waste, and I wouldn’t be surprised if it ends up much lower, if future people end up pursuing some strategy along these lines.
1% seems really low to me. Suppose for example that the AI invents a modification to itself, which is meant to improve its performance. A cautious overseer might demand an explanation of the improvement and why it’s safe, in terms that he can understand, while an incautious overseer might be willing to just approve the modification right away and start using it. It seems to me that the cost of developing an understandable and convincing explanation of the improvement and its safety, and then waiting for the overseer to process it, could easily be greater than 1% (or even 100%) of the cost of inventing the improvement itself.
Also, caution/safety is a matter of degree, and it seems hard to define what “unsafe” means, for the purpose of imposing a penalty on all unsafe projects. (As you said, it would be even safer to not use any machine aid at all until much later. Where/how do you draw the line between “safe enough” and “unsafe”?) It also seems hard to tell which projects are skimping on safety from the outside (you can observe that they are advancing faster than expected given their resource base, but how do you rule out other explanations for that?), while from the inside, the workers “in the know” can probably find all kinds of justifications for doing what they are doing.
Yes, I was about to point this out. In many areas the difference between the best and the second best is already the difference between making a profit and making a loss. This will be even more true for AI, since if AIs can be copied, there won’t be many fields where it makes economic sense to hire a 99% efficient AI instead of a copy of a 100% efficient AI.
Without this issue the “wait it out” solution would seem almost completely satisfactory to me for a gap as small as 1% (or even 5%). Most of my comment was about making unsafe projects less efficient, which (I think) addresses this point.
Also note that a 1% productivity benefit only leads to a much larger than 1% profit advantage if you are trading extensively with the rest of the world, so that most of your revenue is from sales (and the rest of the world can tell you are the most efficient, or at least that you are selling a lot of stuff cheaply). It’s much easier to regulate or make life harder for someone who is trading extensively with the rest of the world than for someone who is operating in secret, so this seems to strengthen the arguments given in the body of my comment.
The idea is just to make the relative cost of safety as low as possible. In the linked post I gave a quantification of safety. So let’s consider some 99.9% safe project, and the level of oversight it requires.
If this oversight is expensive because it involves using some resource (like involving human overseers, or periodically pausing as you wait on an overseer, or whatever), then it would be sufficient to require each project to use that resource, or to provide that resource for free to any project using tax dollars, and so on.
Alternatively, if there is some kind of oversight (e.g. periodic high-intensity audits, or AI oversight of AI projects, or mandated close involvement of human auditors) then the goal would just be to ensure that the price of evading detection eats up the efficiency benefits of unsafety. This looks pretty plausible to me, but you may be more skeptical about the feasibility of oversight.
In this proposal:
A cautious overseer demands such an argument with very small probability. I’ll write soon about just how small I think this probability can safely be; in the original post I suggested 1%, but I think it could be much lower. If the check is only done with probability 1/10,000, then it’s fine if the cost is 10,000% of the cost of implementing the project (see the sketch after this list).
The human overseer has very little role in the early stages of the evaluation, and certainly she doesn’t have to understand the whole proposal. In my proposal as written she has to relay questions between different AIs, but of course you could automate that completely, so that the first stages of evaluation are just done by other AIs (whose evaluations are accountable to other AIs, whose evaluations are accountable to other AIs… whose evaluations are accountable to the overseer). At worst, the overseer’s role is similar to the arbitrator in this scheme, though she has many additional techniques at her disposal.
If the world is moving fast, the main problem is probably the latency of the human evaluation. But there are a bunch of plausible-seeming techniques for getting to low latency. I hope to write about this soon as well.
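To make the expected-overhead arithmetic in the first point above concrete, here is a minimal sketch (in Python, using the numbers from that point; the function name is just illustrative). If full evaluations are triggered randomly, the expected cost per action is simply the audit probability times the cost of a full evaluation:

```python
# Expected cost of randomized evaluation, as a fraction of the cost of the
# action itself (illustrative numbers from the discussion above).

def expected_overhead(audit_probability, evaluation_cost_multiple):
    """Expected extra cost per action, relative to the action's own cost."""
    return audit_probability * evaluation_cost_multiple

# An evaluation costing 100x the action (10,000% of its cost), triggered
# with probability 1/10,000, adds only ~1% expected overhead.
print(expected_overhead(1 / 10_000, 100))  # 0.01
```

The remaining cost, as noted above, is mainly the latency of human evaluation rather than its expected price.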
ETA: What I should have said is that an overseer requests explanations very rarely during normal, intended operation. But this relies on the AI’s ability to find a policy which the overseer will predictably approve of.
Perhaps your point is that a more cautious overseer should request explanations more often, owing to the AI’s limited ability to predict the overseer’s reaction. But if so, we are going around in circles. What is the nature of these improvements, such that whether they are a good idea or not depends in such a detailed way on the values or philosophical views of the overseer? Why can’t one AI build an improved AI which also follows an innocuous policy like “don’t do anything terrible; acquire resources; let the user control those resources”?
Even if the actual biological overseer doesn’t evaluate a proposal, the AI still has to predict what the overseer’s judgment would be if he or she were to evaluate it, right? That involves modeling the overseer and the other sub-AIs (each of which is also modeling the overseer and other sub-AIs) and coming up with arguments that can convince the modeled overseer/AIs, etc. I don’t see how that could be only 1% less efficient, compared to an incautious overseer who the AI knows probably won’t demand such arguments or consult lots of sub-AIs and would be willing to try out any improvements that aren’t obviously dangerous.
Separately, the “innocuous” AI you talked about suffers from two efficiency-sapping problems: the AI isn’t sure what the overseer’s values are, and the overseer isn’t sure what his own values are. This forces the AI to avoid certain actions that may be highly efficient but potentially ethically disastrous. For example suppose the overseer isn’t sure whether negative feedback received by reinforcement-based agents has negative moral value, so the AI has to avoid building lots of reinforcement-based subagents. Another example might be that the AI faces a lot of everyday ethical questions in the course of acquiring resources, and has to take the latency hit of asking the overseer about them every time. These efficiency-sapping problems are worse for overseers with more complex ethical views, and more uncertain ethical views.
If there is competition, everyone has strong incentives to quickly build “full-fledged FAIs” which can solve these ethical problems and know exactly what they should and shouldn’t do. People who are less cautious will again have an efficiency advantage while doing this. E.g., they might be fine with building a standard utility-maximizing AI based on a crude model of their current understanding of ethics. I do not see how mandatory oversights or other social techniques can prevent this outcome, if you’re imagining a world where your AI design is being used widely. Someone could make a copy of an existing AI based on your design, change the code or configuration files to make themselves the overseer and remove the mandatory oversights, and then ask the AI to make a “full-fledged FAI” for them, and if they happen to be of the incautious type, this will probably result in the kind of crude normative AI mentioned above (or worse, if they approve a bunch of “improvements” that end up subverting their intentions altogether).
Re paragraph 3: it seems like these are mostly considerations that might strengthen your conclusions if we granted that there was a big productivity difference between my design and “a standard utility-maximizing AI based on a crude model of their current understanding of ethics.” But I would already be happy to classify a large productivity loss as a failure, so let’s just concentrate on the claimed productivity loss.
These incentives only operate if there is a big productivity difference.
Beyond that, if the kinds of issues people run into are “the AI faces a lot of everyday ethical questions in the course of acquiring resources,” then it really seems like what you need is a not-catastrophically-wrong model of human morality, which would probably just be built in mundane ways. I don’t see a strong argument that this would require lots of impressive conceptual progress, rather than being similar in kind to building a predictive model of anything else. But I suspect this is just a reflection of the disagreement about paragraph 2, which we should flesh out in the sibling.
This doesn’t seem like a very general argument against the possibility of mandatory oversight or technological handicapping, and the measures you describe seem like strawmen. I agree that whatever kind of oversight you employ, it will be possible to subvert it, whatever tax you charge it will be possible to evade it, and so on. But doing so will often come with a cost (as it does today), and it just doesn’t seem that hard to get that cost up to a 1% loss (say). We could talk more about the particular measures that could be taken for oversight; I’m sure we can both imagine many regulatory and technological approaches that would be more annoying to sidestep than an entry in a configuration file, but I suspect our disagreement comes from us imagining different productivity gaps.
The prospect of someone designing their own AI, which is very architecturally different from the rest of the world, just doesn’t seem especially troubling, unless you imagine that the rest of the world is using a significantly handicapped design. (See the first sentence of this reply.)
Re paragraph 1:
I don’t think this is necessary. Let’s go back to the alien analogy. Let’s assume that you have an action that effectively implements U = “acquire resources; don’t be evil; leave the user in control.” I claimed that pursuing U is not much harder than acquiring resources on its own; for the moment can we grant that, since it seems orthogonal to the issues in this paragraph?
So you have this good action, and you are deciding what to do. I claim that you should just do it, with no extra computation. Then, if it gets evaluated, you will have to think about rhetorical strategies to use when convincing the overseer that it’s a good idea. Also, a copy of you will have to think about how to convince the overseer that it’s a bad idea. And the overseer will need to actually think about the issue. And so on.
There are some issues that you’ll need to think about before you commit to the action. For example, if there is a knockdown argument that this action is a bad idea, then you shouldn’t do it. But that thinking seems to be a necessary part of maximizing U, not an extra step (unless the arguing system is broken).
You might be able to do even better by thinking through all of the arguments in advance, and taking an action which is not merely good but which will fare especially well during an evaluation. But if the extra computational costs of doing so are significant, this has to be weighed against the benefits of actually finding a better plan. It seems like this basically comes down to the question “If you take the very most efficient plan, and we perform some elaborate procedure for arguing, can you actually ‘convince’ the overseer that it’s a good plan?”
If the answer is “no” then you are screwed anyway, before we even start thinking about computational overhead: you simply can’t take the best plan, because you wouldn’t be able to receive a good rating if you do. And if the answer is “yes” then the problem seems to go away. The justifiability of good plans does seem to be a key question about the feasibility of this scheme, which I am happy to discuss if that is the underlying objection you have in mind.
Re paragraph 2:
If the most efficient actions are ethically disastrous, then we have a fundamental problem which seems to me to be unrelated to AI safety, and which no AI control proposal will address. For example, if the most efficient strategy is to build a virus that kills everyone but you, and an AI is considering this strategy but has to reject it because it is unethical, then we are just out of luck. We could call this a problem with “AI,” but it’s really a problem with biotechnology.
If a certain kind of reinforcement learning is especially efficient but morally unacceptable, then that seems to be the same situation. What are we supposed to do, other than either accept the moral cost or adopt a good enough social solution to overcome the efficiency gap? What kind of solution might you hope to find that would make this kind of problem go away?
If the efficient actions merely might be ethically disastrous, then I guess the cost is supposed to be the time required to clarify the overseer’s values. Which brings us to:
The question is just how many distinct questions of this form there are, and how important they are to the AI’s plans. If there were merely a billion such questions it doesn’t seem like a big deal at all (though then a significant occupation of humans would be answering moral questions).
Even that strikes me as completely implausible given our experience so far (combined with my inability to see many future examples). If I were the user, and people were trying to optimize values using the range of policies available today, then it seems like they would have had to ask me no more than a dozen or so questions to get things basically right (i.e. realizing much more than 99% of the potential value from my perspective). So this seems to require moral problems to proliferate at a much faster rate than technological problems.
Do you disagree about the importance of hard ethical questions in the situation today (e.g. I am implicitly overlooking many important issues because I’m not used to dealing with an AI), or do you just expect more proliferation in the future?
Also, the problem of predicting human moral judgments doesn’t seem to be radically harder than the problem of e.g. negotiating with humans. I guess this is just another angle on “how many distinct moral questions do you have to answer?” since the real question is how much you can generalize from each answer. I don’t feel like there are that many hard-to-predict parameters before everything reduces to easy-to-predict consequences.
Your examples of possible mistakes seemed to involve not knowing how the alien would feel about particular tradeoffs. This doesn’t seem related to how much computational power you have, except insofar as having more power might lead you to believe that it is safe to try and figure out what the alien thinks from first principles. But that’s not a necessary consequence of having more computing power, and I gave an argument that more computing power shouldn’t predictably lead to trouble.
Why do you think that more computing power requires a strategy which is “aggressive” in the sense of having a higher probability of catastrophic failure?
You might expect that building “full-fledged FAI” requires knowing a lot about the alien, and you won’t be able to figure all of that out in advance of building it. But again, I don’t understand why you can’t build an AI that implements a conservative strategy, in the sense of being quick to consult the user and unlikely to make a catastrophic error. So it seems like this just begs the question about the relative efficacy of conservative vs. aggressive strategies.