AI Safety bounty for practical homomorphic encryption
Update: Paul Christiano says this is not a good idea.
This is a random idea I had yesterday while browsing AI risk posts on LW. I wanted to get feedback on it.
Paul Christiano suggests boxing AI via fully homomorphic encryption (FHE). This is intended to reduce the likelihood that a boxed advanced AI can exploit bugs and gain root access to the machine it is running on. Cyberattacks and human persuasion are considered two of the major routes for an AI to escape its box, and this suggestion addresses both of them.
FHE is currently exceedingly impractical to run. I suggest a monetary bounty for work on making FHE more practical to run, and also for theoretical work that proves the soundness of FHE.
How does FHE actually work?
I would highly recommend reading Paul Christiano’s post before reading further.
FHE essentially allows you to compute functions over encrypted inputs. Suppose you want another person to compute f(x) for you, but you don't want to reveal the value x to them. You encrypt x using public-key cryptography to obtain enc(x), and give enc(x) and f to the other person; you alone keep the private key. The other person can use FHE to compute enc(f(x)) and return that value to you. You can now use your private key to decrypt it and obtain f(x).
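To make this flow concrete, here is a minimal toy sketch in Python. It uses the Paillier cryptosystem, which is only additively homomorphic, as a stand-in for full FHE, with deliberately tiny, insecure parameters; it only illustrates the encrypt, evaluate-on-ciphertexts, decrypt shape of the protocol, not a usable scheme.

```python
# Toy illustration of "compute on encrypted data" using the Paillier
# cryptosystem. Paillier is only additively homomorphic (real FHE supports
# arbitrary circuits), and the tiny hardcoded primes are insecure; this is
# purely to show the shape of the protocol.
from math import gcd

p, q = 293, 433                      # toy primes; real deployments use much larger ones
n, n_sq = p * q, (p * q) ** 2
g = n + 1
lam = (p - 1) * (q - 1) // gcd(p - 1, q - 1)   # lcm(p-1, q-1)

def L(u):
    return (u - 1) // n

mu = pow(L(pow(g, lam, n_sq)), -1, n)          # precomputed decryption constant

def encrypt(m, r=17):                          # r should be random; fixed here for brevity
    return (pow(g, m, n_sq) * pow(r, n, n_sq)) % n_sq

def decrypt(c):
    return (L(pow(c, lam, n_sq)) * mu) % n

# The untrusted party computes f(x, y) = x + y directly on the ciphertexts:
# multiplying Paillier ciphertexts adds the underlying plaintexts.
cx, cy = encrypt(20), encrypt(22)
c_sum = (cx * cy) % n_sq
assert decrypt(c_sum) == 42                    # only the key holder learns the result
```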
In this particular situation, f contains the AI architecture and its training process. You can ask a boxed machine (which could be airgapped and headless) to evaluate f homomorphically until it obtains enc(f(x)). It can then use FHE to check conditions on the value of enc(f(x)), such as pre-determined conditions on whether the AI discovered some "safe plan". You can then impose severe restrictions on the output that is saved to disk; for instance, you may want only a few bits of output to be saved, a restriction that could even be enforced in hardware. This restriction ensures the output is not hyper-persuasive to humans. Once the output is provided, the machine can be destroyed or disconnected, and we can read the saved values.
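Purely as an illustration of the control flow (and not of any real cryptography), a schematic sketch of this boxed-evaluation setup might look like the following; the fhe_* functions and the MAX_OUTPUT_BITS cap are hypothetical placeholders, stubbed out so the sketch runs.

```python
# Schematic control-flow sketch of the boxed-evaluation idea above.
# The fhe_* functions are hypothetical placeholders operating on mock
# "ciphertext" tuples; nothing here is real cryptography.
MAX_OUTPUT_BITS = 8        # assumed hardware-enforced cap on bits leaving the box

def fhe_eval(f, enc_x):
    # Placeholder for homomorphic evaluation of f on an encrypted input.
    tag, x = enc_x
    return (tag, f(x))

def fhe_check_safe_plan(enc_y):
    # Placeholder for a homomorphically evaluated, pre-agreed predicate.
    tag, y = enc_y
    return (tag, int(y % 2 == 0))   # pretend: 1 bit meaning "safe plan found"

def save_to_disk(bits):
    # The only channel out of the box: a hardware-limited handful of bits.
    assert len(bits) <= MAX_OUTPUT_BITS, "output restriction violated"
    return bits

enc_x = ("enc", 123)                              # enc(x): encrypted seed/data
enc_fx = fhe_eval(lambda x: x * 3, enc_x)         # enc(f(x)): run training under FHE
enc_flag = fhe_check_safe_plan(enc_fx)            # check condition on enc(f(x))
saved = save_to_disk([enc_flag[1]])               # only a few bits are ever persisted
# The machine can now be destroyed; the key holder decrypts the saved bits offline.
```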
There may also be uses for FHE in boxing AI (or digital minds) that differ from the exact situation described above. For instance, FHE has also been discussed in the context of secure homes for digital people. These other situations may also be more likely to occur, but I haven't read enough on this, so I am just speculating.
Suggestion: Bounties for FHE
I propose broadly two bounties (or classes of bounties):
Bounty for inventing an FHE scheme that is practical in terms of time and memory used, and data that can be processed.
Bounty for provable guarantees on the security of an FHE scheme that do not rely on open questions in complexity theory, such as the difficulty of lattice problems.
Why a bounty on this problem specifically?
Some reasons:
Does not involve philosophical vagueness, unlike many other suggestions for AI alignment bounties. Success criteria for awarding the bounty can be objectively defined, and everyone can agree on them.
Can possibly have a tangible impact on the ultimate safety of advanced AI systems in deployment.
People aiming for the bounty don't need to know about, care about, or agree with the arguments for taking AI risk seriously. This increases the target audience for the bounty beyond the AI risk community.
People attempting this bounty may not have large opportunity costs in terms of the impact they could have elsewhere. I'm not sure about this part, but as far as I know, while AI safety work needs more infosec professionals, there aren't many opportunities for pure cryptographers or number theorists. I wonder if this could be a sufficiently high-impact opportunity for the latter that working on it, as opposed to some other impactful job, makes sense.
Success criteria for awarding bounty
Further details on how to define these criteria can be discussed. For instance, defining practical FHE in terms of worst-case big-O complexity has the upside that it is easier for cryptographers to prove results against, but the downside that a winning scheme might have good worst-case complexity yet still be impractical to implement.
Incentive format of the bounty
This bounty suffers from the problems that bounties in general have. For instance, people who wish to dedicate time and resources to a problem with very low odds of success may want funding when they start, not after they succeed.
Regular funding for cryptographers would of course be better than a one-off bounty, although I do not feel knowledgeable enough to comment on whether it makes sense to do this given EA's current cost-effectiveness bar. (And a funder who is not an EA could of course evaluate this opportunity differently than EAs would.)
Also, many of the people in the target audience for this bounty will likely be academics, and hence less influenced by money. That still doesn't mean they wouldn't care at all. They may also care about the prestige attached to the bounty, which is something the bounty administrator could separately work to maintain.
I am not qualified to offer an opinion on the exact size of the bounty. Quantifying the impact of practical FHE on AI x-risk would be a first step, but there may also be other considerations. For instance, one may want to understand how motivation maps to bounty size: it is possible that a $20M bounty and a $200M bounty do not lead to meaningful differences in the motivation of the average cryptographer working on this problem, or they may. Larger bounty sizes may also cause people to talk more about the bounty, including in circles outside cryptography, which could have the additional effect of attracting people to work on AI safety.
The bounty could be offered in stock or bonds instead of cash, to reduce the opportunity costs for funders. There is still an opportunity cost, however, as this money has to be set aside until the bounty is claimed or cancelled.
Red-teaming: Some more reasons not to offer this bounty
FHE may not actually reduce the attack surface that much, or there may be technical details I've missed that invalidate the whole approach (FHE for boxing AI) discussed here.
The second problem (FHE security that does not rely on open problems in complexity theory) may be impractically hard to solve, and if so, directing people to work on it may be suboptimal impact-wise.
Cryptographers may find the motivation of the funders for proposing this bounty alien or even low-status. More generally, all the downsides of increasing publicity of AI risk arguments among an audience that does not understand AI apply here.
It may be better to motivate cryptographers to work on other infosec problems that are more tractable and relevant to AI safety.
Even practical FHE incurs a significant alignment tax. For instance, an FHE scheme that causes only a 100x or even a 3x slowdown in computation would be a major breakthrough and could be considered "practical". But if AI training is compute-limited, requiring a leading lab to spend 3x more compute than it otherwise would in order to train AI safely may still be impractical.
I can’t immediately think of more downsides, although I suspect more exist.
Feedback
I have not spent enough time on this idea or gotten enough feedback to actually propose this as a bounty yet. I am primarily interested in feedback in this stage. Do offer your feedback!
I’m relatively skeptical of the importance of this kind of boxing, because I don’t think side-channel attacks by incredibly sophisticated AI are likely to be an important part of the problem. Instead I expect AI systems to be broadly deployed in ways that let them cause lots of trouble (or even used internally within the AI firm in ways that cause lots of trouble) well before an AI used in an incredibly limited way would be transformative.
If doing this kind of boxing, it seems very unlikely that the overhead of FHE would be acceptable. Even a factor of 10 seems like pushing it a whole lot. I expect the action would be in much more mundane forms of protection against side channels with orders of magnitude lower cost. (And e.g. conservatism about what you do with AI outputs and how much you allow your AI to pursue incomprehensible plans.)
I think unconditional security for FHE is wildly out of reach; that's more ambitious than P vs NP, which I'd hope is significantly harder than alignment. I think that "FHE that is applicable to massive compute-intensive applications" is also extremely difficult, but less totally crazy.
Overall my guess would be that random tangentially relevant questions have better cost/benefit (e.g.: what’s the complexity of games between two teams of non-communicating provers?), but that hopefully we could give prizes for stuff that’s much more on target than either.
I'd guess that random investment in security or cryptography, even in ways totally undirected at AI, is likely also better bang for your buck.
Also it’s worth noting that the LW post in question is quite old; I was mostly amused by the thought of encryption protecting us from the data instead of the other way around, but this was before I’d really gotten into AI safety.
Thank you for taking the time to reply!
Is there anywhere I can read more on your views on this part? And do you feel most alignment researchers would agree with your views? (That a highly boxed AI breaking out of its box is not one of the “main” problems)
I don’t mean to question your views, it’s just that as I’m new to AI safety I really want to understand a lot of different views of people in the space.
Oh okay, I see! I thought bounties were fairly low cost because they only require committing money, not spending it, but if what you're saying holds even so, then maybe I need to study these topics more.
This is fair; it's also why I was unsure whether this idea of using FHE is still seriously considered.
Rest of your points make sense. Thanks again.
This and this seem relevant.
Thanks.
I’d seen the second link before but the first was useful!
I agree with DavidHolmes that this doesn't seem particularly worthwhile, since there's already a profit motive and motivated researchers working on it. Also, I think there are much more technologically mature AI boxing infrastructures that could be built today without the need for additional inventions. For instance, data diodes leading into Faraday cages with isolated test hardware that gets fully reset following each test.
I think that the second goal ("provable guarantees on the security of an FHE scheme that do not rely on open questions in complexity theory") is far out of reach at present (in particular, to the extent that there does not exist a bounty which would affect people's likelihood of working on it). It is hard to do much in crypto without assuming some kind of problem to be computationally difficult, and there are very few results proving that a given problem is computationally difficult in an absolute sense (rather than just 'at least as hard as some other problem we believe to be hard'). Cf. P vs NP. Or perhaps I misunderstand your meaning; are you OK with assuming e.g. integer factorisation to be computationally hard?
Personally I also don’t think this is so important; if we could solve alignment modulo assuming e.g. integer factorisation (or some suitable lattice problem) is hard, then I think we should be very happy…
More generally, I'm a bit sceptical of the effectiveness of a bounty here, because the commercial applications of FHE are already so great.
About 10 years ago, when I last talked to people in the area about this, I got the impression that FHE schemes were generally expected to be somewhat less secure than non-homomorphic schemes, simply because the extra structure gives an attacker so much more to work with. But I have no idea whether people still believe this.
Thanks for the reply!
Re: 1
Agree we should be happy, but there may still be significantly more gains to be had if we knew for sure. Suppose P != NP has a 75% probability of being true, our total AI x-risk is 10%, and 10% of that amount is conditional on whether P != NP. Then knowing with certainty whether P != NP reduces total x-risk by 0.25%, which seems valuable. (These values are arbitrary placeholders; do let me know if they're completely off-base though!)
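Spelling out the arithmetic behind that 0.25% figure (the numbers are just the placeholders above, and this is one way to read the calculation):

```python
# One possible reading of the calculation above; every number is an
# illustrative placeholder, not an estimate.
p_true = 0.75              # assumed probability that P != NP is true
total_xrisk = 0.10         # assumed total AI x-risk
frac_conditional = 0.10    # assumed fraction of that risk hinging on P vs NP
at_stake = total_xrisk * frac_conditional      # 0.01, i.e. 1 percentage point
value_of_certainty = (1 - p_true) * at_stake   # 0.0025, i.e. ~0.25 percentage points
print(value_of_certainty)
```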
Re: 2, fair! I'm curious, are researchers in a position to profit from the commercial applications?
Re: 3, thanks for letting me know.
From a policy angle, this is a great idea and highly implementable. This is for four reasons:
It fits well into the typical policymaker’s impulse to contain, control, and keep things simple.
It allows progress to continue, albeit in an environment appropriate to the risk.
Spending additional money on containment strategies does not risk a particular AI lab or country ending up "behind on AI" in the near term.
Policymakers are more comfortable starting years ahead of time on simple concepts. In this case: building the cage before you try to handle the animal.
People have already commented on why FHE might not be the most useful thing here; however, I wouldn't completely dismiss the idea of cryptography helping with AI safety. For example, I have recently been thinking about building/training agents simultaneously with a domain-specific 'filter' AI of sorts to achieve a vague form of sandboxing, such that the output of the core agent/model is gibberish without the 'filter', which serves as a key with a hard-to-solve inverse problem.