AI Safety bounty for practical homomorphic encryption

Update: Paul Christiano says this is not a good idea.

This is a random idea I had yesterday while browsing AI risk posts on LW. I'd like to get feedback on it.

Paul Christiano suggests boxing AI via fully homomorphic encryption (FHE). The aim is to reduce the likelihood that a boxed advanced AI can exploit bugs and gain root access to the machine it is running on. Cyberattacks and human persuasion are considered two of the major routes for an AI to escape its box, and this suggestion addresses both.

FHE is today exceedingly impractical to run. I suggest a monetary bounty for work on making FHE more practical to run, and also for theoretical work that proves the security of FHE.

How does FHE actually work?

I would highly recommend reading Paul Christiano’s post before reading further.

FHE essentially allows you to compute functions over encrypted inputs. Suppose you want another person to compute f(x) for you without revealing the value x to them. You encrypt x under your public key to obtain enc(x), and give enc(x) and f to the other person; only you hold the private key. Using FHE, they compute enc(f(x)) and return it to you. You then decrypt this value with your private key to obtain f(x).
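To illustrate the encrypt, evaluate, decrypt shape of this workflow, here is a toy example using unpadded ("textbook") RSA, which is multiplicatively homomorphic: the other party can multiply ciphertexts and obtain an encryption of the product without ever seeing the plaintexts. This is not FHE (FHE supports arbitrary circuits, not just multiplication), and textbook RSA with tiny parameters is not secure; the example is only a minimal sketch of the protocol shape, with function names of my own choosing.

```python
# Toy illustration of homomorphic evaluation using unpadded ("textbook") RSA.
# NOT FHE and NOT secure; tiny parameters chosen for readability only.

def keygen():
    p, q = 61, 53
    n = p * q
    phi = (p - 1) * (q - 1)
    e = 17                      # public exponent, coprime with phi
    d = pow(e, -1, phi)         # private exponent (Python 3.8+)
    return (n, e), d

def enc(m, pub):
    n, e = pub
    return pow(m, e, n)

def dec(c, pub, d):
    n, _ = pub
    return pow(c, d, n)

pub, priv = keygen()            # you keep priv; the other party gets pub only

x, y = 7, 6
cx, cy = enc(x, pub), enc(y, pub)

# The other party computes f on ciphertexts only, where f(a, b) = a * b:
# multiplying RSA ciphertexts yields an encryption of the product.
c_result = (cx * cy) % pub[0]

# You decrypt with the private key and recover f(x, y) = 42.
assert dec(c_result, pub, priv) == x * y
print(dec(c_result, pub, priv))  # 42
```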

In this particular situation, f contains the AI architecture and the training process used to train it. You ask a boxed machine (which could be airgapped and headless) to run f under FHE until it obtains enc(f(x)). It can also use FHE to check conditions on the value of enc(f(x)), such as pre-determined conditions on whether the AI discovered some "safe plan". You can then impose severe restrictions on the output that is saved to disk, for instance allowing only a few bits of output to be saved; this restriction may even be enforced in hardware, and it helps ensure the output is not hyper-persuasive to humans. Once the output is saved, the machine can be destroyed or disconnected, and the saved values can be read. A minimal mock-up of this information flow is sketched below.
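The sketch below uses a mock stand-in for FHE that performs no real encryption (ciphertexts are just wrapped plaintexts), so it only illustrates the protocol structure, not any security property. All class and function names are my own placeholders, not a real FHE library API. The point is the dataflow: the secret key stays with the operator, the boxed machine only ever handles "ciphertexts", and only a few bits of decrypted output ever leave the box.

```python
# Schematic sketch of the boxing setup, with a MOCK (insecure) FHE layer.
# Illustrates information flow only; no real encryption happens here.

from dataclasses import dataclass

@dataclass
class Ciphertext:
    value: object  # in a real scheme this would be an opaque encrypted blob

class MockFHE:
    """Stand-in for an FHE scheme; insecure, for structure only."""
    def keygen(self):
        return "public-key", "secret-key"           # placeholders
    def encrypt(self, pk, x):
        return Ciphertext(x)
    def evaluate(self, pk, circuit, *cts):
        # Real FHE would evaluate `circuit` gate-by-gate on ciphertexts.
        return Ciphertext(circuit(*(ct.value for ct in cts)))
    def decrypt(self, sk, ct, max_bits):
        # In the real proposal this limit could be enforced in hardware:
        # only a few bits of output are ever released.
        return ct.value & ((1 << max_bits) - 1)

def train_and_check(seed: int) -> int:
    # Stand-in for f: "train the AI, search for a plan, check a pre-agreed
    # safety condition", returning only a tiny verdict (e.g. plan found?).
    return seed % 2

fhe = MockFHE()
pk, sk = fhe.keygen()                        # sk never touches the boxed machine

ct_seed = fhe.encrypt(pk, 12345)             # operator encrypts the input
ct_out = fhe.evaluate(pk, train_and_check, ct_seed)   # boxed machine's job

# After disconnecting or destroying the box, decrypt only the few saved bits.
print(fhe.decrypt(sk, ct_out, max_bits=4))   # -> 1
```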

There may also be uses for FHE in boxing AI (or digital minds) that differ from the exact situation described above. For instance, FHE has also been discussed in the context of secure homes for digital people. These other situations may even be more likely to occur, but I haven't read enough on this, so I am just speculating.

Suggestion: Bounties for FHE

I propose broadly two bounties (or classes of bounties):

  1. Bounty for inventing an FHE scheme that is practical in terms of running time, memory usage, and the amount of data that can be processed.

  2. Bounty for provable guarantees on the security of an FHE scheme that do not rely on open questions in complexity theory, such as the assumed hardness of lattice problems.

Why a bounty on this problem specifically?

Some reasons:

  1. Does not involve philosophical vagueness, unlike many other suggestions for AI alignment bounties. Success criteria for awarding the bounty can be objectively defined and agreed on by everyone.

  2. Can possibly have a tangible impact on the ultimate safety of advanced AI systems in deployment.

  3. People aiming for the bounty don't need to know about, care about, or agree with the arguments in favour of AI risk. This broadens the target audience for the bounty beyond the AI risk community.

  4. People attempting this bounty may not have large opportunity costs in terms of the impact they could have if they did something else. I'm not sure about this, but as far as I know, while AI safety work needs more infosec professionals, there aren't many opportunities for pure cryptographers or number theorists. I wonder if this could be a sufficiently high-impact opportunity for the latter that working on it, rather than some other impactful job, makes sense.

Success criteria for awarding bounty

Further details on how to define these criteria can be discussed. For instance, defining the criteria for practical FHE in terms of worst-case big-O complexity has the upside that it is easier for cryptographers to prove, but the downside that a winning scheme could have good worst-case complexity yet still be impractical to implement.

Incentive format of the bounty

This bounty suffers from the problems that bounties in general have. For instance, people who wish to dedicate time and resources to a problem with very low odds of success may want funding when they start, not only after they succeed.

Regular funding for cryptographers would of course be better than a one-off bounty, although I do not feel knowledgeable enough to comment on whether this makes sense given EA's current cost-effectiveness bar. (And a funder who is not an EA could of course evaluate this opportunity differently than EAs would.)

Also, many of the people in the target audience for this bounty are likely to be academics, and hence less motivated by money. That doesn't mean they wouldn't care at all. They may also care about the prestige attached to the bounty, which the bounty administrator could separately work to maintain.

I am not qualified to offer an opinion on the exact size of the bounty. Quantifying the impact of practical FHE on AI x-risk would be a first step, but there may also be other considerations. For instance, one may want to understand how motivation scales with bounty size: it is possible that a $20M bounty and a $200M bounty do not lead to meaningful differences in the motivation of the average cryptographer working on this problem, or they may. Larger bounty sizes may also cause people to talk more about the bounty, including in circles outside cryptography, which could have the additional effect of attracting people to work on AI safety.

The bounty could be offered in stock or bonds instead of cash, to reduce the funders' opportunity costs. There is still an opportunity cost, however, as this money must be set aside until the bounty is claimed or cancelled.

Red-teaming: Some more reasons not to offer this bounty

  • FHE may not actually reduce the attack surface that much, or there may be technical details I've missed that invalidate the whole approach (FHE for boxing AI) discussed here.

  • The second problem (FHE security that does not rely on open problems in complexity theory) may be impractically hard to solve, and if so, directing people to work on it may be suboptimal impact-wise.

  • Cryptographers may find the motivation of the funders for proposing this bounty alien or even low-status. More generally, all the downsides of increasing publicity of AI risk arguments among an audience that does not understand AI apply here.

  • It may be better to motivate cryptographers to work on other infosec problems that are more tractable and relevant to AI safety.

  • Even practical FHE carries a significant alignment tax. For instance, an FHE scheme that causes only a 100x or even a 3x slowdown in computation would be a major breakthrough, and could be considered "practical". But if AI training is compute-limited, requiring a leading lab to spend 3x more compute than it otherwise would in order to train AI safely may still be impractical.

I can’t immediately think of more downsides, although I suspect more exist.

Feedback

I have not spent enough time on this idea or gotten enough feedback to actually propose it as a bounty yet. At this stage I am primarily interested in feedback. Do offer your feedback!