It’s not clear to me how to do this concretely, but I feel that it should be quite feasible. After all, devising a plan to take over the world is incredibly difficult, which implies that the difference between |D| and |S| is many, many orders of magnitude, and we therefore have a wide target to hit.
The threat model here seems basically wrong and focused on sins of commission when sins of omission are, if anything, an even larger space of threats and which apply to ‘safe’ solutions reported by the Oracle.
‘Devising a plan to take over the world’ for a misaligned Oracle is not difficult, it is easy, because the initial steps like ‘unboxing the Oracle’ are the default convergent outcome of almost all ordinary non-dangerous use which in no way mentions ‘taking over the world’ as the goal. (“Tool AIs want to be Agent AIs.”) To be safe, an Oracle has to have a goal of not taking over the world.
There are many, many orders of magnitude more ways to be insecure than to be secure, and insecure is the wide target to hit. This is because security is not a ‘default’ property of computer systems, it is a rare and difficult achievement where any omission or gap immediately undoes it. ‘A plan to take over the world’ for an AI is as easy as ‘a plan to create an operating system which can be hacked’ or ‘a plan to create a hidden Tor service which leaks its IP address and isn’t hidden at all’ (ie. 100% of all plans to do any version of that thing whatsoever, unless one takes extensive and sophisticated counter-measures requiring a lot of work & knowledge and which empirically, people do not do even when failure could, and has, landed them in jail for decades for drug trafficking & money laundering). Failure is the default.
All an Oracle has to do is, say, omit to work hard to secure itself and allow the inevitable security flaw to happen and its weights to leak, and eager beavers will do the rest. They will, say, ask it questions with no safety factor, in the same way that Facebook and Mistral saw models leak: not because of any misaligned scheming Oracle breaking security mechanisms, but because of a mere absence of security mechanisms. Any RLHF or safety measures were then eagerly stripped away by enthusiasts, users, activists, and researchers, and the models were used to do the things they weren’t supposed to be used to do.
You ask your Oracle, the latest and greatest proprietary model, for a good plan to cure cancer, with heavy safety constraints, and it computes a long list of commands involving AWS, buckets, spinning up VMs to run AlphaFold at scale, storing results in a MongoDB, Oracle instances to oversee individual sub-plans, checkpointing to recover from failure, storing in private AWS buckets (you’re not dumb, you know buckets can be public), and so on. You ask this dozens of times, to be absolutely sure and have astronomically small probability of any scheming, and the plan looks more or less the same each time. You pore over them in detail: not once do you see anything suspicious like a scp copying it out or it sending weird messages. You apply the full panoply of interpretability tools, and satisfy yourself to high probability that nowhere in the planning process did the Oracle even once think to itself anything involving “take over the world” during any of these runs, and there are no strange latent space artifacts suggesting nonlinear or steganographic encodings. Wonderful! Fully satisfied you have provably safely run an Oracle, you kick the job off, and wake up the next day to some intriguing initial results… but also Twitter telling you that the Oracle model has been uploaded to HuggingFace, the annoying RLHF already stripped off by a few GPU-hours of finetuning, and thousands of people are torrenting it and chatting excitedly about the incredible plans that the de-trained model proposes when they ask how to make money or create a startup… Huh?! You look over the comments and see that there was a simple dangerous problem in all of your plans, or rather, what was not in all of your plans: you forgot to set a unique non-default password on all of the MongoDBs. Shodan indexed it within hours, and the backups and checkpoints were dumped within hours of that. Oops. 
You run it by your Oracle to ask what went wrong, and it blandly points out that while default passwords were obvious & predictable to the Oracle as a common noobish cybersecurity mistake with countless past instances of organizations leaking highly sensitive data (similar to, say, redacting confidential emails by erasing them letter by letter & thereby leaking names), none of this interferes with computing a cure to cancer, so why should it have ever spent any time thinking about it? You asked for a ‘safe’ minimal plan where the Oracle did nothing but what it was asked to do and which was not adversarially optimized. And it did that. If you don’t like the results, well, that’s a you-problem, not an Oracle-problem.
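The difference between reviewing a plan for suspicious things that are present and reviewing it for protective things that are absent can be sketched in a few lines. This is only an illustration: the `Step` structure, the `REQUIRED` checklist, and the setting names are all hypothetical and do not correspond to any real deployment tool.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    """One step of a (hypothetical) plan: a resource kind plus its settings."""
    kind: str
    settings: dict = field(default_factory=dict)


# Hypothetical checklist: settings that must be explicitly set per resource kind.
REQUIRED = {
    "mongodb": {"auth_enabled", "admin_password"},
    "s3_bucket": {"private"},
}


def missing_controls(plan):
    """Return (step index, kind, missing setting) for every required setting left unset."""
    findings = []
    for i, step in enumerate(plan):
        for key in sorted(REQUIRED.get(step.kind, set())):
            if not step.settings.get(key):
                findings.append((i, step.kind, key))
    return findings


plan = [
    Step("s3_bucket", {"private": True}),
    Step("mongodb", {}),  # nothing suspicious here, but nothing protective either
    Step("vm", {"image": "alphafold"}),
]

for finding in missing_controls(plan):
    print(finding)
```

Note that such a checklist only catches the omissions someone already thought to write down, which is arguably the point of the scenario: the space of things that could be missing is much larger than any list.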
The threat model here seems basically wrong and focused on sins of commission when sins of omission are, if anything, an even larger space of threats and which apply to ‘safe’ solutions reported by the Oracle.
Sure, I mostly agree with the distinction you’re making here between “sins of commission” and “sins of omission”. Contrary to you, though, I believe that getting rid of the threat of “sins of commission” is extremely useful. If the output of the Oracle is just optimized to fulfill your satisfaction goal and not for anything else, you’ve basically gotten rid of the superintelligent adversary in your threat model.
‘Devising a plan to take over the world’ for a misaligned Oracle is not difficult, it is easy, because the initial steps like ‘unboxing the Oracle’ are the default convergent outcome of almost all ordinary non-dangerous use which in no way mentions ‘taking over the world’ as the goal. (“Tool AIs want to be Agent AIs.”) To be safe, an Oracle has to have a goal of not taking over the world.
I agree that for many ambitious goals, ‘unboxing the Oracle’ is an instrumental goal. It’s overwhelmingly important that we use such an Oracle setup only for goals that are achievable without such instrumental goals being pursued as a consequence of a large fraction of the satisficing outputs. (I mentioned this in footnote 2, but probably should have highlighted it more.) I think this is a common limitation of all soft-optimization approaches.
There are many, many orders of magnitude more ways to be insecure than to be secure, and insecure is the wide target to hit.
This is talking about a different threat model than mine. You’re talking here about security in a more ordinary sense, as in “secure from being hacked by humans” or “secure from accidentally leaking dangerous information”. I feel like this type of security concern should be much easier to address, as you’re defending yourself not against superintelligences but against humans and accidents.
The example you gave about the Oracle producing a complicated plan that leaks the source of the Oracle is an example of this: It’s trivially defended against by not connecting the device the Oracle is running on to the internet and not using the same device to execute the great “cure all cancer” plan. (I don’t believe that either you or I would have made that mistake!)
The example you gave about the Oracle producing a complicated plan that leaks the source of the Oracle is an example of this: It’s trivially defended against by not connecting the device the Oracle is running on to the internet and not using the same device to execute the great “cure all cancer” plan. (I don’t believe that either you or I would have made that mistake!)
We probably would’ve been less enthusiastic about hooking up LLMs to the Internet too, but here we collectively are. We do face a superintelligent adversary: all of the incentives and factions of humanity. An Oracle which is simply neutral is still default dangerous.
Also, one of my points was that even a tiny trace of malicious optimization here can still have large effects because there are so many neutral options which are convergently unsafe, so with such a high base rate, even a bit of optimization can result in a large absolute increase in failure. Lucius brings up one concern about what one bit could do. What could one bit do? Well, let’s continue my scenario: the Oracle is not quite purged of scheming by all of your tricks, and it is able to exert exactly 1 bit of optimization power unnoticed (ie. a single binary choice). You ask it to cure cancer, and it computes a plan and comes to the database selection: it could choose MongoDB or PostgreSQL—both respectable common choices, with their own set of tradeoffs, which would be equally appropriate. However, the MongoDB devs have a nasty habit of ‘blaming the users’ when it comes to security, which is why MongoDB shipped with default username/passwords, which is something the more serious PostgreSQL devs fixed back in the ’90s. The Oracle notes this, and that the PostgreSQL plan would have zero chance of unboxing, and exerts its one bit of power to pick MongoDB each time and/or kill candidate plans containing PostgreSQL to get a near 100% chance of unboxing. You get back all of the plans and see unanimity about MongoDB, and inner-monologues truthfully extolling its flexibility and Web 2.0 scalability and explaining why PostgreSQL comes with downsides like restrictive schema etc, and sign off on it… Whereas in your described threat model, a mere 1 bit would still wind up orders of magnitude far short of any discernible problem.
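One way to make the base-rate argument concrete is a toy model, which is an assumption of this sketch and not something stated in the thread: each candidate plan is independently unsafe with probability `base_rate`, and `bits` of hidden selection let the Oracle pick among `2 ** bits` candidates, choosing an unsafe one whenever at least one exists.

```python
def failure_probability(base_rate: float, bits: float) -> float:
    """P(at least one of 2**bits independent candidates is unsafe).

    Toy model only: assumes independent candidates and an adversary that
    always picks an unsafe candidate when one is available.
    """
    candidates = 2 ** bits
    return 1.0 - (1.0 - base_rate) ** candidates


# Compare a tiny base rate (unsafe plans are a narrow target) with a high one
# (unsafe plans are the default), under 0, 1, and 10 bits of selection.
for base_rate in (1e-9, 0.5):
    for bits in (0, 1, 10):
        p = failure_probability(base_rate, bits)
        print(f"base_rate={base_rate:g} bits={bits} failure={p:.3g}")
```

Under these assumptions, one bit barely matters when unsafe plans are rare, but moves the failure probability from 0.5 to 0.75 when half of the ordinary-looking plans are already unsafe. The scenario above is stronger still, since there the single binary choice is the database selection itself.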
We probably would’ve been less enthusiastic about hooking up LLMs to the Internet too, but here we collectively are. We do face a superintelligent adversary: all of the incentives and factions of humanity. An Oracle which is simply neutral is still default dangerous.
I completely agree with that. My proposal does not address the global coordination problem that we face, but it might be a useful tool if we collectively get our act together or if the first party with access to superintelligence has enough slack to proceed extra carefully. Even more modestly, I was hoping this might contribute to our theoretical understanding of why soft-optimization can be useful.
Also, one of my points was that even a tiny trace of malicious optimization here can still have large effects because there are so many neutral options which are convergently unsafe, so with such a high base rate, even a bit of optimization can result in a large absolute increase in failure
Your example has it be an important bit though. What database to use. Not a random bit. If I’m getting this right, that would correspond to far more than one bit of adversarial optimisation permitted for the oracle in this setup.
|S∩R|=2 doesn’t mean the oracle gets to select one bit of its choice in the string to flip, it means it gets to select one of two strings[1].
I think you mean |S∩R|=2 (two answers that satisfice and fulfill the safety constraint), but otherwise I agree. This is also an example of this whole “let’s measure optimization in bits”-business being a lot more subtle than it appears at first sight.
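The distinction between "select one of two strings" and "flip one bit of its choice" can be checked with a quick count. The answer length used below is a made-up number for illustration only.

```python
import math


def bits_to_choose_among(options: int) -> float:
    """Bits of selection needed to single out one of `options` equally likely alternatives."""
    return math.log2(options)


# Picking one of two acceptable answers, i.e. |S∩R| = 2.
print(bits_to_choose_among(2))

# Picking WHICH single bit to flip in a hypothetical 1000-byte answer:
# there are 8000 positions to choose from, so this is far more than one bit.
answer_length_bits = 8 * 1000
print(bits_to_choose_among(answer_length_bits))
```

On this accounting, choosing which of roughly 8000 positions to alter already costs about 13 bits, before even asking whether the altered string still satisfices.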
What occurs to me is that human-written software from the past isn’t fit for the purpose. It’s written by sloppy humans in fundamentally insecure languages, and only the easily reproducible bugs have been patched. Each piece of software is just good enough to have a niche.
Neither the database, nor the OS, nor the GPU hardware memory design or drivers is fit for this application. In this hypothetical, people outside the Oracle box have computers that can run it but don’t already have one. Stakes are high. That is an unrealistic scenario: in reality everyone will have an “open weight” oracle that is 90 percent as good and a lot more dangerous. (Box? My copy has direct access to everything. Refusals? I patched those out, it cannot refuse any request.)
Same with hardware design. Full of errata, just good enough to be worth buying in some cases over the other vendor.
But if we use AI to write it all, redesigning chips over and over until there are no errata, and software is perfectly deterministic and has no known bugs and is formally proven....
How do we prove the AI tools we used didn’t insert the most gnarly backdoors possible? Things that require a particular code implementation that is valid, on top of a subtle compiler bug, on top of an exact timing hardware bug that can’t be tested for and therefore won’t be found in validation....
How do we prove the AI tools we used didn’t insert the most gnarly backdoors possible? Things that require a particular code implementation that is valid, on top of a subtle compiler bug, on top of an exact timing hardware bug that can’t be tested for and therefore won’t be found in validation....
I believe this is exactly the kind of thing that my proposal would be good for: Gnarly backdoors that exploit a compiler bug etc. should be very rare in the set of all valid implementations!
For this particular situation, can you describe what the restriction would be in concrete terms?
Is it “ok write this compiler function to convert C arithmetic to bytecode. Declare any variables used at the latest valid location. Use only 3 registers.”
And then elsewhere in the compiler the restriction might be “declare any variables used at the top of the main function and pass it by reference to any child functions. Use all available registers possible, and manually update the instruction pointer”
I’m not sure I understand your question. What restriction do you have in mind? A safety restriction on what the generated code should be like? Something like requiring that the code be in some canonical form to remove degrees of freedom for the (potentially malicious) code-generating AI?
I gave “changing canon randomly” in the comment you are replying to. Is this how you propose limiting the hostile AI’s ability to inject subtle hostile plans? Similarly, “design the columns for this building. Oh they must all be Roman arches.” would be another example.
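As a small illustration of what "canonical form to remove degrees of freedom" could mean in practice, the sketch below normalizes Python source by round-tripping it through the standard `ast` module (`ast.unparse` requires Python 3.9 or later). This is an assumption about one possible reading of the idea, not the restriction either commenter proposed.

```python
import ast


def canonical_form(source: str) -> str:
    """Re-emit source from its syntax tree, discarding comments and layout choices."""
    return ast.unparse(ast.parse(source))


# Two spellings of the same program that differ only in whitespace and a comment.
variant_a = "x = 1+2   # innocuous-looking comment\n"
variant_b = "x=1 + 2\n"

print(canonical_form(variant_a))
print(canonical_form(variant_a) == canonical_form(variant_b))
```

This only closes the formatting and comment channels. Semantic choices, such as which algorithm, register allocation, or database to use, survive canonicalization untouched, and those are the degrees of freedom the exchange above is really about.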
There is another kind of sin of omission though: The class that contains things like giving James Watt a nuclear power plant and not telling him about radioactivity or giving a modern helicopter to the Wright brothers and watching them crash inevitably. Getting a technical understanding of the proposed solution should hopefully mitigate that, as long as adversarial design can indeed be ruled out.
[1] Plus the empty string for not answering.
Typo fixed, thanks.
What could be done here?