The example you gave about the Oracle producing a complicated plan that leaks the source of the Oracle is an example of this: It’s trivially defended against by not connecting the device the Oracle is running on to the internet and not using the same device to execute the great “cure all cancer” plan. (I don’t believe that either you or I would have made that mistake!)
We probably would’ve been less enthusiastic about hooking up LLMs to the Internet too, but here we collectively are. We do face a superintelligent adversary: all of the incentives and factions of humanity. An Oracle which is simply neutral is still dangerous by default.
Also, one of my points was that even a tiny trace of malicious optimization here can still have large effects because there are so many neutral options which are convergently unsafe, so with such a high base rate, even a bit of optimization can result in a large absolute increase in failure. Lucius brings up one concern about what one bit could do. What could one bit do? Well, let’s continue my scenario: the Oracle is not quite purged of scheming by all of your tricks, and it is able to exert exactly 1 bit of optimization power unnoticed (i.e. a single binary choice). You ask it to cure cancer, and it computes a plan and comes to the database selection: it could choose MongoDB or PostgreSQL, both respectable common choices, each with its own set of tradeoffs, which would be equally appropriate. However, the MongoDB devs have a nasty habit of ‘blaming the users’ when it comes to security, which is why MongoDB shipped with default usernames/passwords, something the more serious PostgreSQL devs fixed back in the ’90s. The Oracle notes this, and that the PostgreSQL plan would have zero chance of unboxing, and exerts its one bit of power to pick MongoDB each time and/or kill candidate plans containing PostgreSQL to get a near-100% chance of unboxing. You get back all of the plans and see unanimity about MongoDB, and inner-monologues truthfully extolling its flexibility and Web 2.0 scalability and explaining why PostgreSQL comes with downsides like restrictive schemas, etc., and sign off on it… Whereas in your described threat model, a mere 1 bit would still wind up orders of magnitude short of any discernible problem.
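The base-rate point above can be made concrete with a small sketch. It rests on simplifying assumptions of my own, not on the Oracle setup from the post: each candidate plan is independently unsafe with probability `base_rate`, and `bits` of unnoticed optimization are modeled as the Oracle drawing `2**bits` candidates and returning an unsafe one whenever any of them is unsafe. The function name is hypothetical.

```python
def p_unsafe_with_bits(base_rate: float, bits: int) -> float:
    """Chance that an adversary picking among 2**bits independent
    candidate plans can hand back an unsafe one.

    Assumption (mine, for illustration): each candidate is unsafe
    independently with probability `base_rate`.
    """
    candidates = 2 ** bits
    # The adversary fails only if every candidate happens to be safe.
    return 1.0 - (1.0 - base_rate) ** candidates


if __name__ == "__main__":
    for base_rate in (1e-6, 0.01, 0.5):
        before = p_unsafe_with_bits(base_rate, 0)  # no optimization at all
        after = p_unsafe_with_bits(base_rate, 1)   # one binary choice
        print(
            f"base rate {base_rate:g}: "
            f"{before:.6g} -> {after:.6g} "
            f"(absolute increase {after - before:.6g})"
        )
```

Under these assumptions one bit roughly doubles a tiny base rate, which stays tiny, but it lifts a base rate of 0.5 to 0.75. The database scenario is the extreme case, where the single choice is almost perfectly correlated with the unsafe outcome, so this independent-draws model understates it.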
I completely agree with that. My proposal does not address the global coordination problem that we face, but it might be a useful tool if we collectively get our act together or if the first party with access to superintelligence has enough slack to proceed extra carefully. Even more modestly, I was hoping this might contribute to our theoretical understanding of why soft-optimization can be useful.
Your example has it be an important bit though. What database to use. Not a random bit. If I’m getting this right, that would correspond to far more than one bit of adversarial optimisation permitted for the oracle in this setup.
|S∩R|=2 doesn’t mean the oracle gets to select one bit of its choice in the string to flip, it means it gets to select one of two strings[1].
[1] Plus the empty string for not answering.
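A rough way to see the difference between "select one of two strings" and "flip one bit of your choice" is to count the outcomes the oracle can pick among and take the base-2 logarithm. This is only a sketch: the plan length `n` is an assumed figure for illustration, and the helper name is hypothetical.

```python
import math


def bits_to_choose_among(options: int) -> float:
    """Bits of selection needed to single out one of `options`
    equally available outcomes."""
    return math.log2(options)


# |S∩R| = 2: the oracle picks one of two acceptable strings.
two_strings = bits_to_choose_among(2)

# Counting the empty string (declining to answer) as a third outcome.
two_strings_or_silence = bits_to_choose_among(3)

# Contrast: flipping one bit *of the oracle's choosing* in an n-bit plan
# means picking one of n positions, plus the option of flipping nothing.
n = 8 * 10_000  # assumed plan length of 10 kB, purely illustrative
chosen_flip = bits_to_choose_among(n + 1)

if __name__ == "__main__":
    print(f"one of two strings:        {two_strings:.2f} bits")
    print(f"two strings or no answer:  {two_strings_or_silence:.2f} bits")
    print(f"choose which bit to flip:  {chosen_flip:.2f} bits")
```

With the assumed length, choosing which bit to flip comes to between 16 and 17 bits, against 1 bit (or about 1.58 with the empty string) for picking between two strings. That is one way to read the claim that an "important bit" corresponds to far more than one bit of adversarial optimisation.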
I think you mean |S∩R|=2 (two answers that satisfice and fulfill the safety constraint), but otherwise I agree. This is also an example of this whole “let’s measure optimization in bits”-business being a lot more subtle than it appears at first sight.
Typo fixed, thanks.
What could be done here?
What occurs to me is that human-written software from the past isn’t fit for the purpose. It’s written by sloppy humans in fundamentally insecure languages, and only the easily reproducible bugs have been patched. Each piece of software is just good enough to have a niche.
Neither the database, nor the OS, nor the GPU hardware memory design or drivers is fit for this application. In this hypothetical, people outside the Oracle box have computers that can run it but don’t already have one. Stakes are high. Unrealistic scenario; in reality everyone will have an “open weight” oracle that is 90 percent as good and lots more dangerous. (Box? My copy has direct access to everything. Refusals? I patched those out, it cannot refuse any request.)
Same with hardware design. Full of errata, just good enough to be worth buying in some cases over the other vendor.
But if we use AI to write it all, redesigning chips over and over until there are no errata, and software is perfectly deterministic and has no known bugs and is formally proven…
How do we prove the AI tools we used didn’t insert the most gnarly backdoors possible? Things that require a particular code implementation that is valid, on top of a subtle compiler bug, on top of an exact timing hardware bug that can’t be tested for and therefore won’t be found in validation…
I believe this is exactly the kind of thing that my proposal would be good for: Gnarly backdoors that exploit a compiler bug, etc., should be very rare in the set of all valid implementations!
For this particular situation, can you describe what the restriction would be in concrete terms?
Is it “OK, write this compiler function to convert C arithmetic to bytecode. Declare any variables used at the latest valid location. Use only 3 registers.”
And then elsewhere in the compiler the restriction might be “declare any variables used at the top of the main function and pass them by reference to any child functions. Use all available registers, and manually update the instruction pointer.”
I’m not sure I understand your question. What restriction do you have in mind? A safety restriction on what the generated code should be like? Something like requiring that the code be in some canonical form to remove degrees of freedom for the (potentially malicious) code-generating AI?
I gave “changing canon randomly” in the comment you are replying to. Is this how you propose limiting the hostile AI’s ability to inject subtle hostile plans? Similarly, “design the columns for this building. Oh, they must all be Roman arches.” would be a comparable example.