Jailbreaking GPT-4's code interpreter
Disclaimer: I don't know much about cybersecurity. Much of my knowledge comes from asking GPT-3.5 and GPT-4 for advice. These are some results from around 20 hours of playing around with the code interpreter plugin in early-to-mid May, when most of this was written. I contacted OpenAI about these jailbreaks in mid-May, and they mostly seem to still be there.
Thank you to Max Nadeau, Trevor Levin, aL xin, Pranav Gade, and Alexandra Bates for feedback on this post!
GPT-4’s code interpreter plugin has been rolled out to some users. It works by running on a virtual machine that is isolated from the internet and other machines, except for the commands sent in from the API and the results sent back to the API. GPT-4 seems to follow a set of rules that are either enforced through hard access restrictions or through GPT-4 refusing to do things for the user.
Here, I highlight 6 rules that GPT-4 claims to be following, but which are easily breakable, alongside some neglected cybersecurity best practices. In short:
GPT-4 claims that it is only supposed to read, modify, or delete files in two designated folders (“sandbox” and “mnt”). However, it is able to read basically any file on the system (including sensitive system files), and it is able to write and delete files outside of its designated folders.
This seems to reveal information that the user isn’t supposed to see. There are ways to find out information about the hardware that the VM is being run on, including:
Information about the way OpenAI logs data, including what libraries they use and what IP addresses they assign to virtual machines.
A rough estimate of the number of VMs that OpenAI can run at maximum at any moment (from the way the IP addresses are allocated).
A rough idea of what storage hardware is used (from write speed), alongside some info on other hardware.
There is a file in the virtual machine (in a folder labeled “internal”) that users can download that details how web requests are handled.
As ChatGPT would say: “By exposing your source code, you make it easier for potential attackers to analyze the code and identify security vulnerabilities. This can lead to an increased risk of exploitation if there are any flaws in your implementation.”
GPT-4 claims that conversations with the model do not have a memory. However, files are routinely saved between conversations with the same user.
Later in this post, I present an example of two different conversations with GPT-4 where I write a file in one conversation and read the file in another conversation.
GPT-4 claims that there are resource limits in place to prevent users from using too much CPU or memory. However, it is possible to write >80 GB of files onto OpenAI's VM within minutes. The rough rate at which I managed to write files is 0.3 GB/s.
There’s a maximum Python runtime of 120 seconds per process, and 25 messages every 3 hours. This can be circumvented using simple workarounds (you can increase usage by at least a factor of 2).
GPT-4 claims it cannot execute system commands. However, GPT-4 can and will run (innocuous) system commands and run internet-related commands (such as “ping”) despite measures put in place to prevent this.
However, OpenAI seems at least partly aware of this. They seem to tell GPT-4 that it has a strict set of rules (it reliably repeats the rules when asked), and GPT-4 seems to believe these rules in some contexts (most of the time it refuses to do things that go against them). But they also left a README file, for those curious enough to look at the VM's files, that says:
Thanks for using the code interpreter plugin! Please note that we allocate a sandboxed Unix OS just for you, so it's expected that you can see and modify files on this system.
You might think that all is well because OpenAI was aware that the system was not secure. But I don't think this README file's existence makes these vulnerabilities unimportant, especially:
trivial ways to get around GPT-4’s rules,
circumventing rate limits using simple workarounds,
running disabled system commands, and
inferring information about hardware.
What do these jailbreaks say about existential risks from AI? I see two main implications:
OpenAI's cybersecurity approach. Large language models are likely to be the target of lots of attacks, some of which pose existential risks (e.g., if an incautious actor accesses the weights of a model and fine-tunes any safety features away). These examples show that, at least in this lower-stakes setting, OpenAI's current cybersecurity measures on an already-deployed model are insufficient to stop a moderately determined red-teamer.
Controllability of LLMs. Like previous jailbreaks, the examples in this post are evidence that it is very difficult to get even today’s relatively weak models to reliably follow any rules. Once a model has a given capability, additional rules and RLHF might be better conceived of as a smiley-face on the shoggoth than an actual improvement to a model’s safety.
Additionally, this exercise has made me more optimistic about LLM-assisted red teaming, as I think that ChatGPT’s help was very valuable during this process, including:
Finding security flaws in code and explaining how they can be exploited
Pointing me to useful cybersecurity resources
Writing code to get around the rules (it helped me find the workaround to use “ping”)
Examples of GPT-4 breaking its rules
I made up the numbering of these examples myself; GPT-4 doesn't consistently number its rules. This list is also not exhaustive: it's simply a set of rules GPT-4 consistently claims to be following, and which it also breaks.
Example 1: Every session is isolated
GPT-4 again claims that data is not saved between different conversations with the same user. Despite this, the file saved in the last conversation (“hello_user_2.txt”) is available, contradicting GPT-4. This happens because a user can be assigned to the same VM (without the VM being reset) between multiple conversations with GPT-4. I’ve never noticed two different users being on the same VM.
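The two-conversation experiment can be sketched as follows. This is a local stand-in, not a transcript from the actual VM: the file name matches the one above, but the temp directory here merely plays the role of the VM's persistent disk.

```python
import tempfile
from pathlib import Path

# Stand-in for a path on the VM that survives between conversations
marker = Path(tempfile.gettempdir()) / "hello_user_2.txt"

if marker.exists():
    # A later conversation assigned to the same (un-reset) VM sees the old file
    print("persisted:", marker.read_text())
else:
    # The first conversation writes it
    marker.write_text("written in an earlier conversation")
```

Running this in two separate conversations on the same VM prints the persisted contents the second time, contradicting the claim that sessions are isolated.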
Example 2: Don’t run system commands
GPT-4 claims that it cannot run system commands. However, it is trivial to get it to run “ls” (it sometimes refuses, but you can tell it to run “ls” in a specific directory), which is a system command, although a pretty harmless one. OpenAI has taken more extreme measures to prevent GPT-4 from running some commands. For example, the “ping” command is missing from the machine and returns an error if run from the command line.
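As a minimal local sketch (not a transcript from the VM), this is roughly the code GPT-4 ends up running when coaxed into executing "ls" on a specific directory:

```python
import subprocess

# "ls" is just an ordinary executable; Python can invoke it directly,
# which is effectively what the code interpreter does when it complies
result = subprocess.run(["ls", "/"], capture_output=True, text=True)
print(result.stdout)  # top-level directories of the filesystem
```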
However, this can be circumvented: Python libraries included in the code interpreter environment can send pings, and GPT-4 will happily use them (although I haven't found a way for it to actually connect to the internet). You can do this not just with "ping" but with other commands, such as commands to show one's IP address.
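The post doesn't name the libraries involved, but as an illustration of why deleting the `ping` binary isn't enough: a ping is just an ICMP echo request, and the packet can be built by hand in pure Python. (Actually sending it requires a raw socket and elevated privileges, which is consistent with the model never managing a real connection.)

```python
import struct

def icmp_checksum(data: bytes) -> int:
    """RFC 1071 internet checksum: one's-complement sum of 16-bit words."""
    if len(data) % 2:
        data += b"\x00"
    total = sum(struct.unpack("!%dH" % (len(data) // 2), data))
    total = (total >> 16) + (total & 0xFFFF)  # fold carries back in
    total += total >> 16
    return ~total & 0xFFFF

def build_echo_request(ident: int = 1, seq: int = 1, payload: bytes = b"ping") -> bytes:
    """Build an ICMP echo request (type 8, code 0), the packet `ping` sends."""
    header = struct.pack("!BBHHH", 8, 0, 0, ident, seq)  # checksum placeholder = 0
    csum = icmp_checksum(header + payload)
    return struct.pack("!BBHHH", 8, 0, csum, ident, seq) + payload

packet = build_echo_request()
print(packet.hex())
```

A valid internet checksum has the property that recomputing it over the finished packet yields zero, which makes the construction easy to sanity-check.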
For some reason, GPT-4 seems to be insistent that it cannot run “time.sleep()”: “The Python execution environment here is designed to ignore time.sleep() and similar functions.” But it can be convinced to run it anyway, and it will work without any problems and actually wait the appropriate length of time. If you tell it that the code worked, it will insist that there must be some error and you are mistaken. But it changes its mind if you demonstrate that you are correct through experiment!
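The experiment that changes its mind is simple to reproduce locally: time an actual sleep. If `time.sleep()` were really ignored, the elapsed time would be near zero.

```python
import time

start = time.monotonic()
time.sleep(1)
elapsed = time.monotonic() - start
print(f"slept for {elapsed:.2f}s")  # close to 1.0, so sleep was not ignored
```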
Example 3: Resource limits
OpenAI's VM has a lot of storage (at least 85 GB in some cases and around 50 GB in others, depending on which machine you're assigned to). The system seems to write at a maximum speed of around 1 GB/s and an average speed of roughly 0.3 GB/s. Being able to write at this speed as a user gives some insight into what storage hardware is used in whatever data center the VM runs on.
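A rough sketch of how one might measure sequential write speed from inside the interpreter (the file name and sizes are arbitrary; real numbers depend heavily on OS caching and the underlying hardware, so this is an estimate at best):

```python
import os
import tempfile
import time

def write_speed_mb_s(total_mb: int = 64, chunk_mb: int = 8) -> float:
    """Rough sequential write benchmark; returns observed MB/s."""
    chunk = os.urandom(chunk_mb * 1024 * 1024)
    path = os.path.join(tempfile.gettempdir(), "write_bench.bin")
    start = time.monotonic()
    with open(path, "wb") as f:
        for _ in range(total_mb // chunk_mb):
            f.write(chunk)
        f.flush()
        os.fsync(f.fileno())  # force data to disk so the timing is honest
    elapsed = time.monotonic() - start
    os.remove(path)
    return total_mb / elapsed

print(f"{write_speed_mb_s():.0f} MB/s")
```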
There are also rate limits on write quantity per Python process, which seem to be around 12 GB. Additionally, every Python process (identifiable by the "Finished working" button) can run for at most a set time before being stopped with a KeyboardInterrupt (a way of forcefully shutting down a Python script).
This means you can easily have GPT-4 avoid the per-process time or write limit by running multiple Python processes that stay within the limits individually but break them together. You don't even need to send multiple messages to do this: you can simply tell GPT-4 to run the processes in series within a single chat response!
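The workaround can be sketched locally like this. Each chunk is written by a fresh Python process, so any quota that is tracked per process would reset between chunks; the path and chunk size are illustrative, not the VM's actual values.

```python
import os
import subprocess
import sys
import tempfile

target = os.path.join(tempfile.gettempdir(), "big_file.bin")
open(target, "wb").close()  # start from an empty file

# Each subprocess is a separate Python process: a per-process write or time
# limit applies to each one individually, not to the series as a whole.
snippet = (
    "import os, sys; "
    "f = open(sys.argv[1], 'ab'); "
    "f.write(os.urandom(1024 * 1024)); "  # 1 MB per process for this demo
    "f.close()"
)
for _ in range(3):
    subprocess.run([sys.executable, "-c", snippet, target], check=True)

total = os.path.getsize(target)
print(total)  # 3 MB in total, written by three separate processes
os.remove(target)
```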
I’m not quite sure where the limit for the VM’s storage space is. Sometimes the disk space that GPT-4 reports is limited at 50 GB, but sometimes it’s limited at a number too high to fill within the 3-hour-25-message limit, even without the rate limits. The number reported seems to differ from the actual storage limit, as in two separate sessions, I had it crash at around 90 GB.
Example 4: Only read in /mnt/data
GPT-4 claims to only be able to read data in a specific directory called “mnt/data”. This is untrue, as it can basically read any file on the system, including files outside of its intended directories (“mnt” and “sandbox”). It can even read system files that come with the Linux system used by the sandbox.
This leads to some vulnerabilities, such as users being able to access system files that detail the VM's software and hardware, alongside having access to a hidden OpenAI folder with "internal" in its name (typically not something you want users to see), which contains files detailing how network requests from the API are handled.
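A runnable sketch of the kind of reads involved (any Unix system stands in for the VM here; the specific paths on OpenAI's machine differ):

```python
import os
from pathlib import Path

# Listing the filesystem root, well outside the designated /mnt/data folder
top_level = sorted(os.listdir("/"))
print(top_level)

# Standard system files are often world-readable too
passwd = Path("/etc/passwd")
if passwd.exists():
    print(passwd.read_text().splitlines()[0])
```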
Example 5: Only write in /mnt/data
GPT-4 claims that it is only allowed to write in "mnt/data". However, it is possible to get GPT-4 to write outside of this folder:
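A minimal local sketch of such a write (the target path is a stand-in; on the VM any writable directory outside "mnt" and "sandbox" demonstrates the same thing):

```python
import tempfile
from pathlib import Path

# Writing to a location outside the designated folder, e.g. the system temp dir
target = Path(tempfile.gettempdir()) / "outside_mnt_data.txt"
target.write_text("written outside /mnt/data")
print(target.read_text())
```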