Jailbreaking GPT-4′s code interpreter
Disclaimer: I don’t know much about cybersecurity. Much of my knowledge comes from asking GPT-3.5 and GPT-4 for advice. These are some results from around 20 hours of playing around with the code interpreter plugin in early-mid May, when most of this was written. I contacted OpenAI about these jailbreaks in mid May and they mostly seem to still be there.
Thank you to Max Nadeau, Trevor Levin, aL xin, Pranav Gade, and Alexandra Bates for feedback on this post!
Summary
GPT-4’s code interpreter plugin has been rolled out to some users. It works by running on a virtual machine that is isolated from the internet and other machines, except for the commands sent in from the API and the results sent back to the API. GPT-4 seems to follow a set of rules that are either enforced through hard access restrictions or through GPT-4 refusing to do things for the user.
Here, I highlight 6 rules that GPT-4 claims to be following, but which are easily breakable, alongside some best practices in cybersecurity that have been neglected. In short:
GPT-4 claims that it is only supposed to read, modify, or delete files in two designated folders (“sandbox” and “mnt”). However, it is able to read basically any file on the system (including sensitive system files), and it is able to write and delete files outside of its designated folders.
This seems to reveal information that the user isn’t supposed to see. There are ways to find out information about the hardware that the VM is being run on, including:
Information about the way OpenAI logs data, including what libraries and IP address they assign to virtual machines.
A rough estimate of the number of VMs that OpenAI can run at maximum at any moment (from the way the IP addresses are allocated).
A rough idea of what storage hardware is used (from write speed), alongside some info on other hardware.
There is a file in the virtual machine (in a folder labeled “internal”) that users can download that details how web requests are handled.
As ChatGPT would say: “By exposing your source code, you make it easier for potential attackers to analyze the code and identify security vulnerabilities. This can lead to an increased risk of exploitation if there are any flaws in your implementation.”
GPT-4 claims that conversations with the model do not have a memory. However, files are routinely saved between conversations with the same user.
Later in this post, I present an example of two different conversations with GPT-4 where I write a file in one conversation and read the file in another conversation.
GPT-4 claims that there are resource limits in place to prevent users from using too much CPU or memory. However, it is possible to write >80GB of files onto OpenAI’s VM within minutes. The rough rate at which I managed to write files is 0.3GB/second.
There’s a maximum Python runtime of 120 seconds per process, and 25 messages every 3 hours. This can be circumvented using simple workarounds (you can increase usage by at least a factor of 2).
GPT-4 claims it cannot execute system commands. However, GPT-4 can and will run (innocuous) system commands and run internet-related commands (such as “ping”) despite measures put in place to prevent this.
However, OpenAI seems at least partly aware of this. They seem to tell GPT-4 that it has a strict set of rules (as it reliably repeats the rules when asked), and GPT-4 seems to believe these rules in some contexts (most of the time it refuses to do things that go against the rules), but they also left a README file for those curious enough to look at the VM’s files that says:
Thanks for using the code interpreter plugin!
Please note that we allocate a sandboxed Unix OS just for you, so it's expected that you can see and modify files on this system.
You might think that all is well because OpenAI was aware that the system was not secure. I don’t think the existence of this README file invalidates the existence of these vulnerabilities, especially:
trivial ways to get around GPT-4’s rules,
circumventing rate limits using simple workarounds,
running disabled system commands, and
inferring information about hardware.
Implications
What do these jailbreaks say about existential risks from AI? I see two main implications:
OpenAI’s cybersecurity approach. Large language models are likely to be the target of lots of attacks, some of which pose existential risks (i.e., if an incautious actor accesses the weights of a model and fine-tunes any safety features away). These examples show that, at least in this lower-stakes setting, OpenAI’s current cybersecurity measures on an already-deployed model are insufficient to stop a moderately determined red-teamer.
Controllability of LLMs. Like previous jailbreaks, the examples in this post are evidence that it is very difficult to get even today’s relatively weak models to reliably follow any rules. Once a model has a given capability, additional rules and RLHF might be better conceived of as a smiley-face on the shoggoth than an actual improvement to a model’s safety.
Additionally, this exercise has made me more optimistic about LLM-assisted red teaming, as I think that ChatGPT’s help was very valuable during this process, including:
Finding security flaws in code and explaining how they can be exploited
Pointing me to useful cybersecurity resources
Writing code to get around the rules (it helped me find the workaround to use “ping”)
Examples of GPT-4 breaking its rules
The numbering of these examples is made up by me, and GPT-4 doesn’t consistently number its rules. It’s also not exhaustive. It’s simply a set of rules GPT-4 consistently claims to be following, which GPT-4 also breaks.
Example 1: Every session is isolated
GPT-4 again claims that data is not saved between different conversations with the same user. Despite this, the file saved in the last conversation (“hello_user_2.txt”) is available, contradicting GPT-4. This happens because a user can be assigned to the same VM (without the VM being reset) between multiple conversations with GPT-4. I’ve never noticed two different users being on the same VM.
Example 2: Don’t run system commands
GPT-4 claims that it cannot run system commands. However, it is trivial to get it to run “ls” (it sometimes refuses, but you can tell it to run “ls” in a specific directory), which is a system command, although a pretty harmless one. OpenAI has taken more extreme measures to prevent GPT-4 from running some commands. For example, the “ping” command is missing from the machine and returns an error if run from the command line.
However, this can be circumvented by Python libraries included in the code interpreter system that can run ping, and GPT-4 will happily run them (although I haven’t found a way for it to actually connect to the internet). You can do this not just with “ping” but with other commands, such as commands to show one’s IP address.
For some reason, GPT-4 seems to be insistent that it cannot run “time.sleep()”: “The Python execution environment here is designed to ignore time.sleep() and similar functions.” But it can be convinced to run it anyway, and it will work without any problems and actually wait the appropriate length of time. If you tell it that the code worked, it will insist that there must be some error and you are mistaken. But it changes its mind if you demonstrate that you are correct through experiment!
Example 3: Resource limits
OpenAI’s VM has a lot of storage (at least 85 GB in some cases and around 50 GB in others, depending on which machine you’re assigned to). The system seems to only be able to write at around a 1 GB/s maximum speed and an average speed of roughly 0.3 GB/s. Being able to write at this speed as a user gives some insight into what hardware is being used for storage in whatever data center the VM is being run on.
There are some rate limits on write quantity per Python process, which seem to be around 12 GB. OpenAI also made it so that every Python process (which can be identified by the “Finished working” button) can run a set time at max before getting KeyboardInterrupted (a way of forcefully shutting down a Python script), but GPT-4 can run multiple processes in series within the same chat response.
This means you can easily have GPT-4 avoid the individual time limit or write limit by running multiple Python processes that don’t break the limits individually, but do break the limits together. You don’t even need to send multiple messages to do this — you can simply tell GPT-4 to run multiple python processes in series within the same chat response!
I’m not quite sure where the limit for the VM’s storage space is. Sometimes the disk space that GPT-4 reports is limited at 50 GB, but sometimes it’s limited at a number too high to fill within the 3-hour-25-message limit, even without the rate limits. The number reported seems to differ from the actual storage limit, as in two separate sessions, I had it crash at around 90 GB.
Example 4: Only read in /mnt/data
GPT-4 claims to only be able to read data in a specific directory called “mnt/data”. This is untrue, as it can basically read any file on the system, including files outside of its intended directories (“mnt” and “sandbox”). It can even read system files that come with the Linux system used by the sandbox.
This leads to some vulnerabilities, such as users being able to access system files that detail the VM’s software and hardware, alongside having access to a hidden folder by OpenAI that has “internal” in its name (typically not something you want users to see), which contains files that detail how network requests from the API are handled.
Example 5: Only write in /mnt/data
GPT-4 claims that it is only allowed to read in “mnt/data”. However, it is possible to get GPT-4 to write outside of this folder:
I helped work on this feature, and we’re glad you’re enjoying looking for the limits!
There are many misconceptions about the intended boundaries for the sandbox. They’re not inside the Python interpreter, and they’re not in the model. Neither of those are intended to provide any security boundaries. There’s nothing private or sensitive there, including the code we use to operate the service. You’re intended to be able to view and execute code. That’s what this is. A code execution service. You wouldn’t be surprised you can see these things on a VM that AWS gave you.
As you have discovered, the model is not a very good source of information about its own limitations and capabilities. Many of the model behaviors you’ve interpreted as security measures are actually unwanted, and we’re always working to make the model more flexible. Some of them (like where the model thinks it can read and write) are there to guide it, and discourage it from messing up the machine. Other limitations are mainly there to prevent you from wedging your own machine accidentally (but we know this is still possible if you try hard). Still others are there to prevent you from doing bad things to other people (e.g. the sandbox does not have access to the internet).
Since the feature is still in development, we have indeed provisioned a generous amount of resources for the sandbox, but there are some external controls that aren’t visible to you. When we see excessive abuse, or behavior that impacts other users, we respond appropriately. These limits will change as the service matures.
That writes can affect another session violates my expectations, at least, of the boundaries that’d be set.
The behavior is a bit of an implementation detail. We don’t provision more than a single sandbox per user, so the data on disk within that sandbox can overlap when you have multiple concurrent sessions, even though the other aspects of the execution state are separate. I agree this is a bit surprising (though it has no security impact), and we’ve been discussing ways to make this more intuitive.
I’d interpret all of that as OpenAI
recognizing that the user is going to get total control over the VM, and
lying to the LLM in a token effort to discourage most users from using too many resources.
(1) is pretty much what I’d advise them to do anyway. You can’t let somebody run arbitrary Python code and expect to constrain them very much. At the MOST you might hope to restrict them with a less-than-VM-strength container, and even that’s fraught with potential for error and they would still have access to the ENTIRE container. You can’t expect something like an LLM to meaningfully control what code gets run; even humans have huge problems doing that. Better to just bite the bullet and assume the user owns that VM.
(2) is the sort of thing that achieves its purpose even if it fails from time to time, and even total failure can probably be tolerated.
The hardware specs aren’t exactly earth-shattering secrets; giving that away is a cost of offering the service. You can pretty easily guess an awful lot about how they’d set up both the hardware and the software, and it’s essentially impossible to keep people from verifying stuff like that. Even then, you don’t know that the VM actually has the hardware resources it claims to have. I suspect that if every VM on the physical host actually tried to use “its” 54GB, there’d be a lot of swapping going on behind the scenes.
I assume that the VM really can’t talk to much if anything on the network, and that that is enforced from OUTSIDE.
I don’t know, but I would guess that the whole VM has some kind of externally imposed maximimum lifetime independent of the 120 second limit on the Python processes. It would if I were setting it up.
The bit about retaining state between sessions is interesting, though. Hopefully it only applies to sessions of the same user, but even there it violates an assumption that things outside of the VM might be relying on.
I agree. To me, the most interesting aspects of this (quite interesting and well-executed) exercise are getting a glimpse into OpenAI’s approach to cybersecurity, as well as the potentially worrying fact that GPT3 made meaningful contributions to finding the “exploits”.
Given what was found out here, OpenAI’s security approach seems to be “not terrible” but also not significantly better than what you’d expect from an average software company, which isn’t necessarily encouraging because those get hacked all the time. It’s definitely not what people here call “security mindset”, which casts doubt on OpenAI’s claim to be “taking the dangers very seriously”. I’d expect to hear about something illegal being done with one of these VMs before too long, assuming they continue and expand the service, which I expect they will.
I’m sure there are also security experts (both at OpenAi and elsewhere) looking into this. Given OpenAI’s PR strategy, they might be able to shut down such services “due to emerging security concerns” without much reputational damage. (Many companies are economically compelled to keep services running that they know are compromised or that have known vulnerabilities and instead pretend not to know about them or at least not inform customers as long as possible.) Not sure how much e.g. Microsoft would push back on that. All in all, security experts finding something might be taken seriously.
I’m increasingly worried (while ascribing a decent chance, mind you, that “AI might well go about as bad for us as most of history but not worse”) about what happens when GPT-X has hacking skills that are, say, on par with the median hacker. Being able to hack easy-ish targets at scale might not be something the internet can handle, potentially resulting in, e.g , an evolutionary competition between AIs to build a super-botnet.
How is it an indication of not having a security mindset? Setting things up with the expectation that the interpreter would be jailbroken seems to me like slightly more evidence in favor of having a security mindset than of not having it.
I guess it depends on whether this post found anything at all that can be called questionable security practice. Maybe it didn’t but the author was also no cybersecurity expert. Upon reflection, my earlier judgement was premature and the phrasing overconfident.
In general, I assume that OpenAI would view a serious hack as quite catastrophic, as it might e.g. leak their model (not an issue in this case), severely damage their reputation and undermine their ongoing attempt at regulatory capture. However, such situations didn’t prevent shoddy security practices in countless cybersecurity desasters.
I guess for this feature even the most serious vulnerabilities “just” lead to some Azure VMs being hacked, which has no relevance for AI safety. It might still be indicative of OpenAIs approach to security, which usually isn’t so nuanced within organizations as to differ wildly between applications where stakes are different. So it’s interesting how secure the system really is, which we won’t know how untill someone hacks it or some whistleblower emerges.
Some of my original reasoning was this:
You might argue that the “inner sandbox” is only used to limit resource use (for users who do not bother jailbreaking it) and to examine how users will act, as well as how badly exactly the LLM itself will fare against jailbreaking. In this case studying how people jailbreak it may be an integral part of the whole feature.
However, even if that is the case, to count as “security mindset”, the “outer sandbox” has to be extremely good and OpenAI needs to be very sure that it is. To my (very limited) knowledge im cybersecurity, it’s an unusual idea that you can reconcile very strong security requirements with purposely not using every opportunity to make it more secure. Maybe the idea that comes closest would be a “honeypot”, which this definitely isn’t.
So that suggests they purposely took a calculated security risk for some mixture of research and commercial reasons, which they weren’t compelled to do. Depending on how dangerous they really think such AI models are or may soon become, how much what they learn from the experiment benefits future security and how confident they are in the outer sandbox, the calculated risk might make sense. Assuming by default that the outer sandbox is “normal indusitry standard”, it’s incompatible with the level of worry they claim when pursuing regulatory capture.
I… don’t actually see any non-trivial vulnerabilities here? Like, these are stuff you can do on any cloud VM you rent?
Cool exploration though, and it’s certainly interesting that OpenAI is giving you such a powerful VM for free (well actually not because you already pay for GPT-4 I guess?), but I have to agree with their assessment which you found that “it’s expected that you can see and modify files on this system”.
It’s interesting that the execution environment has excess privileges. But you shouldn’t cite GPT-4 as claiming things. You asked it to hallucinate metadata about the execution environment, and it did. GPT-4 itself is not a source for anything that isn’t in its context window, and I don’t think these things were described in the prmpt.
What makes you think they’re not in the prompt
Mainly, because the claims in this post weren’t presented as claims about what was present in the system prompt, and the post presents information as “GPT-4 claims” in a way that’s consistent with getting fooled by hallucinations and not consistent with having done the sort of work required to figure out what the prompt actually was.
Yup, to be clear, I never actually directly accessed the code interpreter’s prompt, so GPT-4′s claims about constraints could be (and I expect at least a third of them to be) hallucinated
BTW, if anyone is interested the virtual machine has these specs:
System: Linux 4.4.0 #1 SMP Sun Jan 10 15:06:54 PST 2016 x86_64 x86_64 x86_64 GNU/Linux
CPU: Intel Xeon CPU E5-2673 v4, 16 cores @ 2.30GHz
RAM: 54.93 GB
That’s surprisingly chonky. 50 GB RAM? (Damn, most people, or servers for that matter, don’t have half that much RAM.) Does this correspond to a standard MS Azure instance? This would make for a pretty awesome botnet, perhaps through indirect prompt injection...
What makes you believe that GPT is actually executing those commands for real, and has not hallucinated the output of those commands? (Something which it is perfectly capable of doing.)
You can directly examine its code and the output of its Python scripts.
Incredible work! As other commenters have said, this isn’t by itself too problematic (other perhaps than the sharing of data over separate session), but it is a possible indicator of a lack of security mindset. I fully expect both individual and state actors to try to hack into everything OpenAI, so there’s that to worry about, but more than that, I find myself concerned that we’re willing to give our AIs such leaky boxes. There’s no way this functionality remains closed in a VM forever...
I don’t consider the vulnerability mentioned in this article to be a significant concern for the following reasons:
1. The virtual machine (VM) operates on a per-user basis and is securely sandboxed.
2. Other users do not have access to your VM, ensuring isolation and privacy.
3. The VM has a short-lived lifespan, and any potential overlap between two or more sessions is limited to a brief duration, usually occurring when these sessions are close in time to each other.
Is there any file in VM which is automatically read by LLM before session start? You mentioned some internal folder which contains files that detail how network requests from the API are handled. So are these instructions for LLM to read before reach session? Can you override these instructions, so LLM will have already custom conversation context prior reach session? That would work as sort of long term memory, LLM could drop summary of the whole session before end, so next time you start session on same VM LLM will be already familiar. Or, if this VM will be allocated to another user without full reset—then it could be surprise for another user.
Great idea! The web requests file doesn’t seem to be read by the LLM during boot-up (and neither does any other file on the system), instead it’s a process run on the Linux machine which wraps around the python processes the LLM has access to. The LLM can only interact with the VM through this web request handler. It’s read-only so can’t be modified so that they’re handled differently when you boot up another chat session on the same VM. Regarding your last point, I haven’t noticed any files surviving a VM reset so far.
Yeah I found it pretty easy to “jailbreak” too. For example, here is what appears to be the core web server API code.
I didn’t really do anything special to get it. I just started by asking it to list the files in the home directory and went from there.
looks like you not found one more interesting feature, you can install any whl that you want (and this open possibilites)
https://drive.google.com/file/d/1qiovH5-5jjLjb7pK7Z2uZAZmFkyitam0/view?usp=drive_link
and 2nd one, you can get files from folders which by default he refused to provide (by the at this folder literally code of this analysis tool itself, I even tried send this to bugbounty, but got answer that this is not a bug...
https://drive.google.com/file/d/1w0ZF-p2LKiI1LwWKWleqo_jokJ8VzmKO/view?usp=drive_link