(Disclaimer: I work at OpenAI, and I worked on the models/research behind copilot. You should probably model me as a biased party)
This will probably make the already-bad computer security/infosec situation significantly worse.
I’ll take the other side to that bet (the null hypothesis), provided the “significantly” unpacks to something reasonable. I’ll possibly even pay to hire the contractors to run the experiment.
I think a lot of the claims people make about new tech having a significant impact end up falling flat. A new browser will revolutionize this or that; a new web programming library will make apps significantly easier to build, etc. etc.
I think a good case in point is TypeScript. JavaScript is the most common language on the internet. TypeScript adds static typing (and all sorts of other strong guarantees) and has been around for a while. However, I would not say that TypeScript has significantly impacted the security/infosec situation.
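To make the TypeScript point concrete, here’s a toy sketch (function names are illustrative, not from any real codebase) of the kind of guarantee static typing adds: a bug class that plain JavaScript only surfaces at runtime gets rejected at compile time.

```typescript
// In plain JavaScript, add("5", 3) silently returns the string "53"
// (the + operator falls back to string concatenation) and nothing
// complains until something downstream misbehaves at runtime.
// With TypeScript annotations, the bad call is rejected at compile time.
function add(a: number, b: number): number {
  return a + b;
}

console.log(add(5, 3)); // prints 8

// add("5", 3);
// ^ compile error: Argument of type 'string' is not assignable to
//   parameter of type 'number'.
```

Even so, per the point above, catching this whole class of bugs hasn’t visibly moved the overall infosec needle, mostly because adoption is partial.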
I think my prediction is that Copilot does not significantly affect the computer security/infosec situation.
It’s worth separating out that this line of research—in particular training large language models on code data—probably has a lot more possible avenues of impact than a code completer in VS Code. My prediction is not about the sum of all large language models trained on code data.
I also do think we agree that it would be good if models always produced the code-we-didn’t-even-know-we-wanted, but for now I’m a little bit wary of models that can do things like optimize code outside of our ability to notice/perceive.
Are you saying that (i) few people will use copilot, or (ii) many people will use copilot but it will have little effect on their outputs or (iii) many people will use copilot and it will boost their productivity a lot but will have little effect on infosec? Your examples sound more like supporting i or ii than supporting iii, but maybe I’m misinterpreting.
I think all of those points are evidence that updates me in the direction of the null hypothesis, but I don’t think any of them is true to the exclusion of the others.
I think a moderate number of people will use copilot. Cost, privacy, and the need for an internet connection will all limit this.
I think copilot will have a moderate effect on users’ outputs. I think it’s the best new programming tool I’ve used in the past year, but I’m not sure I’d trade it for, e.g., interactive debugging (my reference example of a very useful programming tool).
I think copilot will have no significant differential effect on infosec, at least at first. The same way I think the null hypothesis for a language model should be that it produces average language, I think the null hypothesis for a code model is that it produces average code (average here meaning it doesn’t improve or worsen the infosec situation that jim is pointing to).
In general these lead me to putting a lot of weight on ‘no significant impact’ in aggregate, though I think it is difficult for anything to have a significant impact on the state of computer security.
(Some examples come to mind: the Snowden leaks (almost definitely), Let’s Encrypt (maybe), HTTPS Everywhere (maybe), domain authentication (maybe).)
Summary of the debate
1. jim originally said that copilot produces code with vulnerabilities, which, if used extensively, could generate loads of vulnerabilities, giving more opportunities for exploits overall. jim mentions it worsening infosec “significantly”.
2. alex responds that given that the model tries to reproduce the code it was trained on, it will (by definition) produce average-level code (with an average level of vulnerability), so it won’t change the situation “significantly”, as the % of vulnerabilities per line of code produced (in the world) won’t change much.
3. vanessa asks whether the absence of change from copilot results from a) lack of use, b) lack of change in the speed or vulnerability of code production from using it (i.e. it’s used as light assistance but without strong influence on the safety of the code, as people would still be rigorous), or c) a change in speed/productivity, but not in the % of vulnerabilities.
4. alex answers that it indeed makes users more productive and helps him a lot, but that this doesn’t affect overall infosec in terms of % of vulnerabilities (same argument as 2). He nuances his claim a bit, saying that a) it would moderately affect outputs, b) some things like cost will limit how much it affects those, and c) it won’t change things substantially, at first (conjunction of two conditions).
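As a concrete (hypothetical) instance of the “average code, average vulnerabilities” pattern at issue in points 1–2: string-built SQL is extremely common in public code, so a model trained on it will happily emit it, while the standard fix is a parameterized query. The function names and query here are illustrative, not actual Copilot output.

```typescript
// Vulnerable pattern, common in training data: splicing user input into SQL.
function findUserUnsafe(name: string): string {
  return `SELECT * FROM users WHERE name = '${name}'`;
}

// Safer pattern: placeholder plus bound parameter
// (exact placeholder syntax varies by database driver).
function findUserSafe(name: string): { sql: string; params: string[] } {
  return { sql: "SELECT * FROM users WHERE name = ?", params: [name] };
}

// Attacker-controlled input changes the meaning of the unsafe query:
console.log(findUserUnsafe("x' OR '1'='1"));
// prints: SELECT * FROM users WHERE name = 'x' OR '1'='1'
```

Both patterns appear throughout the training distribution, which is exactly the “average code” point: the model reproduces whatever mix it saw, neither improving nor worsening the baseline rate.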
What I think is the implicit debate
i) I think jim kind of implicitly assumes that whenever someone writes code by themselves, they are forced to have good security habits etc., and that whenever the code is automatically generated, people won’t use their “security” muscles that much & will assume the AI produced clean work… which apparently (given jim’s examples) it does not do by default. Like a Tesla not being safe enough at self-driving.
ii) I think what’s missing from the debate is that the overall “infosec level” depends heavily on what a few key actors decide to do: those in charge of safety-critical codebases for society-level tools (like nukes). So one argument could be that, although the masses might be more productive for prototyping etc., the actual infosec people might remain just as careful / not use it at all, so the infosec that matters won’t change, and thus the overall infosec won’t change.
iii) I think vanessa’s point kind of re-states i) and disagrees with ii) by saying that everyone will use this anyway? Because, by definition, if it’s useful it will change their code/habits; otherwise it’s not useful.
iv) I guess alex’s implicit points are that code generation with language models producing average human code was going to happen anyway, & that calling it a significant change is an overstatement, & we should probably just assume no drastic change in the % vulnerability distribution, at least for now.
I think jim kind of implicitly assumes that whenever someone writes code by themselves, they are forced to have good security habits etc.,
This part I think is not quite right. The counterfactual jim gives for Copilot isn’t manual programming, it’s StackOverflow. The argument is then: right now StackOverflow has better methods for promoting secure code than Copilot does, so Copilot will make the security situation worse insofar as it displaces SO.
I think my prediction is that Copilot does not significantly affect the computer security/infosec situation.
This is my prediction too, but there are two strands to the argument that I think are worth teasing apart:
First, how many people will use Copilot? The base rate for infosec impact of innovations is very low, because most innovations are taken up slowly or not at all. TypeScript is typical: most people who could use TypeScript use JavaScript instead (see for example the TIOBE rankings), so even if TypeScript prevented all security problems it couldn’t impact the overall security situation much. Garbage collection is another classic example: it was in production systems in the late 60s, but didn’t become mainstream until the 90s with the rise of Java and Perl. There was a span of 20+ years where GC didn’t much affect the use-after-free landscape, even though GC prevents 100% of use-after-free bugs.
(Counterpoint: StackOverflow was also an innovation, it was taken up right away, and Copilot is more like StackOverflow than like a traditional technical innovation. I don’t really buy this, because Copilot seems like it’ll be much harder to get started with, even once it’s out of beta.)
Second, are users of Copilot more or less likely to write security bugs? Here my prediction points the other way: Copilot does generate security bugs, and users are unusually unlikely to catch them because they’ll tend to use it in domains they’re unfamiliar with. Somewhat more weakly I think it’ll be worse than the counterfactual where they don’t have Copilot and have to use something else, for the reasons jimrandomh lists.
I’m curious whether you see the breakdown the same way, and if so, how you see the impact of Copilot conditional on its being widely adopted.