Competitive programming with AlphaCode

Algon Feb 2, 2022, 4:49 PM

58 points

Note: I’m totally going to alter the article based on the comments. Though of course I’ll give credit where it is due.

Edit: Apparently there is a post by one of the authors going in depth on the paper. Here you go if you don’t want to read this. He gives more details than the blog, and gives a neat example of one of the questions AlphaCode solves. But he doesn’t give a lot of details on what AlphaCode actually is.

Predict the headline result. Go on, I dare you.

DeepMind released a new Transformer model which produces code in response to questions from contests on Codeforces. They simulated participation in 10 recent contests, and submitted no more than 10 potential solutions for each problem.

Any predictions on how likely people thought this was? Submit your answers below^[1].

Would you predict that AlphaCode can outperform the lower quartile of programmers in Codeforces’ competitive programming contests?

AI Capabilities AI Programming DeepMind

Daniel Kokotajlo Feb 2, 2022, 8:53 PM
17 points

“Advanced AI risks. Longer term, code generation could lead to advanced AI risks. Coding capabilities could lead to systems that can recursively write and improve themselves, rapidly leading to more and more advanced systems.”
Cool to see this acknowledged in the paper!
- Algon Feb 3, 2022, 4:23 AM
  3 points
  Parent
  
  I guess I would have liked more discussion on how this impacted DeepMind’s views on the topic. Could we get a discussion started on that in the comment section? I’ve added some thoughts on the topic to the article, but I haven’t yet sat down and done explicit calculations for how I should update, how this ties into stuff like John Wentworth’s research agenda, how much more likely we are to be in Paul-verse, or even whether Paul and Eliezer would agree that this result favours Paul-verse.
  - UHMWPE-UwU Feb 3, 2022, 5:02 PM
    21 points
    Parent
    
    In which way does this news “favour Paul-verse”?
    - Algon Feb 3, 2022, 5:09 PM
      1 point
      Parent
      
      Tons of small improvements can make big changes to performance, in a way that I did not think they would be able to. Or at least, not so soon. If they had made just one change and it had resulted in the same performance increase, I’d say that is way more Eliezer-verse.
      - Daniel Kokotajlo Feb 3, 2022, 10:09 PM
        12 points
        Parent
        
        Huh. Doesn’t this also push towards faster progress and thus faster takeoff though, and thus Eliezer-verse?
        Algon Feb 3, 2022, 10:19 PM
        1 point
        Parent
        
        Paul seems down with progress being potentially quite fast. I’d be willing to bet he thinks a takeoff in 15 years is on the cusp of plausibility. I agree that the speed is maybe more Eliezer verse, but how it happened is more Paulverse. So I’d guess l(AlphaCode|Paulverse)>l(AlphaCode|Eliezerverse). But I don’t understand Eliezerverse enough to be confident about the numerical value of the likelihoods.
        Daniel Kokotajlo Feb 4, 2022, 3:02 AM
        15 points
        Parent
        
        Idk. At this point I want to taboo “Paul-verse” and “Eliezer-verse” and use more specific, descriptive terms instead.
Derek M. Jones Feb 2, 2022, 5:57 PM
6 points

My reading of Appendix A is that the group did its own judging, i.e., did not submit answers to Codeforces.
They generated lots of human verified test data, but then human implementors would do something similar.
They trained on Github code, plus solutions code on Codeforces. Did they train on Codeforces solutions code that solved any of the problems? Without delving much deeper into the work, I cannot say. They do call out the fact that the solutions did not include chunks of copy-pasted code.
To what extent are the successes presented representative of the problems tried? That is, did they try to solve lots of problems and we are seeing the cases that worked well? The fact that they were able to get solutions to some problems was impressive.
The solved problems had short solutions. How well does the technique scale to problems requiring more code for their solution? I suspect it doesn’t, but then there are applications where the solutions are often short.
Jacob Pfau Feb 2, 2022, 8:02 PM
5 points

It’s worth noting that Table 7 shows Github pre-training outperforming MassiveText (natural language corpus) pre-training. The AlphaCode dataset is 715GB compared to the 10TB of MassiveText (which includes 3TB of Github). I have not read the full details of both cleaning processes, but I assume that the cleaning / de-duplication process is more thorough in the case of the AlphaCode Github only dataset. EDIT: see also Algon’s comment on this below.

I know of a few EAs who thought that natural language pre-training will continue to provide relevant performance increases for coding as training scales up over the next few years, and I see this as strong evidence against that claim. One remaining question might be whether this finding is an artefact of code dataset growth temporarily accelerating relative to NL dataset size growth. Insofar as dataset sizes are constrained by compute rather than absolute data availability, I think we should expect code dataset sizes to approach NL dataset sizes. Also cf. my recent Metaculus question on future prospects for NL-pretraining to programming transfer.
- gwern Feb 2, 2022, 8:32 PM
  15 points
  Parent
  
  
  “I know of a few EAs who thought that natural language pre-training will continue to provide relevant performance increases for coding as training scales up over the next few years, and I see this as strong evidence against that claim.”
  
  I think that was largely settled by the earlier work on transfer scaling laws and Bayesian hierarchical interpretations: pretraining provides an informative prior which increases sample-efficiency in a related task, providing in essence a fixed n sample gain. But enough data washes out the prior, whether informative or uniform. So if you have enough data and compute (which you usually don’t), transfer results in the same final performance. This is also true in, say, image classification. Stuff like CLIP is great for transfer learning—unless you have millions of labeled images in your target domain, in which case, yeah sure it’ll probably just match the from-scratch model. (How else could things possibly work?) 715GB of Github is definitely large enough that it washes out the natural language prior! But as the Copilot paper also points out, you still get a big benefit in that you can train a lot less when you start with GPT-3 as the prior. Nothing to sneeze at, and I suspect DM would’ve gotten even better results here if they had started with Gopher rather than their from-scratch dataset as their compute budget would stretch much further and they could do much less brute-force rejection sampling of program candidates.
  
  It’s also direction-specific. Natural language may wash out when you target Github… but does Github wash out when you target natural language? I will be curious to see if anyone tries using the coding models as the prior for language modeling instead of vice-versa, and if that leads to noticeable gains on the various reasoning-esque benchmarks.
  
  “Table 7 shows that pre-training on a natural language corpus slightly degrades performance compared to pre-training on Github.”
  
  But it does not show that pretraining on natural language plus Github is worse than Github-only. This is also what you’d expect from Copilot showing that GPT-3 (natural language) initialization trains much faster to the same performance level.
  - Jacob Pfau Feb 2, 2022, 9:02 PM
    4 points
    Parent
    
    I agree that the scaling laws for transfer paper already strongly suggested that pre-training would eventually not provide much in terms of performance gain. I remember doing a back-of-the-envelope for whether 2025 would still use pre-training (and finding it wouldn’t improve performance), but I certainly didn’t expect us to reach this point in early 2022. I also had some small, but significant uncertainty regarding how well the scaling laws result would hold up when switching dataset+model+modelsize, and so the AlphaCode data point is useful in that regard as well.
    
    As for the point on accelerating training, this makes intuitive sense to me, but it’s not clear to me how relevant this is? Figure 7 of Scaling Laws for Transfer shows that the compute needed to plateau on their largest models with and without pre-training looks to be within an OOM?
    - gwern Feb 2, 2022, 9:24 PM
      6 points
      Parent
      
      An OOM is nothing to sneeze at, especially when you can get it for free by training an off-the-shelf pretrained model (DM already trained a Gopher, it doesn’t cost any more to reuse!) exactly as you would otherwise, no compromises or dead ends like MoEs. Note that AlphaCode didn’t have the compute budget to do its approach optimally.
      - Daniel Kokotajlo Feb 2, 2022, 10:52 PM
        2 points
        Parent
        
        Why didn’t they use Gopher then for AlphaCode? Maybe Gopher wasn’t done training yet?
        gwern Feb 3, 2022, 1:44 AM
        17 points
        Parent
        
        Possibly, but the timelines don’t quite seem to line up. On Twitter, DMers are describing this as a 2-year project, implying AlphaCode started ~February 2020. GPT-3 wouldn’t come out until May 2020 and obviously Codex/Copilot didn’t come out until mid-2021, but there were already Transformers for code generation (even assistance ones like TabNine) and so this is pretty much the obvious way to keep going and ‘2 years’ is entirely plausible as the timespan. Now, Gopher is described as starting (or was it finishing?) training in December 2020, so it became available about half-way through: they had all of 2021 & January 2022 to drop Gopher into their training & evaluation framework. I know there’s always a lot of inertia and everything always takes longer than outsiders predict on projects this complicated (look at the acknowledgements and staff section)… but I think that’s probably enough time that they could have used Gopher if they had really wanted to, unless this project was very frontloaded and mostly done by the time Gopher came around and they spent most of 2021 doing stuff like writing it up or evaluating it?
        
        It seems equally plausible to me that they ran out of their allotted compute by the time Gopher came around and that even if it would be on net more efficient to train a Gopher, they had already spent their quota. DM doesn’t have an infinite budget and can’t pursue everything to the logical endpoint (like how AlphaStar was ignominiously dropped right around where it had added harder APM limits / was using the camera like a human / training on all races+maps, but was only human-pro-level and hadn’t AlphaGo’d humans yet).
      - Jacob Pfau Feb 2, 2022, 9:57 PM
        2 points
        Parent
        
        Yes, I agree certainly at 2025 training run prices, saving 2-5x on a compute run will be done whenever possible. For this reason, I’d like to see more predictions on my Metaculus question!
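gwern’s point above, that pretraining acts like an informative prior worth a roughly fixed number of samples and that enough target-domain data washes it out, can be illustrated with a toy Bayesian example. This is only an analogy and not something taken from the AlphaCode or scaling-laws papers: a Beta prior on a coin’s bias stands in for pretraining, the pseudo-counts `prior_heads`/`prior_tails` and the value of `TRUE_BIAS` are made up, and the data are idealised to match the true bias exactly.

```python
def posterior_mean(heads, tails, prior_heads=1.0, prior_tails=1.0):
    """Posterior mean of a coin's bias under a Beta(prior_heads, prior_tails) prior."""
    return (prior_heads + heads) / (prior_heads + prior_tails + heads + tails)

TRUE_BIAS = 0.7  # made-up "target task" parameter (assumption)

def gap_between_priors(n):
    """Gap in posterior mean between an informative and a uniform prior
    after n observations whose empirical frequency equals TRUE_BIAS."""
    heads = TRUE_BIAS * n
    tails = n - heads
    # "Pretrained": a prior worth 100 pseudo-samples, already centred on the truth.
    informative = posterior_mean(heads, tails, prior_heads=70.0, prior_tails=30.0)
    # "From scratch": a uniform Beta(1, 1) prior.
    uniform = posterior_mean(heads, tails)
    return abs(informative - uniform)

# The gap shrinks as the amount of target-domain data grows.
for n in (10, 1_000, 100_000):
    print(n, gap_between_priors(n))
```

With little data the informative prior gives a visibly better estimate; with a lot of data both priors end up in nearly the same place, which is the “washing out” described in the comment.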
- Algon Feb 2, 2022, 8:23 PM
  4 points
  Parent
  
  They chuck out larger files from Github (1MB+), and files with lines longer than 1000 characters, to exclude automatically generated code. I’m guessing the former is because programs that are too long just aren’t useful when your model’s context windows are tiny in comparison. They did also get rid of duplicates. Plus, they needed to avoid code published after the questions in the dataset were made, to avoid leaking the answer.
  As to natural language training, I suppose I’d agree that it is some evidence against the claim. But I’d say it would be strong evidence if they also trained AlphaCode on MassiveText and found little to no performance increase. I wouldn’t be surprised if it didn’t do much though.
  Edit: Oh, I just saw Table 7. Yeah, that’s pretty strong evidence against the claim that natural language corpuses are anywhere near as useful as code corpuses for this kind of stuff. But I am a little surprised that it added as much as it did.
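The cleaning steps Algon lists above can be sketched as a simple filter. The `SourceFile` record, its field names, the `clean` function, and the use of exact-hash de-duplication are all assumptions made for illustration; only the thresholds (1MB, 1000 characters) and the cutoff idea come from the comment, and the paper’s actual pipeline may differ.

```python
import hashlib
from dataclasses import dataclass
from datetime import date

MAX_FILE_BYTES = 1_000_000  # files of 1MB or more are dropped
MAX_LINE_CHARS = 1000       # very long lines suggest auto-generated code

@dataclass
class SourceFile:  # hypothetical record type
    content: str
    published: date

def clean(files, cutoff):
    """Yield files that pass the size, line-length, date and duplicate checks."""
    seen = set()
    for f in files:
        raw = f.content.encode("utf-8")
        if len(raw) >= MAX_FILE_BYTES:
            continue  # too large to be useful for a small context window
        if any(len(line) > MAX_LINE_CHARS for line in f.content.splitlines()):
            continue  # probably generated code
        if f.published >= cutoff:
            continue  # published on or after the cutoff: could leak answers
        digest = hashlib.sha256(raw).hexdigest()
        if digest in seen:
            continue  # exact duplicate of a file already kept
        seen.add(digest)
        yield f
```

A call such as `list(clean(files, cutoff=date(2021, 1, 1)))` would keep only files published before the (made-up) cutoff date that also pass the other checks.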
Daniel Kokotajlo Feb 2, 2022, 7:41 PM
4 points

Thanks, I’m grateful to you for reading the paper more closely than I did and reporting back interesting findings!
“Their best model solved 34.2% on newly made test questions in their dataset, as compared to the prior state of the art of 1-5% on existing datasets.”
Can you elaborate on this? Was the prior state of the art Codex? It sounds like you are saying Codex etc. solved 1-5% of various coding question datasets, and AlphaCode 41B solves 34.2% of their new coding question dataset, which is a huge leap forward if the datasets are comparably difficult (and in fact arguably the new dataset is more difficult?). Is this what you are saying? Where did you see that in the paper? I only skimmed it but the best I found was Figure 10 which didn’t quite contain the info I wanted.
- Algon Feb 2, 2022, 8:08 PM
  3 points
  Parent
  
  See the second paragraph from the bottom of the paper. What they’re saying is that their model solves 34.2% of questions on the dataset they built for the model. Old models couldn’t have been tested on this, as they predate it. Instead, they were tested on the APPS benchmark (amongst others). If you want to compare it to old models in a fair way, you need to test it on APPS. Which they did, though owing to dataset differences they couldn’t use certain parts of their methods. That’s what you see in Table 10.
  I’ll try to cover this in the post, but that’s probably going to be a couple of hours.
A Ray Feb 3, 2022, 12:22 AM
3 points

Should this other post be a separate linkpost for this? https://www.furidamu.org/blog/2022/02/02/competitive-programming-with-alphacode/#fnref:2

Feels like it covers the same ground, but is a personal description by an author, rather than the DeepMind presser.
- Algon Feb 3, 2022, 12:32 AM
  2 points
  Parent
  
  Thanks for mentioning that. I didn’t know it existed. But I’m not sure what you mean by “a separate linkpost”? Anyway, I’ll link to it at the beginning of the post.
  - A Ray Feb 3, 2022, 2:10 AM
    1 point
    Parent
    
    Oh great, thanks. I think I was just asking folks if they thought it should be discussed separately (since it is a different piece) or together with this one (since they’re describing the same research).
    - Algon Feb 3, 2022, 4:25 AM
      2 points
      Parent
      
      I don’t think it is worth spinning up a new article on it, but I’m a karma-junkie so my opinion is a little biased. More seriously, he doesn’t say much that the blog post doesn’t, or that you can’t see by going on the link to examples of AlphaCode’s code. But it is probably better presented.
superads91 Feb 3, 2022, 1:28 PM
2 points

How does this affect timelines? Does this make the prospect of AGI a lot nearer? I’m sorry, I’m just a lay person, but this has got me more scared than anything else. So now AI can finally efficiently build itself?
- Algon Feb 3, 2022, 2:21 PM
  3 points
  Parent
  
  It shortens timelines if you were surprised by how impressive this was, and by how few impressive insights seem required to give big performance increases. But how big an impact this has depends on your priors.
  To me, this thing isn’t that much more impressive than Codex in the kinds of problems it solved. It can perform some simple algorithms if you give it enough tries at a simple problem, but that is a long way away from being able to modify the kind of code it is made up of. Or even just producing anywhere near peak human insights into algorithms.
  Now you might wonder if that is just a result of the model being too small, the dataset too tiny or whatever. But I don’t think so. The changes they introduced are not going to make efficient use of compute and it seems like it is probably at the point where it won’t improve with much more data. Doubling all of the model’s resources and size is not going to get you double the results, or probably even a 50% increase. So I’m not that much more worried, i.e. my timelines didn’t just get cut in half by this.
  - superads91 Feb 3, 2022, 5:13 PM
    1 point
    Parent
    
    Nice. In what measure would you say it affected your timelines, like a rough approximation? And what would your timelines be at this point? I know this is hard to tell so no offense if you don’t wanna answer, but I’m indeed curious.
    - Algon Feb 3, 2022, 5:18 PM
      1 point
      Parent
      
      I don’t really know. I guess I’d say I feel like it might have shortened timelines by a few months to a year, but I think I had longer time horizons than most people on LessWrong. I know that’s not that helpful, but I don’t think I have an explicit date of when I think catastrophe will arrive.
  - maximkazhenkov Feb 3, 2022, 2:34 PM
    1 point
    Parent
    
    What do you mean by “give it enough tries”?
    - Algon Feb 3, 2022, 2:56 PM
      1 point
      Parent
      
      That is the sampling part of AlphaCode. When you ask it a question, the more samples you let it take as potential solutions, the more likely you are to get a correct answer. The success rate of this seems to grow log-linearly with the number of samples, which is pretty poor.
      - maximkazhenkov Feb 3, 2022, 7:06 PM
        1 point
        Parent
        
        Oh, I see, I thought you meant AlphaCode had some sort of recurrency (does it?), i.e. iterating over its answer like a human would when debugging code. Log-linear doesn’t sound bad, you wouldn’t expect a human to be able to write a correct program on the first try without debugging either, would you?
        Algon Feb 3, 2022, 7:21 PM
        2 points
        Parent
        
        It is pretty bad because the constant matters here. If you have a method which achieves a success rate (in %) of C log(1 + k), where k is the number of samples, then, if C = 1, you’d need 2^100 ~ 10^30 samples per inference to get all the competitive programming questions right. As it stands, C seems pretty small. For k=100000 you solve maybe 33% of problems. And it isn’t much better than k=1000.
        That’s leaving aside the question of whether you will continue to get log-linear improvements, which I find fairly unlikely. Though if you could get C to something like 10 or 20 with one or two innovations, I’d be scared or even terrified.
        Daniel Kokotajlo Feb 3, 2022, 10:14 PM
        2 points
        Parent
        
        To check my understanding:
        If C was 20, then 100,000 samples would be enough to solve approximately 100% of these problems?
        But instead C is something like 6-7, because 100,000 samples gets you to 33%?
        Algon Feb 3, 2022, 10:28 PM
        3 points
        Parent
        
        Yeah, basically, but I was using log_2(x) instead of log_10(x) in my made-up example. Here’s some actual data. What they mean by “unlimited attempts” is that there is no filter on the sample solutions, so they are all submitted to codebase. I expect false positives or incredibly slow solutions to be more likely than performance actually increasing without limit.
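The arithmetic in this exchange can be checked with a short sketch of the toy model from Algon’s reply, success rate (in %) = C · log(1 + k). The function names are hypothetical, the 33% at k = 100,000 is the rough figure quoted in the thread rather than a number taken from the paper, and the model is Algon’s made-up example, not a fitted scaling law.

```python
import math

def success_rate(k, c, base=2.0):
    """Toy model: percent of problems solved with k samples, capped at 100."""
    return min(100.0, c * math.log(1 + k, base))

def samples_needed(target_percent, c, base=2.0):
    """Invert the toy model: the k at which c * log_base(1 + k) hits the target."""
    return base ** (target_percent / c) - 1

def implied_c(k, observed_percent, base=2.0):
    """Constant C implied by observing a given solve rate at k samples."""
    return observed_percent / math.log(1 + k, base)

# C = 1 with log base 2: samples needed to reach 100% (about 2**100).
print(f"{samples_needed(100, c=1, base=2):.3e}")

# C implied by "33% at 100,000 samples" under each choice of base.
print(implied_c(100_000, 33, base=10))
print(implied_c(100_000, 33, base=2))
```

Under log base 10 the implied constant comes out between 6 and 7, matching Daniel’s reading; under log base 2, the base Algon says he used, it is roughly 2.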
Measure Feb 2, 2022, 9:26 PM
1 point


“(3% was SOTA on APPS, AlphaFold got 7%)”

“AlphaFold does about twice as well as Codex 12B”

Did you mix up AlphaFold and AlphaCode?
- Algon Feb 2, 2022, 11:27 PM
  1 point
  Parent
  
  Yes, that happened a couple of times as I was writing things. Thanks.
Bezzi Feb 6, 2022, 10:02 AM
0 points

“they ranked within the top 54% of the contestants.”
“If we limit the ratings to users who have participated in at least 1 contest in the last 6 months, AlphaCode is ranked (predict it first!) in the top 28%.”
I don’t get how this comparison is supposed to be meaningful. The main constraint for human programmers in code contests is the time limit. I mean, the last challenge on Codeforces gives you 2 hours and 30 minutes to solve 6 problems, which is not exactly a ton of time. You need to be a very good programmer to completely solve all the problems within the deadline, but any decent programmer should be able to solve the same problems within a week. I would claim that having more time to write dramatically improves your chance of producing correct code, in a way that having more time to make a chess move does not.
We can argue that they try to model this with the limit of 10 submissions per problem, but following the link I read this:
“Removing the limit of 10 submissions can increase the solve rate further, reaching 49.7th percentile with an average of 29 submissions per solved problem.”
It seems that the analogue of “give it plenty of time to write” makes the rank shift from top 54% to top 49.7%. Which is… not incredibly impressive? Did I forget to read some details that would make the comparison more meaningful?
- Algon Mar 10, 2022, 7:50 PM
  1 point
  Parent
  
  The 10 submissions per problem is meaningful, I think. I can see myself making 10 submissions to check my code against test cases. But I agree that the peak performance can require a lot more submissions, and make it incomparable to humans in this context. But you know, if you eliminate the time limit for some of these problems, I can see some people never getting the solution. So I don’t think this is too different. But I do expect the AI to perform worse in the no time limit case.
