DeepMind claims its new code-generating system is competitive with human programmers


Last year, San Francisco-based research lab OpenAI released Codex, an AI model for translating natural language commands into app code. The model, which powers GitHub’s Copilot feature, was heralded at the time as one of the most powerful examples of machine programming, the category of tools that automates the development and maintenance of software.

Not to be outdone, DeepMind — the AI lab backed by Google parent company Alphabet — claims to have improved upon Codex in key areas with AlphaCode, a system that can write “competition-level” code. In contests hosted on Codeforces, a competitive programming platform, DeepMind claims that AlphaCode achieved an average ranking within the top 54.3% across 10 recent contests with more than 5,000 participants each.


DeepMind principal research scientist Oriol Vinyals says it’s the first time that a computer system has achieved such a competitive level in programming competitions. “AlphaCode [can] read the natural language descriptions of an algorithmic problem and produce code that not only compiles, but is correct,” he added in a statement. “[It] indicates that there is still work to do to achieve the level of the highest performers, and advance the problem-solving capabilities of our AI systems. We hope this benchmark will lead to further innovations in problem-solving and code generation.”

Learning to code with AI

Machine programming has been supercharged by AI over the past several months. During its Build developer conference in May 2021, Microsoft detailed a new feature in Power Apps that taps OpenAI’s GPT-3 language model to assist people in choosing formulas. Intel’s ControlFlag can autonomously detect errors in code. And Facebook’s TransCoder converts code from one programming language into another.

The applications are vast in scope — explaining why there’s a rush to create such systems. According to a study from the University of Cambridge, at least half of developers’ efforts are spent debugging, which costs the software industry an estimated $312 billion per year. AI-powered code suggestion and review tools promise to cut development costs while allowing coders to focus on creative, less repetitive tasks — assuming the systems work as advertised.

Like Codex, AlphaCode — the largest version of which contains 41.4 billion parameters, roughly quadruple the size of Codex — was trained on a snapshot of public repositories on GitHub in the programming languages C++, C#, Go, Java, JavaScript, Lua, PHP, Python, Ruby, Rust, Scala, and TypeScript. AlphaCode’s training dataset was 715.1GB — about the same size as Codex’s, which OpenAI estimated to be “over 600GB.”

In machine learning, parameters are the part of the model that’s learned from historical training data. Generally speaking, the correlation between the number of parameters and sophistication has held up remarkably well.

Architecturally, AlphaCode is what’s known as a Transformer-based language model — similar to Salesforce’s code-generating CodeT5. The Transformer architecture is made up of two core components: an encoder and a decoder. The encoder is a stack of layers that processes input data, like text and images, one layer at a time. Each encoder layer generates encodings that capture which parts of the input are relevant to each other, then passes those encodings on to the next layer, up through the final encoder layer.
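The “which parts of the input are relevant to each other” step is attention. A toy, single-head, plain-Python sketch (no learned projection matrices, made-up 2-d token vectors — purely illustrative, not AlphaCode’s actual implementation):

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention: each query scores every key,
    then outputs a weighted average of the value vectors."""
    d = len(keys[0])
    outputs = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        weights = softmax(scores)  # weights sum to 1
        outputs.append([sum(w * v[j] for w, v in zip(weights, values))
                        for j in range(len(values[0]))])
    return outputs

# Self-attention: three toy 2-d token embeddings attend over themselves.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out = attention(tokens, tokens, tokens)
print(out)
```

Because the weights sum to one, each output vector is a blend of the inputs, weighted by similarity — that is all “relevance” means here.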

Creating a new benchmark

Transformers typically undergo semi-supervised learning: unsupervised pretraining followed by supervised fine-tuning. Sitting between supervised and unsupervised learning, semi-supervised learning accepts data that’s only partially labeled, or where the majority of the data lacks labels. In this case, Transformers are first exposed to unlabeled data for which no previously defined labels exist. During fine-tuning, they then train on labeled datasets so they learn to accomplish particular tasks like answering questions, analyzing sentiment, and paraphrasing documents.
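The two-phase recipe can be sketched with a deliberately tiny stand-in model (a bigram counter — real Transformers learn dense parameters, and the function names here are illustrative, not DeepMind’s API):

```python
from collections import defaultdict, Counter

class ToyLM:
    """Toy bigram 'language model': counts which token follows which."""
    def __init__(self):
        self.counts = defaultdict(Counter)

    def learn(self, context, target):
        if context:
            self.counts[context[-1]][target] += 1

    def predict(self, context):
        follows = self.counts.get(context[-1])
        return follows.most_common(1)[0][0] if follows else None

def pretrain(model, corpus):
    # Phase 1: self-supervised next-token prediction on unlabeled text --
    # the "labels" are just the next tokens, so no human annotation needed.
    for tokens in corpus:
        for i in range(1, len(tokens)):
            model.learn(tokens[:i], tokens[i])

def fine_tune(model, pairs):
    # Phase 2: supervised (prompt, solution) pairs, e.g. a problem
    # statement paired with a known-correct solution.
    for prompt, solution in pairs:
        seq = prompt + solution
        for i in range(len(prompt), len(seq)):
            model.learn(seq[:i], seq[i])

lm = ToyLM()
pretrain(lm, [["print", "(", "x", ")"], ["print", "(", "y", ")"]])
fine_tune(lm, [(["sort", "list"], ["sorted", "(", "xs", ")"])])
print(lm.predict(["print"]))  # prints "("
```

The point is only the shape of the pipeline: the same `learn` step is driven first by raw text, then by curated input/output pairs.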

In AlphaCode’s case, DeepMind fine-tuned and tested the system on CodeContests, a new dataset the lab created that includes problems, solutions, and test cases scraped from Codeforces with public programming datasets mixed in. DeepMind also tested the best-performing version of AlphaCode — an ensemble of the 41-billion-parameter model and a 9-billion-parameter model — on actual programming tests on Codeforces, running AlphaCode live to generate solutions for each problem.

On CodeContests, given up to a million samples per problem, AlphaCode solved 34.2% of problems. And on Codeforces, DeepMind claims it was within the top 28% of users who’ve participated in a contest within the last six months in terms of overall performance.
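Those “up to a million samples per problem” work by brute force plus filtering: generate many candidate programs, then keep only the ones that pass the problem’s public example tests. A minimal sketch (the toy problem, candidates, and `keep` cutoff are made up for illustration; AlphaCode additionally clusters surviving programs by behavior):

```python
def passes_examples(program, examples):
    """Run a candidate (here: a Python callable) against the problem's
    example tests; a crash counts the same as a wrong answer."""
    try:
        return all(program(inp) == expected for inp, expected in examples)
    except Exception:
        return False

def filter_candidates(candidates, examples, keep=10):
    # Keep only candidates consistent with every example test.
    survivors = [c for c in candidates if passes_examples(c, examples)]
    return survivors[:keep]

# Toy problem: return the maximum of a non-empty list.
examples = [([1, 3, 2], 3), ([5], 5)]
candidates = [
    lambda xs: xs[0],           # wrong: fails on [1, 3, 2]
    lambda xs: max(xs),         # correct
    lambda xs: sorted(xs)[-1],  # also correct, duplicate behavior
]
good = filter_candidates(candidates, examples)
print(len(good))  # prints 2
```

Filtering on example tests is cheap but coarse — hidden test cases can still reject survivors, which is why so many samples are needed per problem.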

“The latest DeepMind paper is once again an impressive feat of engineering that shows that there are still impressive gains to be had from our current Transformer-based models with ‘just’ the right sampling and training tweaks and no fundamental changes in model architecture,” Connor Leahy, a member of the open AI research effort EleutherAI, told VentureBeat via email. “DeepMind brings out the full toolbox of tweaks and best practices by using clean data, large models, a whole suite of clever training tricks, and, of course, lots of compute. DeepMind has pushed the performance of these models far faster than even I would have expected. The 50th percentile competitive programming result is a huge leap, and their analysis shows clearly that this is not ‘just memorization.’ The progress in coding models from GPT3 to codex to AlphaCode has truly been staggeringly fast.”

Limitations of code generation

Machine programming is by no stretch a solved science, and DeepMind admits that AlphaCode has limitations. For example, the system doesn’t always produce code that’s syntactically correct in every language, particularly in C++. AlphaCode also performs worse on more challenging code, such as problems requiring dynamic programming, a technique that solves a complex problem by breaking it into overlapping subproblems.
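For reference, dynamic programming is the kind of contest staple at issue. A canonical example is minimum coin change, where each subproblem’s answer is built bottom-up from smaller ones (the coin values below are arbitrary):

```python
def min_coins(coins, amount):
    """Fewest coins summing to `amount`, or -1 if impossible.
    dp[a] holds the fewest coins that make amount a, with dp[0] = 0."""
    INF = float("inf")
    dp = [0] + [INF] * amount
    for a in range(1, amount + 1):
        for c in coins:
            # Reuse the already-solved subproblem for amount a - c.
            if c <= a and dp[a - c] + 1 < dp[a]:
                dp[a] = dp[a - c] + 1
    return dp[amount] if dp[amount] != INF else -1

print(min_coins([1, 5, 11], 15))  # prints 3  (5 + 5 + 5)
```

The difficulty for a code generator is that the recurrence — which subproblems to define and how they combine — has to be invented per problem, not pattern-matched from the statement.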

AlphaCode might be problematic in other ways, as well. While DeepMind didn’t probe the model for bias, code-generating models including Codex have been shown to amplify toxic and flawed content in training datasets. For example, Codex can be prompted to write “terrorist” when fed the word “Islam,” and generate code that appears to be superficially correct but poses a security risk by invoking compromised software and using insecure configurations.

Systems like AlphaCode — which, it should be noted, are expensive to produce and maintain — could also be misused, as recent studies have explored. Researchers at Booz Allen Hamilton and EleutherAI trained a language model called GPT-J to generate code that could solve introductory computer science exercises, successfully bypassing a widely used plagiarism detection tool. At the University of Maryland, researchers discovered that it’s possible for current language models to generate false cybersecurity reports that are convincing enough to fool leading experts.

It’s an open question whether malicious actors will use these types of systems in the future to automate malware creation at scale. For that reason, Mike Cook, an AI researcher at Queen Mary University of London, disputes the idea that AlphaCode brings the industry closer to “a problem-solving AI.”

“I think this result isn’t too surprising given that text comprehension and code generation are two of the four big tasks AI have been showing improvements at in recent years … One challenge with this domain is that outputs tend to be fairly sensitive to failure. A wrong word or pixel or musical note in an AI-generated story, artwork, or melody might not ruin the whole thing for us, but a single missed test case in a program can bring down space shuttles and destroy economies,” Cook told VentureBeat via email. “So although the idea of giving the power of programming to people who can’t program is exciting, we’ve got a lot of problems to solve before we get there.”

If DeepMind can solve these problems — and that’s a big if — it stands to make a cozy profit in a constantly-growing market. Of the practical domains the lab has recently tackled with AI, like weather forecasting, materials modeling, atomic energy computation, app recommendations, and datacenter cooling optimization, programming is among the most lucrative. Even migrating an existing codebase to a more efficient language like Java or C++ commands a princely sum. For example, the Commonwealth Bank of Australia spent around $750 million over the course of five years to convert its platform from COBOL to Java.

“I can safely say the results of AlphaCode exceeded my expectations. I was skeptical because even in simple competitive problems it is often required not only to implement the algorithm, but also (and this is the most difficult part) to invent it,” Codeforces founder Mike Mirzayanov said in a statement. “AlphaCode managed to perform at the level of a promising new competitor. I can’t wait to see what lies ahead.”
 
Time for Boeing to save even more money by not hiring real programmers and having more planes run into the ground.
 
Programming will now be less about actual programming and more about rummaging through AI generated gibberish.

Still overworked and underpaid while HR and Marketing take all the money, and when something breaks nobody knows why.
 
even STEM is being automated away, college really is worthless.

For example, Codex can be prompted to write “terrorist” when fed the word “Islam,” and generate code that appears to be superficially correct but poses a security risk by invoking compromised software and using insecure configurations.
Don’t worry about it, the learning machine is too racist to pass the corporate sniff test.
 
For example, the system doesn’t always produce code that’s syntactically correct for each language, particularly in C++.
Good to see that not even the machines bother with making sure the code actually compiles before pushing to prod. Based.

AlphaCode also performs worse at generating challenging code, such as that required for dynamic programming, a technique for solving complex mathematical problems.
>dynamic programming
>challenging code

I think we're safe for a little while longer, friends.
 
Still overworked and underpaid while HR and Marketing take all the money, and when something breaks nobody knows why.
Been a couple of years now, but I remember seeing a job posting offering $250,000/year + benefits for anyone who knows Ruby. Apparently there's plenty of big companies out there who refuse to update their old mainframes, and would rather find a needle in a haystack and pay that needle very well, instead of updating shit that should've been retired 50 years ago.

Personal anecdote: I remember sitting in on a meeting where we had to ask for money to update stuff, and literally every C-level in the office was looking at my boss like he's a leper and asking hundreds of questions about "Why can't it just work?" Even when going the logical route of "We need X-Software for compliance. X-Software won't run on our outdated shit, so we need Y-Hardware. If we don't get X-Software and Y-Hardware, we're out of compliance, and if the board rolls through and sees this shit, WHICH THEY WILL, the company will be penalized hundreds-of-thousands of dollars a day until it's fixed, and this isn't something that can be fixed in a day." - "So why can't you just make it work?"
 
This is kind of like how you can teach a robot to build a house. Most houses, at their core, are just 2x4s nailed together and then drywall nailed onto that. Easy, rote, the perfect job for a machine. The hard part is when you get into the details. Wiring, plumbing, non-standard openings and angles, custom pieces. They make up a small part of the actual house, but do them wrong and shit gets bad in a hurry.

The best AI can do is to copy. It can't innovate, it can't be creative, and it certainly can't interpret design instructions unless they're written in a very specific, very detailed way. The kind of code you'd get out of something like this would be India's finest spaghetti.

DeepMind claims that AlphaCode achieved an average ranking within the top 54.3% across 10 recent contests with more than 5,000 participants each.
I don't claim to be a great, or even good programmer, but I will say that this is not very impressive because of how many incredibly shitty coders there are. The bottom 50% of code isn't even usable, and about half of the top 50% only works by accident. Even on Stack Overflow, the supposed meeting place of top programmers, most of the highly upvoted code you'll find literally does not work at all, and what does work is convoluted garbage.

Top half would be an achievement in something like surgery or civil engineering, but when it comes to coding, if you're in the top half it just means you're literate.
 
HERE'S A SHITTY PROTOTYPE THAT DOES ONLY HALF OF WHAT WE CLAIM IT DOES, AND IS ONLY CONSIDERED "SUCCESSFUL" BY OUR OWN VERY NARROW DEFINITION OF WHAT CONSTITUTES SUCCESS. PLEASE GIVE US YOUR MONEY SO WE CAN STEAL IT INNOVATE.
t. almost every tech company since 2010
 
The actual research paper is here:

So the algorithm is:
1. Feed all of Github into an AI (in a dress rehearsal for "I Have No Mouth And I Must Scream")
2. Add in additional data on coding contest winners, which some human has already tagged with metadata about what kind of algorithm is being used, etc
3. Use the AI to generate a few million potential solutions based on the problem statement
4. Throw out all the ones that don't pass the example tests
5. Filter the remaining few thousand programs down further by various rules of thumb, to eliminate duplicates or unlikely solutions.

So... yeah, OK, but this seems like a lot of work to solve one code-golf problem.
There's something inelegant, in my eyes, about having an AI just spit out programs with a 99.9% failure rate and hoping there's some gold buried in the other 0.1%.

The best AI can do is to copy. It can't innovate, it can't be creative, and it certainly can't interpret design instructions unless they're written in a very specific, very detailed way. The kind of code you'd get out of something like this would be India's finest spaghetti.
They have the AI reading the instructions from the code contest page the same as a human would (example on page 4), and that much is impressive. You can see examples of the code produced in the paper (page 23)
As to how much is just copied from the input, see page 21 for their opinion on the issue.
 
This poor journo seems pretty out of their element regurgitating this press release.

This thing is only going to be as good as the person ordering it, it's just another abstraction layer like a high level language over machine code. I bet it's like an AI powered snippet library at best "Okay give me a loop over variable X, then if X is less than Y.... Write a method called Foo to convert Y to metric from imperial."

The devil is in the details. It's easy to say "I want a booking app," but that doesn't produce the app you want; that just gets you cookie-cutter shit you don't need a programmer for. There is so much more that needs to be thought about. When are you closed? What is the min/max booking time? What are the prices? What are the availability rules? Is there a wait list? Who approves the bookings, and how? How do you want to see and sort the bookings? Where and how do you want to store them? That's what programming is: tackling problems at the autistic level, because machines are autistic AF and they can't read your mind.

It might make programmers' lives slightly easier, maybe prevent a case or two of carpal tunnel syndrome, but it won't replace them.
 
These coding AI things have so far been glorified autocomplete/snippets, are typically a subscription service and also slow shit down cause it has to query the AI each time. I gave one a try, and it will spit out commonly used constructs for stuff, but you end up needing to double check it all anyway, it ain’t very useful.

As for getting AI’s to do complex tasks, it’s not going to happen for a while, it’s not really possible at this point to give an AI a list of requirements without first having it generate the entire output. It’d be like asking some rando to build a house, giving them a picture of a house, and then only telling them what is wrong after they’ve built an entire house.
 
The actual research paper is here:

So the algorithm is:
1. Feed all of Github into an AI (in a dress rehearsal for "I Have No Mouth And I Must Scream")
2. Add in additional data on coding contest winners, which some human has already tagged with metadata about what kind of algorithm is being used, etc
3. Use the AI to generate a few million potential solutions based on the problem statement
4. Throw out all the ones that don't pass the example tests
5. Filter the remaining few thousand programs down further by various rules of thumb, to eliminate duplicates or unlikely solutions.

So... yeah, OK, but this seems like a lot of work to solve one code-golf problem.
There's something inelegant, in my eyes, about having an AI just spit out programs with a 99.9% failure rate and hoping there's some gold buried in the other 0.1%.


They have the AI reading the instructions from the code contest page the same as a human would (example on page 4), and that much is impressive. You can see examples of the code produced in the paper (page 23)
As to how much is just copied from the input, see page 21 for their opinion on the issue.
Its ability to read the instructions is probably the most impressive thing here, although as you said it barely seems to actually be solving the problem on purpose. It's generating so much random garbage that, by sheer chance, some of that garbage ends up being a middle-tier solution. It's monkeys on typewriters. So is it actually reading the instructions? How would we even know?

As for the copying, I don't literally mean copy-pasting existing code. I mean taking existing code and changing around some names so that it's "unique" even though it's really not. If I downloaded some code and simply refactored the name of everything, this paper would consider my code exceptionally unique because they're just string matching, and my code wouldn't match with anything else save for tiny snippets.
 
Give 10 monkeys a typewriter and they'll compete with the average pajeet.
 
I learned this in my CIS class, and as someone with a username like mine: AI can only do so much to automate computer programming jobs. It won't even touch the computer engineering jobs that create most of these new advancements.

”Deep Mind” is just a funny trend, but an interesting phase to say the least.
 
The cool thing about result driven ideas is you can just show the result and not endlessly pontificate about how your great idea actually works in reality.
 
Isn't this just moving the skillset from coding programs to setting up AI instructions? The system needs instructions and parameters, and then someone needs to audit the result, right? Who am I kidding. They're gonna hire pajeets to write instructions and then blindly trust the output.
 